Senior/Staff Site Reliability Engineer

175k – 230k | New York, NY | DevOps / SRE | Hybrid | 7+ YOE

Summary

Leads design, operation, and evolution of highly reliable, scalable production infrastructure including cloud, databases, and observability. Drives incident response, SRE practices, automation, and capacity planning for large-scale distributed systems. Requires 7-12+ years in SRE/infrastructure engineering.

About the role

Responsibilities

  • Design and evolve highly reliable system architectures, ensuring high availability, fault tolerance, and scalability across Sage's production infrastructure.
  • Lead complex incident response efforts, coordinating across engineering teams to quickly diagnose and resolve production issues while driving thorough post-incident reviews and long-term reliability improvements.
  • Define and implement organization-wide observability practices, including metrics, logging, tracing, and actionable alerting to ensure strong visibility into system health.
  • Establish and maintain reliability standards, including defining SLIs, SLOs, and error budgets, and partnering with engineering teams to integrate these practices into the software development lifecycle.
  • Drive automation and infrastructure improvements that reduce operational toil and improve the efficiency and reliability of deployments, monitoring, and operational workflows.
  • Partner with engineering teams on system design and architecture reviews, ensuring reliability, scalability, and operational best practices are considered early in the development process.
  • Evolve Sage's cloud infrastructure, including networking, compute, storage, and security practices to support scalable and resilient systems.
  • Operate and improve critical data infrastructure, ensuring high availability, performance, backup strategies, and disaster recovery processes for production databases.
  • Lead capacity planning and auto-scaling efforts, ensuring infrastructure and systems scale effectively as product usage grows.
  • Build internal tooling and platforms that improve the developer experience, simplify debugging, and enable safer and more reliable deployments.

Qualifications

  • 7-12+ years of experience in software engineering, infrastructure engineering, or site reliability engineering, operating large-scale distributed systems in production.
  • Experience operating and supporting edge or device-based systems, including managing connectivity, observability, remote updates, and reliability for distributed hardware deployments such as IoT or field devices.
  • Strong networking fundamentals, including experience debugging distributed system issues across load balancers, DNS, TLS, and VPC networking within platforms like Amazon Virtual Private Cloud or similar cloud networking environments.
  • Experience operating and scaling production databases, including performance tuning, replication, backup/recovery strategies, and high availability for systems such as PostgreSQL, MySQL, or distributed databases.
  • Deep expertise in cloud infrastructure, such as Amazon Web Services or Google Cloud Platform.
  • Strong experience designing and operating highly available systems, including strategies for redundancy, failover, disaster recovery, and capacity planning.
  • Expertise in containerization and orchestration, particularly with Kubernetes and modern container platforms.
  • Advanced observability and monitoring skills, using tools such as Datadog, Prometheus, or Grafana.
  • Strong programming ability in languages commonly used for infrastructure and reliability engineering (Go, Python, or Java), with experience building internal tooling and automation.
  • Deep knowledge of infrastructure-as-code practices, including tools like Terraform or Pulumi.
  • Proven experience leading reliability initiatives, such as defining SLOs/SLIs, improving incident response processes, and driving post-incident reviews.
  • Ability to influence engineering teams across the organization, guiding best practices for reliability, scalability, and operational excellence.
  • Strong incident management and production debugging skills, with experience coordinating responses to complex outages and improving long-term system resilience.

Preferred Qualifications

  • Experience introducing and scaling SRE practices in early-stage or high-growth organizations, helping transition teams from reactive operations to proactive reliability engineering.
  • Experience designing disaster recovery and business continuity strategies, including multi-region deployments, backup validation, and recovery testing for critical systems.

Benefits and Pay

  • Expected annual salary range: $175,000-$230,000 USD, depending on level of expertise, experience, and interview performance.
  • Competitive base compensation along with stock options.
  • Fully-paid health, dental, and vision insurance, plus other health benefits.
  • Take-as-you-need time-off policy, 7 paid holidays, and a company-wide winter break.

Skills

Kubernetes, AWS, Google Cloud, Terraform, Pulumi, Datadog, Prometheus, Grafana, Go, Python, PostgreSQL, MySQL, SLOs, SLIs, Infrastructure as Code