Senior/Staff Site Reliability Engineer

175k – 230k · New York, NY · DevOps / SRE · Hybrid · 7+ YOE · Apr 30

Summary

Leads design, operation, and evolution of highly reliable, scalable production infrastructure including cloud, databases, and observability. Drives incident response, SRE practices, automation, and capacity planning for large-scale distributed systems. Requires 7-12+ years in SRE/infrastructure engineering.

About the role

Responsibilities

Design and evolve highly reliable system architectures, ensuring high availability, fault tolerance, and scalability across Sage's production infrastructure.
Lead complex incident response efforts, coordinating across engineering teams to quickly diagnose and resolve production issues while driving thorough post-incident reviews and long-term reliability improvements.
Define and implement organization-wide observability practices, including metrics, logging, tracing, and actionable alerting to ensure strong visibility into system health.
Establish and maintain reliability standards, including defining SLIs, SLOs, and error budgets, and partnering with engineering teams to integrate these practices into the software development lifecycle.
Drive automation and infrastructure improvements that reduce operational toil and improve the efficiency and reliability of deployments, monitoring, and operational workflows.
Partner with engineering teams on system design and architecture reviews, ensuring reliability, scalability, and operational best practices are considered early in the development process.
Evolve Sage's cloud infrastructure, including networking, compute, storage, and security practices to support scalable and resilient systems.
Operate and improve critical data infrastructure, ensuring high availability, performance, backup strategies, and disaster recovery processes for production databases.
Lead capacity planning and auto-scaling efforts, ensuring infrastructure and systems scale effectively as product usage grows.
Build internal tooling and platforms that improve the developer experience, simplify debugging, and enable safer and more reliable deployments.

Qualifications

7-12+ years of experience in software engineering, infrastructure engineering, or site reliability engineering, operating large-scale distributed systems in production.
Experience operating and supporting edge or device-based systems, including managing connectivity, observability, remote updates, and reliability for distributed hardware deployments such as IoT or field devices.
Strong networking fundamentals, including experience debugging distributed system issues across load balancers, DNS, TLS, and VPC networking on platforms such as Amazon Virtual Private Cloud or comparable cloud networking environments.
Experience operating and scaling production databases, including performance tuning, replication, backup/recovery strategies, and high availability for systems such as PostgreSQL, MySQL, or distributed databases.
Deep expertise in cloud infrastructure, such as Amazon Web Services or Google Cloud Platform.
Strong experience designing and operating highly available systems, including strategies for redundancy, failover, disaster recovery, and capacity planning.
Expertise in containerization and orchestration, particularly with Kubernetes and modern container platforms.
Advanced observability and monitoring skills, using tools such as Datadog, Prometheus, or Grafana.
Strong programming ability in languages commonly used for infrastructure and reliability engineering (Go, Python, or Java), with experience building internal tooling and automation.
Deep knowledge of infrastructure-as-code practices, including tools like Terraform or Pulumi.
Proven experience leading reliability initiatives, such as defining SLOs/SLIs, improving incident response processes, and driving post-incident reviews.
Ability to influence engineering teams across the organization, guiding best practices for reliability, scalability, and operational excellence.
Strong incident management and production debugging skills, with experience coordinating responses to complex outages and improving long-term system resilience.

Preferred Qualifications

Experience introducing and scaling SRE practices in early-stage or high-growth organizations, helping transition teams from reactive operations to proactive reliability engineering.
Experience designing disaster recovery and business continuity strategies, including multi-region deployments, backup validation, and recovery testing for critical systems.

Benefits and Pay

Expected annual salary range: $175,000-$230,000 USD, depending on level of expertise, experience, and interview performance.
Competitive base compensation along with stock options.
Fully-paid health, dental, and vision insurance, plus other health benefits.
Take-as-you-need time-off policy, 7 paid holidays, and a company-wide winter break.

Skills

Kubernetes, AWS, Google Cloud, Terraform, Pulumi, Datadog, Prometheus, Grafana, Go, Python, PostgreSQL, MySQL, SLOs, SLIs, Infrastructure as Code
