Senior/Staff Site Reliability Engineer
175k – 230k | New York, NY | DevOps / SRE | Hybrid | 7+ YOE
Summary
Leads the design, operation, and evolution of highly reliable, scalable production infrastructure, including cloud, databases, and observability. Drives incident response, SRE practices, automation, and capacity planning for large-scale distributed systems. Requires 7-12+ years in SRE/infrastructure engineering.
About the role
Responsibilities
- Design and evolve highly reliable system architectures, ensuring high availability, fault tolerance, and scalability across Sage's production infrastructure.
- Lead complex incident response efforts, coordinating across engineering teams to quickly diagnose and resolve production issues while driving thorough post-incident reviews and long-term reliability improvements.
- Define and implement organization-wide observability practices, including metrics, logging, tracing, and actionable alerting, to ensure strong visibility into system health.
- Establish and maintain reliability standards, including defining SLIs, SLOs, and error budgets, and partnering with engineering teams to integrate these practices into the software development lifecycle.
- Drive automation and infrastructure improvements that reduce operational toil and improve the efficiency and reliability of deployments, monitoring, and operational workflows.
- Partner with engineering teams on system design and architecture reviews, ensuring reliability, scalability, and operational best practices are considered early in the development process.
- Evolve Sage's cloud infrastructure, including networking, compute, storage, and security practices, to support scalable and resilient systems.
- Operate and improve critical data infrastructure, ensuring high availability, performance, backup strategies, and disaster recovery processes for production databases.
- Lead capacity planning and auto-scaling efforts, ensuring infrastructure and systems scale effectively as product usage grows.
- Build internal tooling and platforms that improve the developer experience, simplify debugging, and enable safer and more reliable deployments.
Qualifications
- 7-12+ years of experience in software engineering, infrastructure engineering, or site reliability engineering, operating large-scale distributed systems in production.
- Experience operating and supporting edge or device-based systems, including managing connectivity, observability, remote updates, and reliability for distributed hardware deployments such as IoT or field devices.
- Strong networking fundamentals, including experience debugging distributed system issues across load balancers, DNS, TLS, and VPC networking (for example, Amazon Virtual Private Cloud or a similar cloud networking environment).
- Experience operating and scaling production databases, including performance tuning, replication, backup/recovery strategies, and high availability for systems such as PostgreSQL, MySQL, or distributed databases.
- Deep expertise in cloud infrastructure, such as Amazon Web Services or Google Cloud Platform.
- Strong experience designing and operating highly available systems, including strategies for redundancy, failover, disaster recovery, and capacity planning.
- Expertise in containerization and orchestration, particularly with Kubernetes and modern container platforms.
- Advanced observability and monitoring skills, using tools such as Datadog, Prometheus, or Grafana.
- Strong programming ability in languages commonly used for infrastructure and reliability engineering (Go, Python, or Java), with experience building internal tooling and automation.
- Deep knowledge of infrastructure-as-code practices, including tools like Terraform or Pulumi.
- Proven experience leading reliability initiatives, such as defining SLOs/SLIs, improving incident response processes, and driving post-incident reviews.
- Ability to influence engineering teams across the organization, guiding best practices for reliability, scalability, and operational excellence.
- Strong incident management and production debugging skills, with experience coordinating responses to complex outages and improving long-term system resilience.
Preferred Qualifications
- Experience introducing and scaling SRE practices in early-stage or high-growth organizations, helping transition teams from reactive operations to proactive reliability engineering.
- Experience designing disaster recovery and business continuity strategies, including multi-region deployments, backup validation, and recovery testing for critical systems.
Benefits and Pay
- Expected annual salary range: $175,000-$230,000 USD, depending on level of expertise, experience, and interview performance.
- Competitive base compensation along with stock options.
- Fully-paid health, dental, and vision insurance, plus other health benefits.
- Take-as-you-need time off policy, 7 paid holidays, and a company-wide winter break.
Skills
Kubernetes, AWS, Google Cloud, Terraform, Pulumi, Datadog, Prometheus, Grafana, Go, Python, PostgreSQL, MySQL, SLOs, SLIs, Infrastructure as Code