Senior Site Reliability Engineer
156k – 288k | Atlanta, GA | Austin, TX | San Francisco, CA | Seattle, WA | Remote | 6+ YOE
Summary
The Senior SRE ensures the reliability, performance, and scalability of cloud infrastructure for edge-to-cloud database technology. The role leads incident management, builds observability with Datadog/Prometheus/Grafana, implements IaC with Terraform/Helm, and automates resilience on AWS/GCP/Azure. Requires 6+ years of SRE/DevOps experience.
About the role
Responsibilities
- Develop and maintain observability solutions using platforms like Datadog, Prometheus, and Grafana
- Take a leading role in incident management, including coordinating response efforts, troubleshooting issues, and identifying follow-up actions
- Partner with product engineering teams to architect reliable systems, recover from incidents, and learn from mistakes
- Work with teams to implement and maintain SLOs, monitoring, and alerting strategies that ensure reliability at scale
- Design and implement automation and support tooling to improve system resilience, maintain operational safety, and reduce operational overhead
- Lead the development and maintenance of runbooks, alert definitions, and incident response procedures
- Participate in on-call rotations to provide 24/7 support for critical production systems
Requirements
- 6+ years of experience in Site Reliability Engineering or similar DevOps roles focused on system reliability and incident management
- Strong experience with modern monitoring stacks including Prometheus, Grafana, and Datadog
- Experience in at least one systems programming language, such as Python, Go, Rust, C/C++, or Java
- Expertise with Infrastructure as Code tools, like Terraform and Helm
- Expertise with at least one major cloud service provider (AWS, GCP, Azure)
- Strong communication skills, with the ability to lead incident response and effectively collaborate across teams
- Willingness and experience engaging with on-call rotations and emergency response procedures
- A high degree of agency and a bias towards action: you identify problems and work autonomously to solve them
- Excellent problem-solving skills and a methodical approach to troubleshooting complex issues
Nice to Have
- Experience building multi-tenant, multi-cloud SaaS/DBaaS platforms
- 4+ years of hands-on experience architecting applications for cloud platforms and managing cloud-based infrastructure
- Knowledge of edge computing or mesh networking
- Experience implementing advanced observability practices (tracing, profiling) in distributed systems
- Experience working with globally distributed teams
- Proven experience in project management
Skills
Prometheus, Grafana, Datadog, Terraform, Helm, AWS, GCP, Azure, Python, Go