Senior Site Reliability Engineer
210k – 240k | San Francisco, CA | DevOps / SRE | Onsite | 8+ YOE
Summary
Builds and maintains scalable infrastructure for real-time analytics and ML workloads, focusing on reliability, automation, CI/CD, monitoring, and incident response. Requires 8+ years of SRE/DevOps experience with Kubernetes, Terraform, Linux, and observability tools.
About the role
Key Responsibilities
- Design, build, and maintain scalable infrastructure to support real-time analytics and machine learning workloads
- Improve system reliability and performance through automation, observability, and proactive capacity planning
- Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and config management
- Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call rotations)
- Collaborate across engineering and data science teams to drive a culture of performance and reliability
- Ensure security, compliance, and operational readiness across our cloud infrastructure
- Drive post-incident analysis and continuous improvement initiatives
What Will Help You Succeed
- 8+ years of experience in SRE, DevOps, or infrastructure engineering roles
- 5+ years of experience with datacenter operations and/or system and network administration
- Experience with containerization (Docker) and orchestration (Kubernetes)
- Strong knowledge of Linux systems, networking, and systems performance tuning
- Solid understanding of infrastructure-as-code (e.g., Terraform, Ansible)
- Good programming skills and the ability to apply sound coding principles to IaC and scripting code, using tools and languages such as Terraform, Ansible, Bash (shell scripting), and/or Python
- Experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
- Proficiency with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD)
- Ability to debug complex systems and automate solutions in scripting languages
- Excellent communication skills and the ability to work cross-functionally
Nice-to-Have
- Experience with cloud and managed services (e.g., AWS)
- Experience supporting data-intensive platforms (Spark, Airflow, Kafka, etc.)
- Familiarity with security practices for cloud-native applications and infrastructure
- Experience in high-compliance or SOC 2 environments
Skills
Kubernetes, Docker, Terraform, Ansible, Linux, Python, Bash, Prometheus, Grafana, Datadog, GitHub Actions, ArgoCD, AWS, OpenTelemetry