Senior Site Reliability Engineer

210k – 240k | San Francisco, CA | DevOps / SRE | Onsite | 8+ YOE

Summary

Builds and maintains scalable infrastructure for real-time analytics and ML workloads, focusing on reliability, automation, CI/CD, monitoring, and incident response. Requires 8+ years SRE/DevOps experience with Kubernetes, Terraform, Linux, and observability tools.

About the role

Key Responsibilities

  • Design, build, and maintain scalable infrastructure to support real-time analytics and machine learning workloads
  • Improve system reliability and performance through automation, observability, and proactive capacity planning
  • Own and evolve CI/CD pipelines, deployment automation, rollback mechanisms, and config management
  • Implement and maintain monitoring, alerting, and incident response processes (SLOs, runbooks, on-call rotations)
  • Collaborate across engineering and data science teams to drive a culture of performance and reliability
  • Ensure security, compliance, and operational readiness across our cloud infrastructure
  • Drive post-incident analysis and continuous improvement initiatives

What Will Help You Succeed

  • 8+ years of experience in SRE, DevOps, or infrastructure engineering roles
  • 5+ years of experience with datacenter operations and/or system and network administration
  • Experience with containerization (Docker) and orchestration (Kubernetes)
  • Strong knowledge of Linux systems, networking, and systems performance tuning
  • Solid understanding of infrastructure-as-code (e.g., Terraform, Ansible)
  • Good programming skills and the ability to apply sound coding principles to IaC and scripting code, using tools and languages such as Terraform, Ansible, Bash (shell scripting), and/or Python
  • Experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
  • Proficiency with CI/CD tools and pipelines (e.g., GitHub Actions, ArgoCD)
  • Ability to debug complex systems and automate solutions in scripting languages
  • Excellent communication skills and the ability to work cross-functionally

Nice-to-Have

  • Experience with cloud and managed services (e.g., AWS)
  • Experience supporting data-intensive platforms (Spark, Airflow, Kafka, etc.)
  • Familiarity with security practices for cloud-native applications and infrastructure
  • Experience in high-compliance or SOC 2 environments

Skills

Kubernetes, Docker, Terraform, Ansible, Linux, Python, Bash, Prometheus, Grafana, Datadog, GitHub Actions, ArgoCD, AWS, OpenTelemetry