Site Reliability Engineer - Platform Infrastructure Engineering

169k – 193k | Mountain View, CA | McLean, VA | DevOps / SRE | Onsite | 3+ YOE

Summary

The Site Reliability Engineer builds automation, observability, and infrastructure to ensure scalable, secure platform operations in AI-accelerated environments. The role requires 3+ years of SRE/DevOps experience, cloud expertise, programming proficiency, and a bachelor's degree.

About the role

Responsibilities

  • Build and maintain automated reliability tooling, infrastructure as code, and observability systems that enhance uptime and service performance.
  • Develop monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, OpenTelemetry) to detect and remediate issues proactively.
  • Implement automated architectural reviews and reliability guardrails for agent-developed applications to ensure machine-generated code meets long-term maintainability and performance standards.
  • Partner with engineering teams to design and implement scalable, fault-tolerant systems that meet defined SLIs and SLOs.
  • Automate repetitive operational tasks and develop self-healing and auto-remediation mechanisms to minimize human intervention.
  • Participate in on-call rotations and lead incident response efforts, performing post-incident reviews and driving systemic improvements.
  • Improve the deployment and release process using CI/CD pipelines and progressive delivery techniques to ensure stability and safety.
  • Champion observability, reliability, and operational readiness reviews as part of the development process.
  • Collaborate with Security and Compliance teams to ensure production systems meet FedRAMP, NIST, and internal policy requirements.
  • Contribute to documentation, runbooks, and internal tooling to enhance knowledge sharing and operational maturity across teams.

Minimum Qualifications

  • Bachelor’s degree in Computer Science, Software Engineering, or a related technical field.
  • 3-5 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
  • 2+ years of hands-on experience managing and scaling services in cloud environments such as AWS, GCP, or Azure.
  • 1+ years of proficiency in at least one modern programming language (e.g., Java, Go, Python, Ruby, JavaScript).

Preferred Qualifications

  • Strong understanding of containerization and orchestration technologies (Docker, Kubernetes).
  • Experience implementing and maintaining CI/CD pipelines and automation frameworks.
  • Working knowledge of observability systems—metrics, tracing, logging, and alerting.
  • Experience building automated recovery, failover, or chaos-engineering systems to validate reliability.
  • Familiarity with event-driven architecture and asynchronous processing systems.
  • Knowledge of distributed systems design, load balancing, and performance optimization.
  • Exposure to infrastructure-as-code tools (Terraform, Pulumi, Ansible) and GitOps practices.
  • Understanding of security and compliance frameworks (FedRAMP, SOC2, or NIST 800-53).
  • Strong analytical and troubleshooting skills across the stack—from network to application layer.
  • Excellent communication and documentation skills, with a focus on cross-team collaboration and continuous improvement.
  • Experience using AI agentic coding assistants and deploying custom AI agents or automated workflows into production environments.

Compensation

  • Base salary: $168,926–$192,500 USD (Mountain View, CA)
  • Comprehensive benefits including medical, dental, vision, 401(k) match, unlimited PTO, parental leave, and more.

Skills

Kubernetes, Docker, Prometheus, Grafana, OpenTelemetry, Terraform, Pulumi, Ansible, AWS, GCP, Azure, Python, Go, Java, CI/CD pipelines