Site Reliability Engineer - Platform Infrastructure Engineering

169k – 193k · Mountain View, CA · McLean, VA · DevOps / SRE · Onsite · 3+ YOE · May 5

Summary

The Site Reliability Engineer builds automation, observability, and infrastructure to keep platform operations scalable and secure in AI-accelerated environments. Requires 3+ years of SRE/DevOps experience, cloud expertise, programming proficiency, and a bachelor's degree.

About the role

Responsibilities

Build and maintain automated reliability tooling, infrastructure as code, and observability systems that enhance uptime and service performance.
Develop monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, OpenTelemetry) to detect and remediate issues proactively.
Implement automated architectural reviews and reliability guardrails for agent-developed applications to ensure machine-generated code meets long-term maintainability and performance standards.
Partner with engineering teams to design and implement scalable, fault-tolerant systems that meet defined SLIs and SLOs.
Automate repetitive operational tasks and develop self-healing and auto-remediation mechanisms to minimize human intervention.
Participate in on-call rotations and lead incident response efforts, performing post-incident reviews and driving systemic improvements.
Improve the deployment and release process using CI/CD pipelines and progressive delivery techniques to ensure stability and safety.
Champion observability, reliability, and operational readiness reviews as part of the development process.
Collaborate with Security and Compliance teams to ensure production systems meet FedRAMP, NIST, and internal policy requirements.
Contribute to documentation, runbooks, and internal tooling to enhance knowledge sharing and operational maturity across teams.

Minimum Qualifications

Bachelor’s degree in Computer Science, Software Engineering, or a related technical field.
3-5 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
2+ years of hands-on experience managing and scaling services in cloud environments such as AWS, GCP, or Azure.
1+ years of proficiency in at least one modern programming language (e.g., Java, Go, Python, Ruby, JavaScript).

Preferred Qualifications

Strong understanding of containerization and orchestration technologies (Docker, Kubernetes).
Experience implementing and maintaining CI/CD pipelines and automation frameworks.
Working knowledge of observability systems—metrics, tracing, logging, and alerting.
Experience building automated recovery, failover, or chaos-engineering systems to validate reliability.
Familiarity with event-driven architecture and asynchronous processing systems.
Knowledge of distributed systems design, load balancing, and performance optimization.
Exposure to infrastructure-as-code tools (Terraform, Pulumi, Ansible) and GitOps practices.
Understanding of security and compliance frameworks (FedRAMP, SOC2, or NIST 800-53).
Strong analytical and troubleshooting skills across the stack—from network to application layer.
Excellent communication and documentation skills, with a focus on cross-team collaboration and continuous improvement.
Experience using AI agentic coding assistants and deploying custom AI agents or automated workflows into production environments.

Compensation

Base salary: $168,926 – $192,500 USD (Mountain View, CA)
Comprehensive benefits including medical, dental, vision, 401(k) match, unlimited PTO, parental leave, and more.

Skills

Kubernetes, Docker, Prometheus, Grafana, OpenTelemetry, Terraform, Pulumi, Ansible, AWS, GCP, Azure, Python, Go, Java, CI/CD pipelines

Similar roles at this salary range

Fivetran

Jun 18

Senior Site Reliability Engineer

Senior SRE responsible for production infrastructure reliability, incident response, deployment automation, and scaling SaaS systems on Kubernetes and major cloud platforms.

175k – 210k · Oakland, CA · DevOps / SRE · Hybrid · 5+ YOE · AWS · GCP

Dropbox

Jun 18

Senior Infrastructure Software Engineer, Storage Core

Senior engineer building and operating Dropbox's exabyte-scale distributed storage systems. Focus on replication, erasure coding, performance, and reliability in Go/Rust.

180k – 274k · United States · DevOps / SRE · Remote · 9+ YOE · Go · C++

Okta

Jun 17

Staff Site Reliability Engineer - Observability

Staff SRE focused on building and scaling a comprehensive observability platform on GCP using Terraform, Splunk, and Grafana. Requires 5+ years GCP observability experience and strong coding skills in Python or Go.

194k – 267k · Bellevue, WA +4 · DevOps / SRE · Hybrid · 5+ YOE · Go · GKE

Cribl

Jun 17

Sr Software Engineer, Storage

Senior Software Engineer on the Storage team building autoscaling, self-healing infrastructure-as-code systems that manage petabyte-scale telemetry storage on AWS.

175k – 205k · United States · DevOps / SRE · Remote · 5+ YOE · Go · S3

Grow Therapy

Jun 16

Senior Platform Reliability Engineer

Senior Platform Reliability Engineer establishing reliability standards, observability, and incident response practices across engineering teams. Requires 6+ years operating production systems at scale with AWS, Kubernetes, Terraform, and modern observability tooling.

182k – 250k · San Francisco, CA +2 · DevOps / SRE · Hybrid · 6+ YOE · AWS · EKS
