Site Reliability Engineer - Platform Infrastructure Engineering
169k – 193k | Mountain View, CA | McLean, VA | DevOps / SRE | Onsite | 3+ YOE
Summary
The Site Reliability Engineer builds automation, observability, and infrastructure to ensure scalable, secure platform operations in AI-accelerated environments. The role requires 3+ years of SRE/DevOps experience, cloud expertise, programming proficiency, and a bachelor's degree.
About the role
Responsibilities
- Build and maintain automated reliability tooling, infrastructure as code, and observability systems that enhance uptime and service performance.
- Develop monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, OpenTelemetry) to detect and remediate issues proactively.
- Implement automated architectural reviews and reliability guardrails for agent-developed applications to ensure machine-generated code meets long-term maintainability and performance standards.
- Partner with engineering teams to design and implement scalable, fault-tolerant systems that meet defined SLIs and SLOs.
- Automate repetitive operational tasks and develop self-healing and auto-remediation mechanisms to minimize human intervention.
- Participate in on-call rotations and lead incident response efforts, performing post-incident reviews and driving systemic improvements.
- Improve the deployment and release process using CI/CD pipelines and progressive delivery techniques to ensure stability and safety.
- Champion observability, reliability, and operational readiness reviews as part of the development process.
- Collaborate with Security and Compliance teams to ensure production systems meet FedRAMP, NIST, and internal policy requirements.
- Contribute to documentation, runbooks, and internal tooling to enhance knowledge sharing and operational maturity across teams.
Minimum Qualifications
- Bachelor’s degree in Computer Science, Software Engineering, or a related technical field.
- 3-5 years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
- 2+ years of hands-on experience managing and scaling services in cloud environments such as AWS, GCP, or Azure.
- 1+ years of proficiency in at least one modern programming language (e.g., Java, Go, Python, Ruby, JavaScript).
Preferred Qualifications
- Strong understanding of containerization and orchestration technologies (Docker, Kubernetes).
- Experience implementing and maintaining CI/CD pipelines and automation frameworks.
- Working knowledge of observability systems—metrics, tracing, logging, and alerting.
- Experience building automated recovery, failover, or chaos-engineering systems to validate reliability.
- Familiarity with event-driven architecture and asynchronous processing systems.
- Knowledge of distributed systems design, load balancing, and performance optimization.
- Exposure to infrastructure-as-code tools (Terraform, Pulumi, Ansible) and GitOps practices.
- Understanding of security and compliance frameworks (FedRAMP, SOC2, or NIST 800-53).
- Strong analytical and troubleshooting skills across the stack—from network to application layer.
- Excellent communication and documentation skills, with a focus on cross-team collaboration and continuous improvement.
- Experience using AI agentic coding assistants and deploying custom AI agents or automated workflows into production environments.
Compensation
- Base salary: $168,926 – $192,500 USD (Mountain View, CA)
- Comprehensive benefits including medical, dental, vision, 401(k) match, unlimited PTO, parental leave, and more.
Skills
Kubernetes, Docker, Prometheus, Grafana, OpenTelemetry, Terraform, Pulumi, Ansible, AWS, GCP, Azure, Python, Go, Java, CI/CD pipelines