Site Reliability Engineer

Owns production reliability for critical systems, builds the SRE function from scratch, and introduces modern practices such as SLIs/SLOs and error budgets. Requires 5+ years of SRE experience with large-scale distributed systems.

130k – 500k · San Francisco, CA · DevOps / SRE · On-site · 5+ YOE

About the role

What You’ll Do

  • Own reliability and production safety for core shared services and customer-facing systems.
  • Partner directly with infrastructure leadership to define SRE priorities, reliability standards, and production safety roadmap.
  • Repair and improve how our production systems are structured so they are stable, resource-efficient, isolated, and well-observed.
  • Introduce and champion modern SRE practices (e.g., incident response, postmortems, SLIs/SLOs) across engineering teams.
  • Collaborate with leverage engineering and applied AI teams to ensure sustainable growth.
  • Represent SRE best practices internally and help teams onboard onto production in a way that is safe, scalable, and consistent with SRE principles.

What We’re Looking For

  • Experience doing true SRE work (not just operations) across multiple roles or companies.
  • Deep familiarity with SRE practices as popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
  • 5+ years of SRE experience; 15+ years of overall experience is ideal for this first SRE hire.
  • Proven success operating systems at scale, with a strong understanding of the challenges of large, distributed production environments.
  • Strong collaboration skills; able to work efficiently with cross-functional engineering teams.
  • Ability to drive cultural change around reliability while remaining hands-on in building and fixing systems.
  • Comfort working in high-intensity, high-availability environments where uptime and production quality are critical.

Nice to Haves

  • Experience as a founding SRE or early SRE hire, standing up SRE practices and orgs from scratch.
  • Hands-on experience in the AWS ecosystem, Kubernetes, and modern IaC tooling (Terraform, Spacelift, etc.).

Skills

Kubernetes, AWS, Terraform, SRE Practices, SLIs/SLOs, Error Budgets, Incident Response, Postmortems, Distributed Systems, IaC

Similar roles

DevOps / SRE jobs

Linux Systems Engineer (USA)

Hands-on Linux Systems Engineer who builds and maintains bare-metal servers, manages storage such as ZFS, automates with Ansible and Bash, and ensures production reliability. Requires 3+ years of Linux experience, physical server management, and on-call rotation with data center travel.

130k – 150k · Stamford, CT +1 · DevOps / SRE · On-site · 3+ YOE · ZFS · Bash

Software Engineer - Developer Infrastructure

Develop and maintain developer tooling and infrastructure for Nominal's platform, scaling across air-gapped, cloud, and on-prem environments. Requires 4+ years of experience with cloud services, Docker, Kubernetes, and CI/CD, plus the ability to mentor engineers.

130k – 230k · New York, NY +2 · DevOps / SRE · On-site · 4+ YOE · AWS · GCP

Vault Application Engineer/Administrator (HashiCorp)

Designs, deploys, and manages HashiCorp Vault clusters for secure secret management in hybrid on-premises and cloud (AWS/GCP) environments with Kubernetes integration. Requires 3+ years of experience, familiarity with zero-trust principles, IaC tools such as Terraform, and automation scripting.

130k – 180k · Bethesda, MD · DevOps / SRE · Hybrid · 3+ YOE · AWS · GCP

Infrastructure Engineer

Builds and scales highly available infrastructure using AWS, Terraform, and Docker to support rapid growth and AI workloads. Collaborates with product and research teams on architectures, CI/CD, monitoring, and performance optimization.

130k – 500k · San Francisco, CA · DevOps / SRE · On-site · Go · AWS

Software Engineer, Compute Platform

Build and optimize Replit's cloud infrastructure for scalable application deployment, focusing on reliability, cost efficiency, and global performance using distributed systems expertise.

130k – 290k · Foster City, CA · DevOps / SRE · Hybrid · Go · GCP