Site Reliability Engineer

Owns production reliability for critical systems, builds the SRE function from scratch, and introduces modern practices such as SLIs/SLOs and error budgets. Requires 5+ years of SRE experience with large-scale distributed systems.

130k – 500k | San Francisco, CA | DevOps / SRE | On-site | 5+ YOE

About the role

What You’ll Do

Own reliability and production safety for core shared services and customer-facing systems.
Partner directly with infrastructure leadership to define SRE priorities, reliability standards, and production safety roadmap.
Repair and improve how our production systems are structured so they are stable, resource-efficient, isolated, and well-observed.
Introduce and champion modern SRE practices (e.g., incident response, postmortems, SLIs/SLOs) across engineering teams.
Collaborate with leverage engineering and applied AI teams to ensure sustainable growth.
Represent SRE best practices internally and help teams onboard onto production in a way that is safe, scalable, and consistent with SRE principles.

What We’re Looking For

Experience doing true SRE work (not just operations) across multiple roles or companies.
Deep familiarity with SRE practices as popularized by Google (e.g., error budgets, reliability vs. risk trade-offs, large-scale distributed systems).
5+ years of SRE experience; 15+ years of overall experience is ideal for this first SRE hire.
Proven success operating systems at scale, with a strong understanding of the challenges of large, distributed production environments.
Strong collaboration skills; able to work effectively with cross-functional engineering teams.
Ability to drive cultural change around reliability while remaining hands-on in building and fixing systems.
Comfort working in high-intensity, high-availability environments where uptime and production quality are critical.

Nice to Haves

Experience as a founding SRE or early SRE hire, standing up SRE practices and orgs from scratch.
Hands-on experience in the AWS ecosystem, Kubernetes, and modern IaC tooling (Terraform, Spacelift, etc.).

Skills

Kubernetes, AWS, Terraform, SRE Practices, SLIs/SLOs, Error Budgets, Incident Response, Postmortems, Distributed Systems, IaC

Similar roles

DevOps / SRE jobs

Trexquant

Linux Systems Engineer (USA)

Hands-on Linux Systems Engineer builds and maintains bare-metal servers, manages storage like ZFS, automates with Ansible and Bash, and ensures production reliability. Requires 3+ years Linux experience, physical server management, and on-call rotation with data center travel.

130k – 150k | Stamford, CT +1 | DevOps / SRE | On-site | 3+ YOE | ZFS | Bash

Nominal

Software Engineer - Developer Infrastructure

Develop and maintain developer tooling and infrastructure for Nominal's platform, scaling across air-gapped, cloud, and on-prem environments. Requires 4+ years experience with cloud services, Docker, Kubernetes, CI/CD, and ability to mentor engineers.

130k – 230k | New York, NY +2 | DevOps / SRE | On-site | 4+ YOE | AWS | GCP

Black Canyon Consulting

Vault Application Engineer/Administrator (HashiCorp)

Designs, deploys, and manages HashiCorp Vault clusters for secure secret management in on-premises and cloud (AWS/GCP) hybrid environments with Kubernetes integration. Requires 3+ years experience, zero trust principles, IaC tools like Terraform, and automation scripting.

130k – 180k | Bethesda, MD | DevOps / SRE | Hybrid | 3+ YOE | AWS | GCP

Mercor

Infrastructure Engineer

Builds and scales highly available infrastructure using AWS, Terraform, and Docker to support rapid growth and AI workloads. Collaborates with product and research teams on architectures, CI/CD, monitoring, and performance optimization.

130k – 500k | San Francisco, CA | DevOps / SRE | On-site | Go | AWS

Replit

Software Engineer, Compute Platform

Build and optimize Replit's cloud infrastructure for scalable application deployment, focusing on reliability, cost efficiency, and global performance using distributed systems expertise.

130k – 290k | Foster City, CA | DevOps / SRE | Hybrid | Go | GCP