Site Reliability Engineer
United States | Remote | 7+ YOE
Summary
SRE embedded in Service Operations to establish reliability practices, frameworks, and feedback loops across engineering teams. Focus on SLOs/SLIs, Operational Readiness Review (ORR) processes, incident-to-improvement pipelines, and influencing without authority in a distributed environment.
About the role
What You'll Own
- Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience, and build the error budget policies that turn them into engineering decisions
- Own and evolve the Operational Readiness Review (ORR) process — conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation
- Strengthen the incident-to-improvement pipeline: connect postmortem findings to operational readiness gaps, identify repeat failure patterns, and drive systemic fixes
- Act as the reliability expert teams pull in for architecture reviews, failure mode analysis, dependency mapping, and resilience design
- Identify and quantify operational toil across the org, and build or advocate for automation that eliminates it
- Help teams design sustainable on-call practices: alert quality, escalation paths, runbook coverage, and noise reduction
- Track and report on org-wide operational maturity, surfacing systemic gaps and driving remediation
Requirements
- 7+ years of experience in SRE, production engineering, or reliability-focused roles, including experience shaping SRE practices and driving adoption across engineering teams
- Software engineering mindset — write code and build tools, not just configure them
- Hands-on experience defining and operationalizing SLOs/SLIs at scale, including error budget policies that actually influenced engineering decisions
- Deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements
- Experience with large-scale multi-tenant systems (bonus: managed database platforms or Postgres)
- Proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable)
- Communicate clearly and persuasively — this role requires influencing without authority across a distributed org
- Experience in async or globally distributed teams
- Energized by making other teams more effective rather than being the one who fixes everything
Nice to Have
- Experience with Kubernetes-based platform operations
- Familiarity with OpenTelemetry, VictoriaMetrics, Grafana, or similar observability tooling
- Experience building developer-facing reliability tooling (SLO dashboards, ORR frameworks, toil tracking, DORA metrics)
Skills
SRE, SLOs, SLIs, error budgets, incident response, postmortems, AWS, Pulumi, Terraform, Kubernetes, OpenTelemetry, Grafana, observability