Site Reliability Engineer

United States, Remote, 7+ YOE
Summary

SRE embedded in Service Operations to establish reliability practices, frameworks, and feedback loops across engineering teams. Focus on SLOs/SLIs, ORR processes, incident-to-improvement pipelines, and influencing without authority in a distributed environment.

About the role

What You'll Own

  • Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience, and build the error budget policies that turn them into engineering decisions
  • Own and evolve the Operational Readiness Review (ORR) process — conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation
  • Strengthen the incident-to-improvement pipeline by connecting postmortem findings to operational readiness gaps, identifying repeat failure patterns, and driving systemic fixes
  • Act as the reliability expert teams pull in for architecture reviews, failure mode analysis, dependency mapping, and resilience design
  • Identify and quantify operational toil across the org, and build or advocate for automation that eliminates it
  • Help teams design sustainable on-call practices: alert quality, escalation paths, runbook coverage, and noise reduction
  • Track and report on org-wide operational maturity, surfacing systemic gaps and driving remediation

Requirements

  • 7+ years of experience in SRE, production engineering, or reliability-focused roles, including experience shaping SRE practices and driving adoption across engineering teams
  • Software engineering mindset — write code and build tools, not just configure them
  • Hands-on experience defining and operationalizing SLOs/SLIs at scale, including error budget policies that actually influenced engineering decisions
  • Deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements
  • Experience with large-scale multi-tenant systems (bonus: managed database platforms or Postgres)
  • Proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable)
  • Communicate clearly and persuasively — this role requires influencing without authority across a distributed org
  • Experience in async or globally distributed teams
  • Energized by making other teams more effective rather than being the one who fixes everything

Nice to Have

  • Experience with Kubernetes-based platform operations
  • Familiarity with OpenTelemetry, VictoriaMetrics, Grafana, or similar observability tooling
  • Experience building developer-facing reliability tooling (SLO dashboards, ORR frameworks, toil tracking, DORA metrics)

Skills

SRE, SLOs, SLIs, error budgets, incident response, postmortems, AWS, Pulumi, Terraform, Kubernetes, OpenTelemetry, Grafana, observability