What You'll Own

Partner with service teams to define meaningful SLIs and SLOs grounded in customer experience, and build the error budget policies that turn them into engineering decisions
Own and evolve the Operational Readiness Review (ORR) process — conducting reviews for new services and major changes across observability, alerting, runbooks, capacity, and graceful degradation
Strengthen the incident-to-improvement pipeline by connecting postmortem findings to operational readiness gaps, identifying repeat failure patterns, and driving systemic fixes
Act as the reliability expert teams pull in for architecture reviews, failure mode analysis, dependency mapping, and resilience design
Identify and quantify operational toil across the org, and build or advocate for automation that eliminates it
Help teams design sustainable on-call practices: alert quality, escalation paths, runbook coverage, and noise reduction
Track and report on org-wide operational maturity, surfacing systemic gaps and driving remediation

Requirements

7+ years of experience in SRE, production engineering, or reliability-focused roles, including experience shaping SRE practices and driving adoption across engineering teams
Software engineering mindset: you write code and build tools rather than only configuring them
Hands-on experience defining and operationalizing SLOs/SLIs at scale, including error budget policies that actually influenced engineering decisions
Deep experience with incident response, postmortem facilitation, and turning incident learnings into systemic improvements
Experience working with large-scale multi-tenant systems (bonus: managed database platforms or Postgres)
Proficient with cloud infrastructure (AWS preferred) and infrastructure-as-code (Pulumi preferred, Terraform/CDK also acceptable)
Clear, persuasive communication: this role requires influencing without authority across a distributed org
Experience in async or globally distributed teams
Energized by making other teams more effective rather than being the one who fixes everything

Nice to Have

Experience with Kubernetes-based platform operations
Familiarity with OpenTelemetry, VictoriaMetrics, Grafana, or similar observability tooling
Experience building developer-facing reliability tooling (SLO dashboards, ORR frameworks, toil tracking, DORA metrics)