Staff Production Operations Engineer
211k – 256k | United States | Remote | 5+ YOE
Summary
Staff-level role driving Redpanda's reliability operations program. Combines hands-on SRE with coordination of on-call, incident reviews, and AI-driven automation to improve global production reliability.
About the role
Responsibilities
- Drive process improvements across the incident lifecycle: severity models, triage enforcement, alert noise reduction, and follow-up completion rates
- Coordinate the on-call program across multiple geographies: manage schedules and shadow rotations, onboard new engineers, and ensure consistent coverage
- Select incidents for post-incident review, facilitate blameless reviews, document findings, and track follow-up completion
- Build AI agents to automate operational toil, including on-call automation, incident summarization, post-incident review prep, follow-up tracking, and on-call analytics
- Maintain runbooks, playbooks, and incident process documentation
Requirements
- 5+ years of experience in site reliability engineering, DevOps, or production operations in large-scale, highly reliable environments
- Track record of leading initiatives end-to-end, from design and planning to execution and production operation
- Hands-on experience with incident management tooling (incident.io, PagerDuty, or similar) and observability stacks (Datadog, Grafana, Sentry, CloudWatch, or equivalent)
- Strong fluency with reliability concepts: MTTD, MTTR, MTTA, error budgets, SLOs
- Experience building automation and tooling to reduce operational toil
- Proficiency in Go (or a comparable systems language, with willingness to ramp up)
- Experience with AI-assisted software development workflows, including tools such as Claude Code
- Working knowledge of at least one of AWS / Azure / GCP, including infrastructure as code for system and network infrastructure
- Strong written communication; ability to drive alignment across engineering teams without direct authority
Nice to Have
- Hands-on experience building agents or automations using LLMs
- Familiarity with Redpanda, Apache Kafka, or other streaming infrastructure
- Prior experience in a fast-growing B2B infrastructure or developer tools company
Skills
Go, AWS, Azure, GCP, Datadog, Grafana, Sentry, CloudWatch, PagerDuty, incident.io, SRE, DevOps, Infrastructure as Code