Staff Production Operations Engineer

211k – 256k | United States | Remote | 5+ YOE | Jun 15

Summary

Staff-level role driving Redpanda's reliability operations program. Combines hands-on SRE with coordination of on-call, incident reviews, and AI-driven automation to improve global production reliability.

About the role

Responsibilities

Drive process improvements across the incident lifecycle: severity models, triage enforcement, alert noise reduction, and follow-up completion rates
Coordinate the on-call program across multiple geographies: manage schedules and shadow rotations, onboard new engineers, and ensure consistent coverage
Select incidents for post-incident review, facilitate blameless reviews, document findings, and track follow-up completion
Build AI agents to automate operational toil, including on-call automation, incident summarization, post-incident review prep, follow-up tracking, and on-call analytics
Maintain runbooks, playbooks, and incident process documentation

Requirements

5+ years of experience in site reliability engineering, DevOps, or production operations in large-scale, highly reliable environments
Track record of leading initiatives end-to-end, from design and planning to execution and production operation
Hands-on experience with incident management tooling (incident.io, PagerDuty, or similar) and observability stacks (Datadog, Grafana, Sentry, CloudWatch, or equivalent)
Strong fluency with reliability concepts: MTTD, MTTR, MTTA, error budgets, SLOs
Experience building automation and tooling to reduce operational toil
Proficiency in Go (or a comparable systems language, with willingness to ramp up)
Experience with AI-assisted software development workflows, including tools like Claude Code
Working knowledge of at least one of AWS / Azure / GCP, including infrastructure as code for system and network infrastructure
Strong written communication; ability to drive alignment across engineering teams without direct authority

Nice to Have

Hands-on experience building agents or automations using LLMs
Familiarity with Redpanda, Apache Kafka, or other streaming infrastructure
Prior experience in a fast-growing B2B infrastructure or developer tools company

Skills

Go, AWS, Azure, GCP, Datadog, Grafana, Sentry, CloudWatch, PagerDuty, incident.io, SRE, DevOps, Infrastructure as Code
