Staff Production Operations Engineer

211k – 256k · United States · Remote · 5+ YOE
Summary

Staff-level role driving Redpanda's reliability operations program. Combines hands-on SRE with coordination of on-call, incident reviews, and AI-driven automation to improve global production reliability.

About the role

Responsibilities

  • Drive process improvements across the incident lifecycle: severity models, triage enforcement, alert noise reduction, and follow-up completion rates
  • Coordinate the on-call program across multiple geographies: manage schedules and shadow rotations, onboard new engineers, and ensure consistent coverage
  • Select incidents for post-incident review, facilitate blameless post-incident reviews, document findings, and track follow-up completion
  • Build AI agents to automate operational toil, including on-call automation, incident summarization, post-incident review prep, follow-up tracking, and on-call analytics
  • Maintain runbooks, playbooks, and incident process documentation

Requirements

  • 5+ years of experience in site reliability engineering, DevOps, or production operations in large-scale, highly reliable environments
  • Track record of leading initiatives end-to-end, from design and planning to execution and production operation
  • Hands-on experience with incident management tooling (incident.io, PagerDuty, or similar) and observability stacks (Datadog, Grafana, Sentry, CloudWatch, or equivalent)
  • Strong fluency with reliability concepts: MTTD, MTTR, MTTA, error budgets, SLOs
  • Experience building automation and tooling to reduce operational toil
  • Proficiency in Go (or comparable systems language with willingness to ramp)
  • Experience with AI-assisted software development workflows including tools like Claude Code
  • Working knowledge of at least one of AWS / Azure / GCP, including infrastructure as code for system and network infrastructure
  • Strong written communication; ability to drive alignment across engineering teams without direct authority

Nice to Have

  • Hands-on experience building agents or automations using LLMs
  • Familiarity with Redpanda, Apache Kafka, or other streaming infrastructure
  • Prior experience in a fast-growing B2B infrastructure or developer tools company
Skills

Go, AWS, Azure, GCP, Datadog, Grafana, Sentry, CloudWatch, PagerDuty, incident.io, SRE, DevOps, Infrastructure as Code
Similar roles at this salary range
Fivetran

Senior Site Reliability Engineer

Senior SRE responsible for production infrastructure reliability, incident response, deployment automation, and scaling SaaS systems on Kubernetes and major cloud platforms.

175k – 210k · Oakland, CA · DevOps / SRE · Hybrid · 5+ YOE · AWS · GCP
Dropbox

Senior Infrastructure Software Engineer, Storage Core

Senior engineer building and operating Dropbox's exabyte-scale distributed storage systems. Focus on replication, erasure coding, performance, and reliability in Go/Rust.

180k – 274k · United States · DevOps / SRE · Remote · 9+ YOE · Go · C++
Okta

Staff Site Reliability Engineer - Observability

Staff SRE focused on building and scaling a comprehensive observability platform on GCP using Terraform, Splunk, and Grafana. Requires 5+ years GCP observability experience and strong coding skills in Python or Go.

194k – 267k · Bellevue, WA +4 · DevOps / SRE · Hybrid · 5+ YOE · Go · GKE
Cribl

Sr Software Engineer, Storage

Senior Software Engineer on the Storage team building autoscaling, self-healing infrastructure-as-code systems that manage petabyte-scale telemetry storage on AWS.

175k – 205k · United States · DevOps / SRE · Remote · 5+ YOE · Go · S3
Stuut

Lead Voice Infrastructure Engineer

Lead the design and operation of scalable telephony infrastructure powering AI voice agents for accounts receivable workflows, including SIP trunking, call routing, realtime media, and integrations with speech systems.

250k – 290k · San Francisco, CA +1 · DevOps / SRE · On-site · 7+ YOE · C · Go