Production Engineer, IaaS

175k – 300k · San Francisco, CA · New York, NY · Austin, TX · Seattle, WA · DevOps / SRE · Onsite · 5+ YOE

Summary

Own observability, API surface, and control plane for a hyperscale AI compute fleet. Build production-grade data pipelines, stateful APIs, and Kubernetes infrastructure that other teams depend on.

About the role

Responsibilities

  • Own the observability platform: build and operate the data pipelines, the decoration and correlation engine, and the healthcheck framework that make the fleet legible from site down to device and link
  • Define and build the API surface for infrastructure: design contracts between production infrastructure and every tool that touches it
  • Build the production control plane: unified machine management, actual state inspection, distributed command execution, and the Kubernetes-based infrastructure that underpins it all
  • Own fleet state as source of truth: SLOs, site lifecycle state, and integration with internal infrastructure management and customer-facing operations platforms
  • Land new hardware into the platform cleanly: ZTP, DHCP, DNS, artifacts — every new XPU generation and site integration goes through IaaS before production

Requirements

  • Treat toil as a bug: if something requires a human to do it twice, build the thing that makes it not require a human
  • Design APIs that age well; avoid leaky abstractions at scale
  • Move toward ambiguity, build the map, and explain it to everyone else
  • Learn at a steep slope and reach real competence in unfamiliar domains fast
  • Carry a pager without flinching; run incidents, write postmortems, fix systemic causes
  • Fluent with AI tooling: LLM APIs, MCP servers, agentic frameworks; drive Claude Code, Cursor, or similar daily
  • Have shipped production services at scale that other teams depend on; comfortable working in any language with AI coding tools

Nice-to-Haves

  • Distributed systems and data pipeline engineering
  • Time-series observability stacks (Prometheus, Thanos, VictoriaMetrics)
  • API design and versioning at scale
  • Workflow and orchestration engines (Temporal, Cadence)
  • BMC/Redfish or hardware telemetry
  • Go, Python, and Postgres

Skills

Kubernetes, Prometheus, Go, Python, Postgres, Temporal, Distributed Systems, Observability, API Design, BMC/Redfish