Production Engineer, IaaS

175k – 300k | San Francisco, CA | New York, NY | Austin, TX | Seattle, WA | DevOps / SRE | Onsite | 5+ YOE | Jun 9

Summary

Own observability, API surface, and control plane for a hyperscale AI compute fleet. Build production-grade data pipelines, stateful APIs, and Kubernetes infrastructure that other teams depend on.

About the role

Responsibilities

Own the observability platform: build and operate the data pipelines, the decoration and correlation engine, and the healthcheck framework that make the fleet legible from site down to device and link
Define and build the API surface for infrastructure: design contracts between production infrastructure and every tool that touches it
Build the production control plane: unified machine management, actual state inspection, distributed command execution, and the Kubernetes-based infrastructure that underpins it all
Own fleet state as source of truth: SLOs, site lifecycle state, and integration with internal infrastructure management and customer-facing operations platforms
Land new hardware into the platform cleanly: ZTP, DHCP, DNS, and artifacts. Every new XPU generation and site integration goes through IaaS before production

Requirements

Treat toil as a bug: if something requires a human to do it twice, build the thing that makes it not require a human
Design APIs that age well; avoid leaky abstractions at scale
Move toward ambiguity, build the map, and explain it to everyone else
Learn at a steep slope and reach real competence in unfamiliar domains fast
Carry a pager without flinching; run incidents, write postmortems, fix systemic causes
Fluent with AI tooling: LLM APIs, MCP servers, agentic frameworks; drive Claude Code, Cursor, or similar daily
Have shipped production services that other teams depend on at scale; comfortable in any language using AI coding tools

Nice-to-Haves

Distributed systems and data pipeline engineering
Time-series observability stacks (Prometheus, Thanos, VictoriaMetrics)
API design and versioning at scale
Workflow and orchestration engines (Temporal, Cadence)
BMC/Redfish or hardware telemetry
Go, Python, and Postgres

Skills

Kubernetes, Prometheus, Go, Python, Postgres, Temporal, Distributed Systems, Observability, API Design, BMC/Redfish
