Production Engineer, Compute

175k – 300k | San Francisco, CA | New York, NY | Seattle, WA | Austin, TX | DevOps / SRE | Hybrid | 5+ YOE | Jun 9

Summary

Own end-to-end health, repair automation, and qualification of a hyperscale GPU/TPU compute fleet. Build metrics pipelines, firmware tooling, and self-healing repair workflows across Kubernetes and bare metal.

About the role

Role Scope

Own compute fleet health end to end. Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU and TPU in production, across Kubernetes-orchestrated workloads and bare metal, at scale.
Turn repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service.
Design and expand the XPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU and TPU generation.
Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.
Own end-to-end reliability, scalability, and operation of the compute fleet at scale.

What We're Looking For

Treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.
Instinct for hardware. Comfortable reasoning about failure modes at the firmware and silicon level.
Move toward ambiguity, not away from it. Walk into the fog, build the map, and explain it to everyone else.
Learn at a steep slope. Reach real competence in an unfamiliar domain fast.
Carry a pager without flinching. Run the incident, write the postmortem, fix the systemic cause, and move on.
Fluent with AI tooling. LLM APIs, MCP servers, and agentic frameworks; drive Claude Code, Cursor, or similar every day.
Have shipped production automation that other teams depend on. Comfortable working in any language with AI coding tools.

Bonus

Hardware lifecycle management and RMA automation.
BMC/Redfish or IPMI tooling.
GPU/TPU qualification or burn-in frameworks.
Workflow and orchestration engines (Temporal, Cadence).
Metrics and alerting pipelines (Prometheus, Grafana).
Go or Python.

Salary & Benefits

Competitive total compensation package (salary + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.
Base salary range: $175,000 - $300,000 per year, depending on experience, skills, qualifications, and location. Total compensation may also include equity in the form of stock options.

Skills

Kubernetes, Redfish, BMC, Prometheus, Grafana, Go, Python, Temporal, Cadence, IPMI
