Production Engineer, Compute
175k – 300k | San Francisco, CA | New York, NY | Seattle, WA | Austin, TX | DevOps / SRE | Hybrid | 5+ YOE
Summary
Own end-to-end health, repair automation, and qualification of a hyperscale GPU/TPU compute fleet. Build metrics pipelines, firmware tooling, and self-healing repair workflows across Kubernetes and bare metal.
About the role
Role Scope
- Own compute fleet health end to end. Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU and TPU in production — across Kubernetes-orchestrated workloads and bare metal, at scale.
- Turn repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service.
- Design and expand the XPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU and TPU generation.
- Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.
- Own end-to-end reliability, scalability, and operation of the compute fleet at scale.
What We're Looking For
- Treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.
- Instinct for hardware. Comfortable reasoning about failure modes at the firmware and silicon level.
- Move toward ambiguity, not away from it. Walk into the fog, build the map, and explain it to everyone else.
- Learn at a steep slope. Reach real competence in an unfamiliar domain fast.
- Carry a pager without flinching. Run the incident, write the postmortem, fix the systemic cause, and move on.
- Fluent with AI tooling. LLM APIs, MCP servers, and agentic frameworks; drive Claude Code, Cursor, or similar every day.
- Track record of shipping production automation that other teams depend on. Comfortable working in any language with the help of AI coding tools.
Bonus
- Hardware lifecycle management and RMA automation.
- BMC/Redfish or IPMI tooling.
- GPU/TPU qualification or burn-in frameworks.
- Workflow and orchestration engines (Temporal, Cadence).
- Metrics and alerting pipelines (Prometheus, Grafana).
- Go or Python.
Salary & Benefits
- Competitive total compensation package (salary + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy, in line with local norms.
- Base salary range: $175,000 - $300,000 per year, depending on experience, skills, qualifications, and location. Total compensation may also include equity in the form of stock options.
Skills
Kubernetes, Redfish, BMC, Prometheus, Grafana, Go, Python, Temporal, Cadence, IPMI