Production Engineer, Compute

175k – 300k | San Francisco, CA | New York, NY | Seattle, WA | Austin, TX | DevOps / SRE | Hybrid | 5+ YOE
Summary

Own end-to-end health, repair automation, and qualification of a hyperscale GPU/TPU compute fleet. Build metrics pipelines, firmware tooling, and self-healing repair workflows across Kubernetes and bare metal.

About the role

Role Scope

  • Own compute fleet health end to end. Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU and TPU in production — across Kubernetes-orchestrated workloads and bare metal, at scale.
  • Turn repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service.
  • Design and expand the XPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU and TPU generation.
  • Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.
  • Own end-to-end reliability, scalability, and operation of the compute fleet at scale.

What We're Looking For

  • Treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.
  • Instinct for hardware. Comfortable reasoning about failure modes at the firmware and silicon level.
  • Move toward ambiguity, not away from it. Walk into the fog, build the map, and explain it to everyone else.
  • Learn at a steep slope. Reach real competence in an unfamiliar domain fast.
  • Carry a pager without flinching. Run the incident, write the postmortem, fix the systemic cause, and move on.
  • Fluent with AI tooling. LLM APIs, MCP servers, and agentic frameworks; drive Claude Code, Cursor, or similar every day.
  • Track record of shipping production automation that other teams depend on. Comfortable working in any language with AI coding tools.

Bonus

  • Hardware lifecycle management and RMA automation.
  • BMC/Redfish or IPMI tooling.
  • GPU/TPU qualification or burn-in frameworks.
  • Workflow and orchestration engines (Temporal, Cadence).
  • Metrics and alerting pipelines (Prometheus, Grafana).
  • Go or Python.

Salary & Benefits

  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.
  • Base salary range: $175,000 - $300,000 per year, depending on experience, skills, qualifications, and location. Total compensation may also include equity in the form of stock options.
Skills

Kubernetes, Redfish, BMC, Prometheus, Grafana, Go, Python, Temporal, Cadence, IPMI