Reliability Engineer, Supercomputing

350k – 475k | San Francisco, CA | DevOps / SRE | Onsite | Jun 24

Summary

Ensure reliability of large GPU supercomputing clusters by diagnosing hardware/firmware/OS issues, automating monitoring, driving firmware rollouts, and working directly with vendors.

About the role

What You’ll Do

Investigate, reproduce, and remediate issues across large GPU clusters.
Own the drivers, kernel surface, and diagnostics that span hardware, firmware, and OS.
Automate the monitoring of fleet reliability and analyze error rates to validate whether a fix or firmware change measurably reduced failures rather than shifting them around.
Drive the firmware lifecycle: tracking, qualification, staged rollout, and regression analysis.
Engage vendors directly — GPUs, server OEMs, NIC vendors, and storage vendors — to get real fixes rather than ticket numbers. Manage RMA flows when hardware needs to come out.
Monitor and improve GPU hardware health signals and turn them into actionable reliability improvements.
Write clear postmortems and vendor cases that move issues forward.

Skills and Qualifications

Minimum qualifications:

Bachelor’s degree or equivalent experience in computer science, engineering, or similar.
Proficiency in at least one backend language (we use Python and Rust).
Experience operating large-scale clusters with orchestration and scheduling systems (e.g. Kubernetes or Slurm).
Comfort operating across the stack and owning projects end-to-end.
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action: you take the initiative to work across different stacks and teams wherever you spot an opportunity to make sure something ships.

Preferred qualifications:

Fluency with Linux systems and debugging tools.
Proven statistical rigor in analyzing reliability.
A track record of debugging a problem from application symptom to the root cause in hardware.
Comfort reading vendor errata, firmware release notes, and kernel changelogs.
Experience engaging hardware vendors directly — not just through escalation portals.
Linux kernel literacy: the scheduler, memory management, IRQ paths, and the driver model.
Out-of-band management experience: BMC / iDRAC / IPMI / Redfish.
Depth in GPU hardware health: Xid error taxonomy, NVLink, NVSwitch, fabric manager, and DCGM.
Significant ownership of the hardware reliability function at scale.
Strong writing skills for vendor cases and postmortems.
An instinct for telling apart a flaky machine, a flaky workload, and a flaky test.

Skills

Python, Rust, Kubernetes, Slurm, Linux, BMC, iDRAC, IPMI, Redfish, DCGM, NVLink, NVSwitch
