Reliability Engineer, Supercomputing
Ensure reliability of large GPU supercomputing clusters by diagnosing hardware/firmware/OS issues, automating monitoring, driving firmware rollouts, and working directly with vendors.
What You’ll Do
- Investigate, reproduce, and remediate issues across large GPU clusters.
- Own the drivers, kernel surface, and diagnostics that span hardware, firmware, and OS.
- Automate the monitoring of fleet reliability and analyze error rates to validate whether a fix or firmware change measurably reduced failures rather than shifting them around.
- Drive the firmware lifecycle: tracking, qualification, staged rollout, and regression analysis.
- Engage vendors directly — GPUs, server OEMs, NIC vendors, and storage vendors — to get real fixes rather than ticket numbers. Manage RMA flows when hardware needs to come out.
- Monitor and improve GPU hardware health signals and turn them into actionable reliability improvements.
- Write clear postmortems and vendor cases that move issues forward.
Skills and Qualifications
Minimum qualifications:
- Bachelor’s degree or equivalent experience in computer science, engineering, or a similar field.
- Proficiency in at least one backend language (we use Python and Rust).
- Experience operating large-scale clusters with container orchestration or job scheduling systems (e.g. Kubernetes or Slurm).
- Comfort operating across the stack and owning projects end-to-end.
- Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
- A bias for action: you take the initiative to work across stacks and teams wherever you spot an opportunity to make sure something ships.
Preferred qualifications:
- Fluency with Linux systems and debugging tools.
- Proven statistical rigor in analyzing reliability.
- A track record of debugging a problem from application symptom to the root cause in hardware.
- Comfort reading vendor errata, firmware release notes, and kernel changelogs.
- Experience engaging hardware vendors directly — not just through escalation portals.
- Linux kernel literacy: the scheduler, memory management, IRQ paths, and the driver model.
- Out-of-band management experience: BMC / iDRAC / IPMI / Redfish.
- Depth in GPU hardware health: Xid error taxonomy, NVLink, NVSwitch, fabric manager, and DCGM.
- Significant ownership of the hardware reliability function at scale.
- Strong writing skills for vendor cases and postmortems.
- An instinct for telling apart a flaky machine, a flaky workload, and a flaky test.
Network Engineer, Supercomputing
Own and debug the multi-thousand-GPU network fabric (RDMA/RoCE, NVLink/NVSwitch) for large-scale AI training and inference. Requires backend language proficiency, large-scale cluster experience, and cross-stack ownership.
Staff Software Engineer, Developer Productivity
Staff-level IC role owning end-to-end CI/CD, merge queue, and deploy pipelines for Anthropic's engineering org. Focus on AI-assisted review, test reliability, and progressive delivery at monorepo scale.
Staff Software Engineer, Developer Productivity
Staff-level engineer to own end-to-end development environments at Anthropic, focusing on container lifecycle, cold-start optimization, environment isolation, and pre-push validation for AI researchers and engineers.
Staff Software Engineer, Node Infra
Own technical strategy and roadmap for node lifecycle management, health automation, and scaling AI clusters across clouds and accelerators. Requires deep distributed systems expertise, ML accelerator experience, and 12+ years leading complex multi-team infrastructure initiatives.
Staff Software Engineer, Kubernetes Platform
Senior-level engineer to own and scale Anthropic's massive Kubernetes control plane and scheduler for training frontier AI models across hundreds of thousands of nodes. Requires deep Kubernetes internals experience and 12+ years building production distributed systems.