Network Engineer, Supercomputing
Own and debug multi-thousand-GPU network fabric (RDMA/RoCE, NVLink/NVSwitch) for large-scale AI training and inference. Requires backend language proficiency, large-scale cluster experience, and cross-stack ownership.
What You’ll Do
- Reason about and validate GPU network fabric design across our deployments.
- Debug RDMA / RoCEv2 across different NIC vendors. Diagnose production NCCL collective failures, PFC/ECN tuning issues, and congestion-control behavior.
- Own the NVLink / NVSwitch interconnect, including fabric manager and IMEX health, link and lane errors, and how the GPU fabric interacts with collectives.
- Build host-level network instrumentation with Linux tooling, producing dashboards and alerts rather than just a bug report.
- Navigate cross-cloud fabric quirks across providers and triage across the NIC, driver, kernel, switch, and workload boundaries.
- Drive escalations with cloud-provider networking teams, owning issues end-to-end until they're resolved.
Skills and Qualifications
Minimum qualifications:
- Bachelor’s degree or equivalent experience in computer science, engineering, or similar.
- Proficiency in at least one backend language (we use Python and Rust).
- Experience operating large-scale clusters and container orchestration systems (e.g. Kubernetes or Slurm).
- Comfort operating across the stack and owning projects end-to-end.
- Ability to thrive in a highly collaborative environment with many different cross-functional partners and subject matter experts.
- A bias for action: you take initiative across stacks and teams wherever you see an opportunity to make sure something ships.
Preferred qualifications:
- Fluency with host-level debugging tools on Linux.
- Strong communication skills, internally and with cloud providers.
- Extensive experience with at least one of the following:
- Familiarity with cloud network primitives across at least two cloud providers.
- Hands-on experience with NVLink / NVSwitch, fabric manager, and IMEX.
- Statistical rigor in reliability reasoning: comfort with failure and error rates, distributions, and base rates, and the judgment to separate signal from noise when characterizing a large fabric.
- A track record of writing tooling that made the next debugging session meaningfully faster.
- Familiarity with CUDA/NCCL and performance profiling for distributed training and inference.
- Understanding of deep learning frameworks and their underlying system architectures.
Compensation and Benefits
- Compensation: $350,000 - $475,000 USD annual salary range.
- Benefits: Generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
- Visa sponsorship available.