Network Engineer, Supercomputing

350k – 475k · San Francisco, CA · DevOps / SRE · Onsite · Jun 24

Summary

Own and debug multi-thousand-GPU network fabric (RDMA/RoCE, NVLink/NVSwitch) for large-scale AI training and inference. Requires backend language proficiency, large-scale cluster experience, and cross-stack ownership.

About the role

What You’ll Do

Reason about and validate GPU network fabric design across our deployments.
Debug RDMA / RoCEv2 across different NIC vendors. Diagnose production NCCL collective failures, PFC/ECN tuning problems, and congestion control behavior.
Own NVLink / NVSwitch interconnect — including fabric manager and IMEX health, link and lane errors, and how the GPU fabric interacts with collectives.
Build host-level network instrumentation with Linux tooling, and turn findings into dashboards and alerts rather than stopping at the bug report.
Navigate cross-cloud fabric quirks across providers and triage across the NIC, driver, kernel, switch, and workload boundaries.
Drive escalations with cloud-provider networking teams, owning issues end-to-end until they're resolved.

Skills and Qualifications

Minimum qualifications:

Bachelor’s degree or equivalent experience in computer science, engineering, or similar.
Proficiency in at least one backend language (we use Python and Rust).
Experience operating large‑scale clusters and container orchestration systems (e.g. Kubernetes or Slurm).
Comfort operating across the stack and owning projects end-to-end.
Ability to thrive in a highly collaborative environment involving many different cross-functional partners and subject matter experts.
A bias for action: you take initiative across different stacks and teams wherever you spot an opportunity to make sure something ships.

Preferred qualifications:

Fluency with host-level debugging tools on Linux.
Strong communication skills, internally and with cloud providers.
Extensive experience with at least one of the following:
- Familiarity with cloud network primitives across at least two cloud providers.
- Hands-on experience with NVLink / NVSwitch, fabric manager, and IMEX.
- Statistical rigor in reliability reasoning — comfort reasoning about failure and error rates, distributions, and base rates, and the judgment to separate signal from noise when characterizing a large fabric.
- A track record of writing tooling that made the next debugging session meaningfully faster.
- Familiarity with CUDA/NCCL and performance profiling for distributed training and inference.
- Understanding of deep learning frameworks and their underlying system architectures.

Compensation and Benefits

Compensation: $350,000 - $475,000 USD annual salary range.
Benefits: Generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
Visa sponsorship available.

Skills

Python, Rust, Kubernetes, Slurm, RDMA, RoCEv2, NCCL, NVLink, NVSwitch, CUDA, Linux
