Principal/Staff HPC Network Engineer

Designs, deploys, and maintains high-performance networks for large-scale GPU clusters in HPC environments. Requires 10+ years of experience with InfiniBand/RoCEv2 in Clos topologies and with network automation. Hybrid role based in San Francisco.

250k – 325k | San Francisco, CA | DevOps / SRE | Hybrid | 10+ YOE

About the role

Responsibilities

Architect and deploy new GPU clusters around the world.
Keep clusters running smoothly, participate in on-call rotation, and fix issues.
Lean into automation to enable deployments at scale.
Shape team culture, mentor junior engineers, and learn from customers.

Requirements

10+ years of experience with hands-on management or architecture of networks for at least one GPU cluster (ideally >1k GPUs).
Deep understanding of Ethernet (RoCEv2) and/or InfiniBand networks in Clos/fat-tree topologies.
Experience building HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.).
Excitement for implementing zero-touch provisioning for large multi-layer networks.
Strong documentation skills.
Willingness to mentor junior engineers.
Open to coming into the San Francisco office 3 to 4 days per week.

Nice to Haves

Understanding of data center concepts including power, cooling, and colo providers.
Linux systems administration experience, including kernel drivers and network stack tuning.
Experience with Linux virtualization (KVM, QEMU, libvirt).
Exposure to containers and Kubernetes operators.

Benefits

Competitive salary with generous equity grant.
401(k) matching up to 4%.
Medical, dental, and vision premiums 100% covered.
Unlimited PTO + 10+ holidays.
Parental leave.
Daily lunch, unlimited office book budget.
Visa sponsorship available.

Skills

InfiniBand, RoCEv2, Ethernet, Clos, Fat-Tree, eBGP, VXLAN, MCLAG, Kubernetes, Linux, KVM, QEMU, libvirt
