Principal/Staff HPC Network Engineer

Designs, deploys, and maintains high-performance networks for large-scale GPU clusters in HPC environments. Requires 10+ years of experience, including InfiniBand or RoCEv2 networks in Clos topologies, plus a strong automation focus. Hybrid role based in San Francisco.

250k – 325k | San Francisco, CA | DevOps / SRE | Hybrid | 10+ YOE

About the role

Responsibilities

  • Architect and deploy new GPU clusters around the world.
  • Keep clusters running smoothly, participate in on-call rotation, and fix issues.
  • Lean into automation to enable deployments at scale.
  • Shape team culture, mentor junior engineers, and learn from customers.

Requirements

  • 10+ years of hands-on experience managing or architecting networks for at least one GPU cluster (ideally >1k GPUs).
  • Deep understanding of Ethernet (RoCEv2) and/or InfiniBand networks in Clos/fat-tree topologies.
  • Experience building HPC network architectures (eBGP, fat-tree, VXLAN, MCLAG, etc.).
  • Excitement for implementing zero-touch provisioning for large multi-layer networks.
  • Strong documentation skills.
  • Willingness to mentor junior engineers.
  • Open to coming into the San Francisco office 3-4 days per week.

Nice to Haves

  • Understanding of data center concepts including power, cooling, and colo providers.
  • Linux systems administration experience, including kernel drivers and network stack tuning.
  • Experience with Linux virtualization (KVM, QEMU, libvirt).
  • Exposure to containers and Kubernetes operators.

Benefits

  • Competitive salary with generous equity grant.
  • 401(k) matching up to 4%.
  • Medical, dental, and vision premiums 100% covered.
  • Unlimited PTO plus 10+ holidays.
  • Parental leave.
  • Daily lunch, unlimited office book budget.
  • Visa sponsorship available.

Skills

InfiniBand, RoCEv2, Ethernet, Clos, Fat-Tree, eBGP, VXLAN, MCLAG, Kubernetes, Linux, KVM, QEMU, libvirt
