
AI/HPC Network Development Engineer - Networking

Palo Alto, CA | Memphis, TN | Onsite | 10+ YOE

Summary

Develops and optimizes high-performance Ethernet networks for massive AI/HPC GPU clusters using RoCEv2 and NCCL. Requires 10+ years of network experience, 5+ years in AI/HPC Ethernet, and Python automation skills. Travel to data centers is required.

About the role

Required Qualifications

  • Minimum 10 years designing/operating large-scale networks, 5+ years in Ethernet AI/HPC
  • Deep understanding of Ethernet congestion control (RoCEv2); InfiniBand experience is a bonus
  • Deep knowledge of AI training/inference workloads, NCCL usage/debugging
  • Expertise in performance/operations metrics for training/inference optimization
  • Python experience for automation and large data analysis

Skills

RoCEv2, NCCL, Ethernet, InfiniBand, Python, Congestion Control, AI Training, HPC Networking, Metrics Dashboards, Network Automation