AI/HPC Network Development Engineer - Networking
Palo Alto, CA | Memphis, TN | Onsite | 10+ YOE
Summary
Develops and optimizes high-performance Ethernet networks for massive AI/HPC GPU clusters using RoCEv2 and NCCL. Requires 10+ years of network experience, 5+ years in AI/HPC Ethernet, Python automation skills, and travel to data centers.
About the role
Required Qualifications
- Minimum 10 years designing/operating large-scale networks, 5+ years in Ethernet AI/HPC
- Deep understanding of Ethernet congestion control (RoCEv2); InfiniBand experience is a bonus
- Deep knowledge of AI training/inference workloads, NCCL usage/debugging
- Expertise in performance/operations metrics for training/inference optimization
- Python experience for automation and large data analysis
Skills
RoCEv2, NCCL, Ethernet, InfiniBand, Python, Congestion Control, AI Training, HPC Networking, Metrics Dashboards, Network Automation