Responsibilities
- Design, implement, and optimize efficient distributed training systems for models with thousands of GPUs
- Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
- Build monitoring, visualization, and debugging tools for large-scale training runs
- Optimize training stability, convergence, and resource utilization across massive clusters
Experience
- Extensive experience with distributed PyTorch training and parallelisms in foundation model training
- Deep understanding of GPU clusters, networking, and storage systems
- Familiarity with communication libraries (NCCL, MPI) and distributed system optimization
(Preferred)
- Strong Linux systems administration and scripting capabilities
- Experience managing training runs across >100 GPUs
- Experience with containerization, orchestration, and cloud infrastructure
Compensation
Base pay range: $187,500 – $395,000 per year