# Research Scientist / Engineer – Training Infrastructure
**Company:** [Luma AI](https://hotfix.jobs/companies/lumalabs-ai)
**Location:** Palo Alto, CA
**Salary:** $187.5K-$395K
**Skills:** PyTorch, CUDA, Distributed Systems, FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel, NCCL, MPI, Kubernetes
**Posted:** 2025-03-12
> Builds and optimizes distributed training infrastructure for large-scale multimodal AI models across thousands of GPUs. Requires deep expertise in PyTorch, CUDA, parallelization techniques, and GPU clusters.
## Job Description
## Responsibilities
- Design, implement, and optimize efficient distributed training systems for models trained across thousands of GPUs
- Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
- Build monitoring, visualization, and debugging tools for large-scale training runs
- Optimize training stability, convergence, and resource utilization across massive clusters

## Experience
- Extensive experience with distributed PyTorch training and parallelism techniques for foundation model training
- Deep understanding of GPU clusters, networking, and storage systems
- Familiarity with communication libraries (NCCL, MPI) and distributed system optimization

**(Preferred)**
- Strong Linux systems administration and scripting capabilities
- Experience managing training runs across >100 GPUs
- Experience with containerization, orchestration, and cloud infrastructure

## Compensation
Base pay range: $187,500 – $395,000 per year
**Apply:** https://hotfix.jobs/jobs/research-scientist-engineer-training-infrastructure-at-lumalabs-ai-df564675-5da1-475c-a6b4-e81cc62da4eb
**Canonical:** https://hotfix.jobs/jobs/research-scientist-engineer-training-infrastructure-at-lumalabs-ai-df564675-5da1-475c-a6b4-e81cc62da4eb