Research Scientist / Engineer – Training Infrastructure

Builds and optimizes distributed training infrastructure for large-scale multimodal AI models across thousands of GPUs. Requires deep expertise in PyTorch, CUDA, parallelization techniques, and GPU clusters.

188k – 395k · Palo Alto, CA · ML Engineering · Hybrid


About the role

Responsibilities

Design, implement, and optimize efficient distributed training systems for models trained across thousands of GPUs
Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
Build monitoring, visualization, and debugging tools for large-scale training runs
Optimize training stability, convergence, and resource utilization across massive clusters

Experience

Extensive experience with distributed PyTorch training and parallelism strategies for foundation model training
Deep understanding of GPU clusters, networking, and storage systems
Familiarity with communication libraries (NCCL, MPI) and distributed system optimization

(Preferred)

Strong Linux systems administration and scripting capabilities
Experience managing training runs across >100 GPUs
Experience with containerization, orchestration, and cloud infrastructure

Compensation

Base pay range: $187,500 – $395,000 per year

Skills

PyTorch, CUDA, Distributed Systems, FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel, NCCL, MPI, Kubernetes
