Research Engineer, Infrastructure
San Francisco, CA | DevOps / SRE | Onsite
Summary
Builds and owns distributed training infrastructure, experiment orchestration, data pipelines, and performance optimizations for large-scale AI research on GPU clusters. Requires deep systems expertise, Python/C++/PyTorch proficiency, and ML understanding to accelerate frontier research.
About the role
Responsibilities
- Build and own distributed training infrastructure for large-scale GPU clusters, including job launchers, checkpointing, recovery, fault tolerance, and monitoring.
- Own infrastructure for scaling agent rollouts in VM sandboxes at RL training scale.
- Profile and optimize training throughput across data loading, communication, memory, and compute efficiency to improve step time and model FLOPs utilization (MFU).
- Design experiment orchestration and tooling to launch, track, and analyze experiments.
- Build high-throughput, reliable data pipelines for training and evaluation.
- Debug and resolve training failures across GPUs, networking, numerics, and data.
- Implement and optimize parallelism strategies: data, tensor, pipeline, and sequence parallelism.
- Anticipate research needs and build scaling infrastructure proactively.
Requirements
- Deep experience building/operating distributed training systems for large models.
- Strong systems engineering: distributed systems, networking, storage, performance reasoning.
- Proficiency in Python and C++, plus systems-level experience with PyTorch or an equivalent framework.
- Hands-on experience with GPU performance profiling, memory optimization, and compute efficiency.
- Experience with parallelism strategies for large model training.
- Track record building tooling to accelerate research workflows.
- Strong debugging in complex distributed systems.
- Enough ML knowledge to engage productively with researchers.
Skills
Python, C++, PyTorch, Distributed Systems, GPU Programming, Parallelism, Data Pipelines, Experiment Orchestration, Performance Profiling, Networking