Responsibilities

Build and own distributed training infrastructure for large-scale GPU clusters, including job launchers, checkpointing, recovery, fault tolerance, and monitoring.
Own infrastructure for scaling agent rollouts in VM sandboxes at RL training scale.
Profile and optimize training throughput across data loading, communication, memory, and compute efficiency to improve step time and MFU (model FLOPs utilization).
Design experiment orchestration and tooling to launch, track, and analyze experiments.
Build high-throughput, reliable data pipelines for training and evaluation.
Debug and resolve training failures across GPUs, networking, numerics, and data.
Implement and optimize parallelism strategies: data, tensor, pipeline, and sequence.
Anticipate research needs and build scaling infrastructure proactively.

Requirements

Deep experience building and operating distributed training systems for large models.
Strong systems engineering: distributed systems, networking, storage, performance reasoning.
Proficiency in Python and C++; systems-level experience with PyTorch or an equivalent framework.
Hands-on experience with GPU performance profiling, memory optimization, and compute efficiency.
Experience with parallelism strategies for large model training.
Track record building tooling to accelerate research workflows.
Strong debugging skills in complex distributed systems.
Enough ML knowledge to engage productively with researchers.