Research Engineer, Infrastructure

San Francisco, CA | DevOps / SRE | Onsite
Summary

Builds and owns distributed training infrastructure, experiment orchestration, data pipelines, and performance optimizations for large-scale AI research on GPU clusters. Requires deep systems expertise, Python/C++/PyTorch proficiency, and ML understanding to accelerate frontier research.

About the role

Responsibilities

  • Build and own distributed training infrastructure for large-scale GPU clusters, including job launchers, checkpointing, recovery, fault tolerance, and monitoring.
  • Own infrastructure for scaling agent rollouts in VM sandboxes at RL training scale.
  • Profile and optimize training throughput across data loading, communication, memory, and compute efficiency to improve step time and model FLOPs utilization (MFU).
  • Design experiment orchestration and tooling to launch, track, and analyze experiments.
  • Build high-throughput, reliable data pipelines for training and evaluation.
  • Debug and resolve training failures across GPUs, networking, numerics, and data.
  • Implement and optimize parallelism strategies: data, tensor, pipeline, and sequence.
  • Anticipate research needs and build scaling infrastructure proactively.

Requirements

  • Deep experience building and operating distributed training systems for large models.
  • Strong systems engineering skills: distributed systems, networking, storage, and performance reasoning.
  • Proficiency in Python and C++; systems-level experience with PyTorch or an equivalent framework.
  • Hands-on experience with GPU performance profiling, memory optimization, and compute efficiency.
  • Experience with parallelism strategies for large model training.
  • Track record building tooling to accelerate research workflows.
  • Strong debugging skills in complex distributed systems.
  • ML knowledge sufficient to engage with researchers.
Skills

Python, C++, PyTorch, Distributed Systems, GPU Programming, Parallelism, Data Pipelines, Experiment Orchestration, Performance Profiling, Networking