Research Engineer - AI/RL Infrastructure
126k – 423k | Sunnyvale, CA | Onsite
Summary
Designs, builds, and operates large-scale ML infrastructure for AI/RL research, including GPU cluster orchestration, data curation pipelines, and distributed training systems for autonomous driving and robotics.
About the role
Responsibilities
- Design and build training and evaluation infrastructure to support AI research directions, orchestrating massive GPU clusters to process petabytes of multimodal sensor data
- Build robust benchmarking, continuous evaluation, and regression tracking systems to measure model performance across diverse, long-tail real-world driving distributions
- Develop large-scale data sampling, dataset generation, and advanced data curation pipelines, leveraging state-of-the-art AI models to power a closed-loop data flywheel
- Enable high-throughput distributed training across heterogeneous cloud environments, focusing on reliability, efficiency, and cost-aware scaling
- Collaborate closely with AI research, autonomy, and platform teams to translate cutting-edge research into production-ready systems
Requirements
- Experience building and operating production-grade software systems across the full machine learning lifecycle, including training, evaluation, data, and deployment
- Well-formed opinions on how to build a company-wide platform for ML training, evaluation, and deployment
- Experience with performance engineering and compute acceleration for large-scale ML training, including profiling, bottleneck analysis, and optimization
- Strong systems-level debugging skills to diagnose and resolve issues in large-scale distributed training, spanning model code, data pipelines, runtimes, and cluster infrastructure
- Deep familiarity with the open-source ML and systems ecosystem, with judgment on when to adopt open source versus build in-house
- Hands-on technical experience with PyTorch, CUDA, Ray, Flyte, and Kubernetes
Nice to Have
- Industry experience in relevant domains (self-driving applications preferred)
Compensation
Base salary range: $126,000 - $423,000 USD annually, plus equity and benefits.
Skills
PyTorch, CUDA, Ray, Flyte, Kubernetes, GPU clusters, distributed training, ML infrastructure, data pipelines, performance engineering