Research Engineer - AI/RL Infrastructure

126k – 423k · Sunnyvale, CA · Onsite

Summary

Designs, builds, and operates large-scale ML infrastructure for AI/RL research, including GPU cluster orchestration, data curation pipelines, and distributed training systems for autonomous driving and robotics.

About the role

Responsibilities

  • Design and build training and evaluation infrastructure to support AI research directions, orchestrating massive GPU clusters to process petabytes of multimodal sensor data
  • Build robust benchmarking, continuous evaluation, and regression tracking systems to measure model performance across diverse, long-tail real-world driving distributions
  • Develop large-scale data sampling, dataset generation, and advanced data curation pipelines, leveraging state-of-the-art AI models to power a closed-loop data flywheel
  • Enable high-throughput distributed training across heterogeneous cloud environments, focusing on reliability, efficiency, and cost-aware scaling
  • Collaborate closely with AI research, autonomy, and platform teams to translate cutting-edge research into production-ready systems

Requirements

  • Experience building and operating production-grade software systems across the full machine learning lifecycle, including training, evaluation, data, and deployment
  • Well-reasoned opinions on how to build a company-wide platform for ML training, evaluation, and deployment
  • Experience with performance engineering and compute acceleration for large-scale ML training, including profiling, bottleneck analysis, and optimization
  • Strong systems-level debugging skills to diagnose and resolve issues in large-scale distributed training, spanning model code, data pipelines, runtimes, and cluster infrastructure
  • Deep familiarity with the open-source ML and systems ecosystem, with judgment on when to adopt open source versus build in-house
  • Hands-on experience with PyTorch, CUDA, Ray, Flyte, and Kubernetes

Nice to Have

  • Industry experience in relevant areas (self-driving applications preferred)

Compensation

Base salary range: $126,000 - $423,000 USD annually, plus equity and benefits.

Skills

PyTorch, CUDA, Ray, Flyte, Kubernetes, GPU clusters, distributed training, ML infrastructure, data pipelines, performance engineering