Research Engineer - AI/RL Infrastructure

126k – 423k | Sunnyvale, CA | Onsite | Apr 21

Summary

Designs, builds, and operates large-scale ML infrastructure for AI/RL research, including GPU cluster orchestration, data curation pipelines, and distributed training systems for autonomous driving and robotics.

About the role

Responsibilities

Design and build training and evaluation infrastructure to support AI research directions, orchestrating massive GPU clusters to process petabytes of multimodal sensor data
Build robust benchmarking, continuous evaluation, and regression tracking systems to measure model performance across diverse, long-tail real-world driving distributions
Develop large-scale data sampling, dataset generation, and advanced data curation pipelines, leveraging state-of-the-art AI models to power a closed-loop data flywheel
Enable high-throughput distributed training across heterogeneous cloud environments, focusing on reliability, efficiency, and cost-aware scaling
Collaborate closely with AI research, autonomy, and platform teams to translate cutting-edge research into production-ready systems

Requirements

Experience building and operating production-grade software systems across the full machine learning lifecycle, including training, evaluation, data, and deployment
Informed opinions on how to build a company-wide platform for ML training, evaluation, and deployment
Experience with performance engineering and compute acceleration for large-scale ML training, including profiling, bottleneck analysis, and optimization
Strong systems-level debugging skills to diagnose and resolve issues in large-scale distributed training, spanning model code, data pipelines, runtimes, and cluster infrastructure
Deep familiarity with the open-source ML and systems ecosystem, with judgment on when to adopt open source versus build in-house
Technical experience with PyTorch, CUDA, Ray, Flyte, and Kubernetes

Nice to Have

Industry experience in relevant areas (self-driving applications preferred)

Compensation

Base salary range: $126,000 - $423,000 USD annually, plus equity and benefits.

Skills

PyTorch, CUDA, Ray, Flyte, Kubernetes, GPU clusters, distributed training, ML infrastructure, data pipelines, performance engineering
