Research Infrastructure Engineer, Training Systems

295k – 380k | San Francisco, CA | Hybrid | Apr 27

Summary

Builds and maintains infrastructure for large-scale ML model training and experimentation. Designs APIs, improves reliability and performance of training pipelines, and debugs issues across Python, PyTorch, distributed systems, GPUs, and networking.

About the role

Responsibilities

Build and maintain infrastructure for large-scale model training and experimentation.
Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
Improve reliability, debuggability, and performance across training and data pipelines.
Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
Write tests, benchmarks, and diagnostics that catch meaningful regressions.

Skills

PyTorch, Python, Distributed Systems, GPUs, Kubernetes, ML Training Infrastructure, Data Pipelines, Networking, Storage Systems, Performance Optimization
