Software Engineer, ML Systems & Training Architecture

295k – 380k | San Francisco, CA | Onsite | 3+ YOE | May 22

Summary

Hands-on senior software engineer focused on maintaining and improving ML training infrastructure, debugging training systems, and unblocking researchers on the robotics team.

About the role

Responsibilities

Review, improve, and clean up code across training frameworks and adjacent infrastructure
Identify risky or low-quality changes before they land, and raise the code quality bar without slowing the team down
Debug issues across ML training systems, GPUs, clusters, networking, and related infrastructure
Help researchers and engineers unblock broken training jobs, flaky workflows, and brittle internal tooling
Improve the reliability, maintainability, and usability of the robotics team's training framework
Move quickly on practical engineering problems that directly affect team velocity

Requirements

Strong software engineering fundamentals and excellent code review judgment
Experience with ML systems, training frameworks, GPUs, distributed systems, infrastructure, or similarly complex technical environments
Ability to read and debug unfamiliar codebases quickly, and enjoyment of getting to the root cause
Ability to ship high-quality code with strong velocity and pragmatic judgment
Low-ego, responsive, and motivated by helping researchers and engineers move faster
Preference for being a highly effective hands-on IC over driving broad, process-heavy initiatives
Experience reviewing messy, fast-moving, or AI-generated codebases

Skills

Python, PyTorch, TensorFlow, CUDA, Distributed Systems, GPU Programming, ML Training Frameworks, Code Review, Debugging, Infrastructure
