Software Engineer, ML Systems & Training Architecture

295k – 380k · San Francisco, CA · Onsite · 3+ YOE
Summary

Hands-on senior software engineer focused on maintaining and improving ML training infrastructure, debugging training systems, and unblocking researchers on the robotics team.

About the role

Responsibilities

  • Review, improve, and clean up code across training frameworks and adjacent infrastructure
  • Identify risky or low-quality changes before they land, and raise the code quality bar without slowing the team down
  • Debug issues across ML training systems, GPUs, clusters, networking, and related infrastructure
  • Help researchers and engineers unblock broken training jobs, flaky workflows, and brittle internal tooling
  • Improve the reliability, maintainability, and usability of the robotics team's training framework
  • Move quickly on practical engineering problems that directly affect team velocity

Requirements

  • Strong software engineering fundamentals and excellent code review judgment
  • Experience with ML systems, training frameworks, GPUs, distributed systems, infrastructure, or similarly complex technical environments
  • Ability to read and debug unfamiliar codebases quickly, and a drive to get to the root cause
  • Ability to ship high-quality code with strong velocity and pragmatic judgment
  • Low-ego, responsive, and motivated by helping researchers and engineers move faster
  • Preference for being a highly effective hands-on IC over driving broad, process-heavy initiatives
  • Experience reviewing messy, fast-moving, or AI-generated codebases
Skills
Python, PyTorch, TensorFlow, CUDA, Distributed Systems, GPU Programming, ML Training Frameworks, Code Review, Debugging, Infrastructure