Staff Software Engineer, AI Runtime

190k – 265k | Mountain View, CA | San Francisco, CA | Onsite | 10+ YOE | Jun 8

Summary

Staff Software Engineer building and scaling AI Runtime (AIR), Databricks' managed large-scale GPU training platform. Focus on distributed training performance, scheduling, fault tolerance, and developer experience across thousands of accelerators.

About the role

Impact

Drive the architecture and evolution of AIR's managed GPU training platform, delivering scalable, high-throughput, and resilient training across fleets that span thousands of accelerators.
Solve the hardest problems in large-scale training, including multi-node orchestration, distributed parallelism strategies, GPU scheduling and dynamic routing, high-throughput data loading, and checkpoint and restore for very long-running jobs.
Push GPU efficiency and training performance, raising key metrics (such as model FLOPs utilization and end-to-end throughput) and lowering cost per training run across diverse model architectures and hardware generations.
Build the resilience and observability foundations that keep multi-node jobs healthy, detecting and recovering from hardware and software failures with minimal disruption to customers.
Partner with product, research, and platform teams to shape the APIs, CLI, and developer experience that make it easy to launch, monitor, and debug production training jobs.
Lead end-to-end engineering efforts, from design through production rollout, holding a high bar for performance, correctness, and reliability.
Make direct, high-impact contributions to the core systems behind AIR, and help bring up support for the latest accelerators and new regions as the fleet grows.
Champion engineering excellence, mentor other engineers through design reviews and technical discussions, and help shape Databricks' long-term technical direction in AI training infrastructure.

Requirements

10+ years of experience building and operating large-scale distributed systems, with significant depth in GPU training infrastructure, high-performance computing, or ML systems.
Hands-on experience with distributed training frameworks (such as PyTorch, FSDP, DeepSpeed, or Megatron) and the parallelism strategies (data, tensor, pipeline, and sequence parallelism) used to train large models.
Strong understanding of training resilience patterns, including checkpointing, failure detection, and automatic recovery for long-running, multi-node jobs.
Solid grasp of GPU performance fundamentals, including accelerator architecture, high-speed interconnects (such as NVLink and InfiniBand or RoCE), collective communication, and the bottlenecks that govern training throughput and utilization.
Experience building and operating managed, multi-tenant platform products in the cloud, with clear SLAs and SLOs for availability, performance, and reliability.
Strong foundation in algorithms, data structures, and system design as applied to performance-sensitive, large-scale distributed systems.
Proven ability to deliver technically complex, high-impact initiatives that create clear customer or business value.
Strong communication skills and the ability to collaborate across product, research, and infrastructure teams in a fast-moving environment.
Strategic, product-oriented mindset with the ability to align technical execution to a long-term vision, and a passion for mentoring engineers and fostering technical excellence.
BS in Computer Science or a related field (MS or PhD preferred).

Skills

PyTorch, FSDP, DeepSpeed, Megatron, GPU scheduling, Distributed training, Checkpointing, NVLink, InfiniBand, RoCE, Collective communication, Multi-node orchestration, High-performance computing, ML systems
