Staff Software Engineer, AI Runtime
Staff Software Engineer building and scaling AI Runtime (AIR), Databricks' managed large-scale GPU training platform. The role focuses on distributed training performance, scheduling, fault tolerance, and developer experience across fleets of thousands of accelerators.
Impact
- Drive the architecture and evolution of the AIR managed GPU training platform, delivering scalable, high-throughput, and resilient training across fleets that span thousands of accelerators.
- Solve the hardest problems in large-scale training, including multi-node orchestration, distributed parallelism strategies, GPU scheduling and dynamic routing, high-throughput data loading, and checkpoint and restore for very long-running jobs.
- Push GPU efficiency and training performance, raising model FLOPs utilization (MFU) and end-to-end throughput while lowering cost per training run across diverse model architectures and hardware generations.
- Build the resilience and observability foundations that keep multi-node jobs healthy, detecting and recovering from hardware and software failures with minimal disruption to customers.
- Partner with product, research, and platform teams to shape the APIs, CLI, and developer experience that make it easy to launch, monitor, and debug production training jobs.
- Lead end-to-end engineering efforts, from design through production rollout, holding a high bar for performance, correctness, and reliability.
- Make direct, high-impact contributions to the core systems behind AIR, and help bring up support for the latest accelerators and new regions as the fleet grows.
- Champion engineering excellence, mentor other engineers through design reviews and technical discussions, and help shape Databricks' long-term technical direction in AI training infrastructure.
Requirements
- 10+ years of experience building and operating large-scale distributed systems, with significant depth in GPU training infrastructure, high-performance computing, or ML systems.
- Hands-on experience with distributed training frameworks (such as PyTorch FSDP, DeepSpeed, or Megatron) and the parallelism strategies (data, tensor, pipeline, and sequence parallelism) used to train large models.
- Strong understanding of training resilience patterns, including checkpointing, failure detection, and automatic recovery for long-running, multi-node jobs.
- Solid grasp of GPU performance fundamentals, including accelerator architecture, high-speed interconnects (such as NVLink and InfiniBand or RoCE), collective communication, and the bottlenecks that govern training throughput and utilization.
- Experience building and operating managed, multi-tenant platform products in the cloud, with clear SLAs and SLOs for availability, performance, and reliability.
- Strong foundation in algorithms, data structures, and system design as applied to performance-sensitive, large-scale distributed systems.
- Proven ability to deliver technically complex, high-impact initiatives that create clear customer or business value.
- Strong communication skills and the ability to collaborate across product, research, and infrastructure teams in a fast-moving environment.
- Strategic, product-oriented mindset with the ability to align technical execution to a long-term vision, and a passion for mentoring engineers and fostering technical excellence.
- BS in Computer Science or a related field (MS or PhD preferred).
Senior Software Engineer, AI Runtime
Senior Software Engineer building and scaling Databricks' managed GPU training platform (AI Runtime) for large-scale distributed AI model training. Requires 5+ years in distributed systems and hands-on experience with GPU training frameworks.
Sr. Machine Learning Engineer, Computer Vision
Build and prototype diffusion-based text-to-image generative models (Pinterest Canvas) using large-scale visual-text datasets. Requires 5+ years of industry computer vision experience and an M.S. or Ph.D.
Senior AI/ML Engineer
Senior AI/ML Engineer building transformer and deep learning models on financial and behavioral data to power personalized growth and marketing experiences at Chime. Requires strong production ML experience with PyTorch, AWS, and large-scale data infrastructure.
Senior Software Engineer
Founding Senior Applied Agent Engineer building production LLM agent systems that automate supply chain workflows. Requires 5+ years of engineering experience, including 1+ year shipping LLM/agent features, strong Python/TypeScript skills, and hands-on agent stack experience.