
Staff Software Engineer - AI Research Infrastructure

190k – 270k · New York, NY · Onsite · 5+ YOE
Summary

Builds and operates research infrastructure for large-scale AI model training and inference across GPU fleets. Partners with scientists and engineers to create scheduling, orchestration, and dev tooling for efficient experimentation. Requires 5+ years in distributed systems and systems programming.

About the role

Responsibilities

  • Design and implement infrastructure that supports large-scale experiments, data processing, and model training (e.g., HPC clusters, GPU fleets, or cloud-based systems).
  • Enable researchers to go from idea to large-scale experiment in minutes by building powerful abstractions for job submission, scheduling, and monitoring.
  • Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time.
  • Influence the long-term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers.
  • Serve as a technical mentor and force multiplier for other engineers working on compute, infra, and AI systems.

Requirements

  • BS, MS, or PhD in Computer Science or a related field.
  • 5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure.
  • Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers.
  • Proficiency in one or more systems programming languages (C++, Rust, Go, Java, Scala), with the ability to design, implement, and debug complex services.
  • Experience building or significantly contributing to cluster schedulers, resource managers, or large-scale job orchestration systems (e.g., Kubernetes, Slurm, Ray, or custom internal systems).
  • Understanding of modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation).
  • Ability to move fast and be pragmatic while caring about operational excellence, with a track record of driving complex systems from prototype to stable services.
  • Clear communication with researchers and engineers.
Skills
Kubernetes, Slurm, Ray, C++, Rust, Go, Java, Scala, Distributed Systems, GPUs, Cloud Providers, ML Training, Model Parallelism
Similar roles at this salary range
Crusoe

Staff Software Engineer, Developer Experience

Staff-level engineer building developer tools, infrastructure, and automation to accelerate Crusoe engineering productivity. Requires Go, Kubernetes, CI/CD, and strong DevOps/SRE experience.

209k – 253k · San Francisco, CA +1 · DevOps / SRE · On-site · Go · Git
Aurelian

Staff Infrastructure Engineer

Build infrastructure, observability, and developer tooling for a real-time AI platform serving 911 centers. Requires 6+ years of infrastructure, platform, or backend experience and comfort across the full stack.

180k – 240k · Seattle, WA · DevOps / SRE · On-site · Logging · ClickHouse
Stuut

Lead Site Reliability Engineer

Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.

200k – 275k · San Francisco, CA · DevOps / SRE · On-site · AWS · EKS
Huntress

Senior Developer Experience Engineer

Senior Platform Engineer focused on Developer Experience, building tools, automation, CI/CD systems, and AI tooling to improve developer productivity and workflows. Requires 7+ years of cloud experience, containerization experience, and proficiency in Ruby, Go, or Python.

160k – 190k · United States · DevOps / SRE · Remote · Go · Ruby
Crusoe

Staff Network Engineer, Operations

Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.

195k – 235k · San Francisco, CA · DevOps / SRE · On-site · BGP · QoS