Staff Software Engineer - AI Research Infrastructure

Builds and operates research infrastructure for large-scale AI model training and inference across GPU fleets. Partners with scientists and engineers to create scheduling, orchestration, and dev tooling for efficient experimentation. Requires 5+ years in distributed systems and systems programming.

190k – 270k · New York, NY · San Francisco, CA · DevOps / SRE · Onsite · 5+ YOE

About the role

Responsibilities

  • Design and implement infrastructure that supports large-scale experiments, data processing, and model training (e.g., HPC clusters, GPU fleets, or cloud-based systems)
  • Enable researchers to go from idea to large-scale experiment in minutes by building powerful abstractions for job submission, scheduling, and monitoring
  • Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time
  • Influence the long-term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers
  • Serve as a technical mentor and force multiplier for other engineers working on compute, infra, and AI systems

Requirements

  • BS/MS or PhD in Computer Science or a related field
  • 5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure
  • Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers
  • Proficient in one or more systems programming languages (C++, Rust, Go, Java, Scala), with the ability to design, implement, and debug complex services
  • Built or significantly contributed to cluster schedulers, resource managers, or large-scale job orchestration systems (Kubernetes, Slurm, Ray, custom internal systems)
  • Understand modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation)
  • Can move fast and be pragmatic while caring about operational excellence; have driven complex systems from prototype to stable services
  • Communicate clearly with researchers and engineers

Skills

Kubernetes, Slurm, Ray, C++, Rust, Go, Java, Scala, Distributed Systems, GPUs, Cloud Providers, ML Training, Model Parallelism, HPC Clusters

Similar roles

DevOps / SRE jobs

Staff Data Platform Engineer

As a Staff Data Platform Engineer, you will own the reliability, stability, and operational health of the data platform infrastructure. This role focuses on systems and infrastructure, enforcing environment promotion discipline, defining SOPs, and collaborating with data scientists and engineers.

190k – 240k · Los Angeles, CA · DevOps / SRE · Hybrid · 7+ YOE · AWS · Bash

Staff Data Platform Engineer

As a Staff Data Platform Engineer, you will be responsible for the reliability, stability, and operational health of the data platform. This role focuses on administering, scaling, hardening, and evolving the platform, with a strong emphasis on operational discipline and SRE principles.

190k – 240k · New York, NY · DevOps / SRE · Hybrid · 7+ YOE · AWS · Bash

Staff Data Platform Engineer

As a Staff Data Platform Engineer, you will own the reliability, stability, and operational health of the data platform infrastructure. This role focuses on systems and infrastructure, ensuring proper deployment, monitoring, maintenance, and promotion across environments.

190k – 240k · San Francisco, CA · DevOps / SRE · Hybrid · 7+ YOE · AWS · Bash

Staff Platform Engineer, Americas

Staff Platform Engineer builds and scales infrastructure, optimizes compilers and databases, implements deployment tools like canary deploys and feature flags, and ensures reliability with SLOs/SLIs on AWS/Kubernetes. Requires strong coding skills in TypeScript/Node.js and handling diverse infra challenges end-to-end.

190k – 275k · San Francisco, CA +10 · DevOps / SRE · Remote · AWS · SQL