Staff Software Engineer - AI Research Infrastructure

Builds and operates research infrastructure for large-scale AI model training and inference across GPU fleets. Partners with scientists and engineers to create scheduling, orchestration, and dev tooling for efficient experimentation. Requires 5+ years in distributed systems and systems programming.

190k – 270k | New York, NY | San Francisco, CA | DevOps / SRE | Onsite | 5+ YOE

About the role

Responsibilities

Design and implement infrastructure that supports large-scale experiments, data processing, and model training (e.g., HPC clusters, GPU fleets, or cloud-based systems)
Enable researchers to go from idea to large-scale experiment in minutes by building powerful abstractions for job submission, scheduling, and monitoring
Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time
Influence the long-term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers
Serve as a technical mentor and force multiplier for other engineers working on compute, infra, and AI systems

Requirements

BS, MS, or PhD in Computer Science or a related field
5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure
Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers
Proficient in one or more systems programming languages (C++, Rust, Go, Java, Scala), with the ability to design, implement, and debug complex services
Have built or significantly contributed to cluster schedulers, resource managers, or large-scale job orchestration systems (e.g., Kubernetes, Slurm, Ray, or custom internal systems)
Understand modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation)
Can move fast and be pragmatic while caring about operational excellence; have driven complex systems from prototype to stable services
Communicate clearly with researchers and engineers

Skills

Kubernetes, Slurm, Ray, C++, Rust, Go, Java, Scala, Distributed Systems, GPUs, Cloud Providers, ML Training, Model Parallelism, HPC Clusters
