# Staff Software Engineer - AI Research Infrastructure
**Company:** [Databricks](https://hotfix.jobs/companies/databricks)
**Location:** New York, NY; San Francisco, CA
**Salary:** $190K-$270K
**Experience:** 5+ years
**Skills:** Kubernetes, Slurm, Ray, C++, Rust, Go, Java, Scala, Distributed Systems, GPUs, Cloud Providers, ML Training, Model Parallelism, HPC Clusters
**Posted:** 2026-05-01
> Builds and operates research infrastructure for large-scale AI model training and inference across GPU fleets. Partners with scientists and engineers to create scheduling, orchestration, and dev tooling for efficient experimentation. Requires 5+ years in distributed systems and systems programming.
## Job Description
## Responsibilities

- Design and implement infrastructure that supports large-scale experiments, data processing, and model training (e.g., HPC clusters, GPU fleets, or cloud-based systems)
- Enable researchers to go from idea to large-scale experiment in minutes by building powerful abstractions for job submission, scheduling, and monitoring
- Create tooling that improves research developer productivity, such as experiment management systems, CI/testing infrastructure for research code, and workflows that reduce iteration time
- Influence the long-term roadmap for research computation, shaping how Databricks AI Research trains, evaluates, and ships models to customers
- Serve as a technical mentor and force multiplier for other engineers working on compute, infra, and AI systems

## Requirements

- BS/MS or PhD in Computer Science or a related field
- 5+ years of software engineering experience, including substantial time working on large-scale distributed systems or infrastructure
- Deep experience with building and operating distributed systems, data pipelines, or large-scale backend services, ideally involving GPUs, clusters, or major cloud providers
- Proficient in one or more systems programming languages (**C++**, **Rust**, **Go**, **Java**, **Scala**) and able to design, implement, and debug complex services
- Have built or significantly contributed to cluster schedulers, resource managers, or large-scale job orchestration systems (**Kubernetes**, **Slurm**, **Ray**, custom internal systems)
- Understand modern ML training and inference workflows (e.g., distributed training, model parallelism, fine-tuning, evaluation)
- Can move fast and be pragmatic while caring about operational excellence; have driven complex systems from prototype to stable services
- Communicate clearly with researchers and engineers

**Apply:** https://hotfix.jobs/jobs/staff-software-engineer-ai-research-infrastructure-at-databricks-e32d4d4c-5b8d-4fee-a20c-fa4a30f48ae5
**Canonical:** https://hotfix.jobs/jobs/staff-software-engineer-ai-research-infrastructure-at-databricks-e32d4d4c-5b8d-4fee-a20c-fa4a30f48ae5