# AI Infrastructure Engineer - Training Platform
**Company:** [Scale AI](https://hotfix.jobs/companies/scale-ai)
**Location:** San Francisco, CA; Seattle, WA; New York, NY
**Salary:** $216K-$270K
**Experience:** 5+ years
**Skills:** Kubernetes, Python, Go, Rust, C++, AWS, GCP, Terraform, Ray, Kueue, DeepSpeed, FSDP, CUDA, NCCL, PyTorch
**Posted:** 2026-04-28
> Builds and scales high-performance training platforms for large-scale GPU clusters, architecting orchestration, scheduling, and observability for ML workloads. Requires 5+ years in infrastructure engineering with ML focus, Kubernetes expertise, and systems programming.
## Job Description
## Responsibilities

- Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
- Design and implement scheduling primitives to optimize the lifecycle of training jobs.
- Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
- Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
- Work closely with Finance and Procurement teams to drive our capacity planning process.
- Participate in our team’s on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment.

## Requirements

- 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
- Strong programming skills in one or more languages (e.g. **Python**, **Go**, **Rust**, **C++**).
- Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
- Experience with distributed training infrastructure, such as **EFA**, **InfiniBand**, and topology-aware scheduling.
- Experience with distributed storage systems (e.g. **Lustre**, **S3**) as they relate to training throughput.
- Expert-level knowledge of **Kubernetes** internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
- Familiarity with cloud infrastructure (**AWS**, **GCP**) and infrastructure as code (e.g., **Terraform**).
- Proven ability to solve complex problems and work independently in fast-moving environments.

## Nice to Haves

- Experience with distributed training techniques such as **DeepSpeed**, **FSDP**, etc.
- Experience with the **NVIDIA** software and hardware stack (**CUDA**, **NCCL**).
- Experience with **PyTorch**.
- Familiarity with post-training algorithms such as GRPO, and with **Reinforcement Learning**.

## Compensation

- Base salary range: $216,000-$270,000 USD (varies by location, skills, experience).
- Equity and comprehensive benefits (health, dental, vision, retirement, PTO, learning stipend).

**Apply:** https://hotfix.jobs/jobs/ai-infrastructure-engineer-training-platform-at-scale-ai-0ded97de-4a79-4f8e-9ad3-a8fb2c3326fc
**Canonical:** https://hotfix.jobs/jobs/ai-infrastructure-engineer-training-platform-at-scale-ai-0ded97de-4a79-4f8e-9ad3-a8fb2c3326fc