Director of AI Infrastructure
176k – 265k · Seattle, WA · DevOps / SRE · Onsite · 12+ YOE
Summary
Leads AI infrastructure, including on-prem GPU clusters, hybrid cloud orchestration, storage, and resource allocation, to support frontier AI research. Requires 12+ years of experience, including HPC leadership, GPU systems, Kubernetes, and distributed storage.
About the role
Responsibilities
- Oversee the availability and performance of dense on-prem GPU clusters. Partner with hardware vendors and internal teams to ensure physical infrastructure meets demands of frontier model training.
- Direct strategy for the Beaker orchestration platform. Optimize job scheduling for high utilization of on-prem assets and elastic cloud resources (AWS/GCP).
- Develop and execute a long-term storage roadmap that balances high-throughput performance for active training with cost-effective durability for petascale research data.
- Act as primary steward of GPU compute budget. Make data-driven decisions on bursting to cloud vs. investing in on-prem capacity.
- Serve as technical bridge to research teams, ensuring infrastructure accelerates diverse research objectives.
Requirements
- 12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in leadership managing multi-disciplinary engineering teams.
- Bachelor’s degree in related field; relevant advanced degree may substitute for equivalent experience.
- Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
- Strong background in Kubernetes, Slurm, or similar orchestration frameworks in hybrid-cloud configurations.
- Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
- Proficient in Go or Python, with ability to review architecture and code for internal tooling.
Compensation: Base salary $176,400 - $264,600 plus generous bonus plans.
Skills
Linux kernel, Kubernetes, Slurm, NVIDIA GPU, InfiniBand, RoCE, distributed filesystems, WEKA, Ceph, Lustre, Go, Python, AWS, GCP, Beaker