Director of AI Infrastructure

176k – 265k · Seattle, WA · DevOps / SRE · Onsite · 12+ YOE
Summary

Leads AI infrastructure, including on-prem GPU clusters, hybrid cloud orchestration, storage, and resource allocation, to support frontier AI research. Requires 12+ years of experience in infrastructure, systems engineering, or HPC, including engineering leadership, along with experience in GPU systems, Kubernetes, and distributed storage.

About the role

Responsibilities

  • Oversee the availability and performance of dense on-prem GPU clusters. Partner with hardware vendors and internal teams to ensure physical infrastructure meets demands of frontier model training.
  • Direct strategy for Beaker orchestration platform. Optimize job scheduling for high utilization of on-prem assets and elastic cloud resources (AWS/GCP).
  • Develop and execute a long-term storage roadmap that balances high-throughput performance for active training with cost-effective durability for petascale research data.
  • Act as primary steward of GPU compute budget. Make data-driven decisions on bursting to cloud vs. investing in on-prem capacity.
  • Serve as technical bridge to research teams, ensuring infrastructure accelerates diverse research objectives.

Requirements

  • 12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in leadership managing multi-disciplinary engineering teams.
  • Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent amount of experience.
  • Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
  • Strong background in Kubernetes, Slurm, or similar orchestration frameworks in hybrid-cloud configurations.
  • Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
  • Proficient in Go or Python, with ability to review architecture and code for internal tooling.

Compensation: Base salary $176,400 - $264,600 plus generous bonus plans.

Skills
Linux kernel, Kubernetes, Slurm, NVIDIA GPU, InfiniBand, RoCE, distributed filesystems, WEKA, Ceph, Lustre, Go, Python, AWS, GCP, Beaker