Director of AI Infrastructure

176k – 265k · Seattle, WA · DevOps / SRE · Onsite · 12+ YOE · Apr 22

Summary

Leads AI infrastructure, including on-prem GPU clusters, hybrid cloud orchestration, storage, and resource allocation, to support frontier AI research. Requires 12+ years of experience in infrastructure, systems engineering, or HPC, including engineering leadership, plus experience with GPU systems, Kubernetes, and distributed storage.

About the role

Responsibilities

Oversee the availability and performance of dense on-prem GPU clusters. Partner with hardware vendors and internal teams to ensure physical infrastructure meets demands of frontier model training.
Direct strategy for Beaker orchestration platform. Optimize job scheduling for high utilization of on-prem assets and elastic cloud resources (AWS/GCP).
Develop and execute a long-term storage roadmap that balances high-throughput performance for active training with cost-effective durability for petascale research data.
Act as primary steward of GPU compute budget. Make data-driven decisions on bursting to cloud vs. investing in on-prem capacity.
Serve as technical bridge to research teams, ensuring infrastructure accelerates diverse research objectives.

Requirements

12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in leadership managing multi-disciplinary engineering teams.
Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent amount of experience.
Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
Strong background in Kubernetes, Slurm, or similar orchestration frameworks in hybrid-cloud configurations.
Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
Proficient in Go or Python, with ability to review architecture and code for internal tooling.

Compensation: Base salary $176,400 - $264,600 plus generous bonus plans.

Skills

Linux kernel, Kubernetes, Slurm, NVIDIA GPU, InfiniBand, RoCE, distributed filesystems, WEKA, Ceph, Lustre, Go, Python, AWS, GCP, Beaker

