Principal Engineer, Compute Fleet Management

264k – 322k | Bellevue, WA | Onsite | Jan 30

Summary

Leads compute fleet management across AWS, Azure, and GCP, optimizing billions of resources for peak performance, 99.99% availability, and 60%+ utilization. Requires deep distributed systems expertise and cross-team leadership for mission-critical infrastructure.

About the role

Outcomes

High Availability: Achieve and maintain 99.99% availability for all batch and serving workloads.
Stellar Efficiency: Drive utilization to 60% or higher, balancing efficiency with tolerance for cloud failures.
Best-in-Class Isolation: Architect and enforce strong security and performance isolation across diverse customer workloads.

Requirements

Leading Transformative Projects: Take ownership of complex, cross-team, cross-layer, and multi-quarter strategic engineering initiatives from concept to execution.
Distributed Systems Mastery: Deep, hands-on experience developing and operating high-scale distributed systems on at least one major public cloud.
Influence Without Authority: Proven ability to drive consensus, establish technical direction, and lead large technical efforts across organizational boundaries.
Execution Discipline: Exceptional strength in planning, tracking project progress, and managing complex cross-organizational dependencies.

The Edge: Highly Desirable Experience

Experience managing and scaling a massive fleet of GPUs for AI/ML workloads.
Experience developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP).

Skills

Distributed Systems, AWS, Azure, GCP, Kubernetes, Fleet Management, GPU, Cloud Infrastructure, High Availability Systems, Resource Optimization
