Principal Engineer, Compute Fleet Management

264k – 322k · Bellevue, WA · Onsite

Summary

Leads compute fleet management across AWS, Azure, and GCP, optimizing billions of resources for peak performance, 99.99% availability, and 60%+ utilization. Requires deep distributed systems expertise and cross-team leadership for mission-critical infrastructure.

About the role

Outcomes

  • High Availability: Achieve and maintain 99.99% availability for all batch and serving workloads.
  • Stellar Efficiency: Drive utilization to 60% or higher, balancing efficiency with tolerance for cloud failures.
  • Best-in-Class Isolation: Architect and enforce strong security and performance isolation across diverse customer workloads.

Requirements

  • Leading Transformative Projects: Take ownership of complex, cross-team, cross-layer, and multi-quarter strategic engineering initiatives from concept to execution.
  • Distributed Systems Mastery: Deep, hands-on experience developing and operating high-scale distributed systems on at least one major public cloud.
  • Influence Without Authority: Proven ability to drive consensus, establish technical direction, and lead large technical efforts across organizational boundaries.
  • Execution Discipline: Exceptional strength in planning, tracking project progress, and managing complex cross-organizational dependencies.

The Edge: Highly Desirable Experience

  • Experience managing and scaling a massive fleet of GPUs for AI/ML workloads.
  • Experience developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP).

Skills

Distributed Systems, AWS, Azure, GCP, Kubernetes, Fleet Management, GPU, Cloud Infrastructure, High Availability Systems, Resource Optimization