Principal Engineer, Compute Fleet Management
Leads compute fleet management across AWS, Azure, and GCP, optimizing billions of resources for peak performance, 99.99% availability, and 60%+ utilization. Requires deep distributed systems expertise and cross-team leadership for mission-critical infrastructure.
Outcomes
- High Availability: Achieve and maintain 99.99% availability for all batch and serving workloads.
- Stellar Efficiency: Drive utilization to 60% or higher, balancing efficiency with tolerance for cloud failures.
- Best-in-Class Isolation: Architect and enforce strong security and performance isolation across diverse customer workloads.
Requirements
- Leading Transformative Projects: Take ownership of complex, cross-team, cross-layer, and multi-quarter strategic engineering initiatives from concept to execution.
- Distributed Systems Mastery: Deep, hands-on experience developing and operating high-scale distributed systems on at least one major public cloud.
- Influence Without Authority: Proven ability to drive consensus, establish technical direction, and lead large technical efforts across organizational boundaries.
- Execution Discipline: Exceptional strength in planning, tracking project progress, and managing complex cross-organizational dependencies.
The Edge: Highly Desirable Experience
- Experience managing and scaling a massive fleet of GPUs for AI/ML workloads.
- Experience developing and operating large-scale distributed systems across all major clouds (AWS, Azure, and GCP).
Principal Infrastructure Engineer
Build and operate secure cloud-native and edge platforms for military collaboration software. Requires 8+ years of production infrastructure experience, deep Kubernetes expertise, and the ability to obtain a SECRET clearance.
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
Design and operate multi-petabyte distributed storage systems for large-scale AI training and inference, integrating parallel filesystems and building Kubernetes-native storage platforms.
Director of Platform & Reliability Engineering
The Director of Platform & Reliability Engineering will lead an engineering organization responsible for secure, scalable, and highly reliable products. The role sets the vision for internal platforms, cloud infrastructure, developer enablement, and production operations.
Staff Site Reliability Engineer
Zoox is seeking a Staff Site Reliability Engineer to lead source control, owning the technical strategy and roadmap for its Git-based monorepo. The role involves migrating from GitHub Enterprise to GitHub Cloud, building developer tooling, and partnering with teams across the company to strengthen source control as a strategic asset.