Responsibilities:
- Own and drive critical programs across the compute lifecycle, coordinating execution across multiple engineering, research, and operations teams
- Build and maintain operational visibility into the compute fleet, ensuring the organization has a clear picture of supply, demand, utilization, and health
- Lead cross-functional coordination for compute transitions: bringing new capacity online, migrating workloads, and managing decommissions across cloud providers and hardware platforms
- Partner with engineering and research leadership to navigate competing priorities and drive alignment on how compute resources are planned, allocated, and used
- Identify and close operational gaps across the compute pipeline, whether through new tooling, improved processes, or better cross-team communication
- Own trade-off discussions between utilization, cost, latency, and reliability, synthesizing inputs from technical and business stakeholders and communicating decisions to leadership
- Develop and improve the processes and frameworks the team uses to plan, track, and execute compute programs at increasing scale and complexity
You may be a good fit if you:
- Have 7+ years of technical program management experience in infrastructure, platform engineering, or compute-intensive environments
- Have led complex, cross-functional programs involving multiple engineering teams with competing priorities and ambiguous requirements
- Have experience working with research or ML teams and translating their needs into operational plans and technical requirements
- Are comfortable diving deep into technical details (cloud infrastructure, cluster management, job scheduling, resource orchestration) while maintaining program-level visibility
- Thrive in ambiguous, fast-moving environments where you need to define scope and build processes from the ground up
- Have strong communication skills and can engage credibly with engineers, researchers, finance, and executive leadership
- Have a track record of building trust with engineering teams and driving changes through influence rather than authority
Strong candidates may also have:
- Experience managing compute capacity across multiple cloud providers (AWS, GCP, Azure) or hybrid cloud/on-premise environments
- Familiarity with job scheduling, resource orchestration, or workload management systems (Kubernetes, Slurm, Borg, YARN, or custom schedulers)
- Experience with GPU or accelerator infrastructure, including the unique challenges of large-scale ML training and inference workloads
- Experience building or improving observability for infrastructure systems: dashboards, alerting, efficiency metrics, or cost attribution
- Capacity planning experience, including demand forecasting, cost modeling, or hardware lifecycle management
- Experience scaling through hypergrowth in AI/ML, HPC, or large-scale cloud environments
Annual Salary: $365,000 — $435,000 USD