Production Engineer, IaaS
Own observability, API surface, and control plane for a hyperscale AI compute fleet. Build production-grade data pipelines, stateful APIs, and Kubernetes infrastructure that other teams depend on.
Builds and operates reliable, scalable AI infrastructure, covering observability, SLOs, incident response, automation, and performance tuning for ultra-low-latency serverless compute. Requires 3+ years of SRE/DevOps experience with cloud, Kubernetes, programming (Go/Rust/Python), and observability tools.
Required skills:
Preferred:
Bonus:
Own end-to-end health, repair automation, and qualification of a hyperscale GPU/TPU compute fleet. Build metrics pipelines, firmware tooling, and self-healing repair workflows across Kubernetes and bare metal.
Builds and scales core infrastructure including ML training/serving, Kubernetes clusters, and low-latency voice/audio pipelines. Requires 3+ years in infrastructure/ML systems, hands-on reliability engineering, and Kubernetes expertise.
Designs and operates large-scale infrastructure for secure, scalable AI agent runtimes, untrusted code execution, and multi-cloud deployments. Requires strong expertise in distributed systems, containers, Kubernetes, and security.