Staff Infrastructure Engineer, Cluster Infrastructure
San Francisco, CA | New York, NY | Seattle, WA | Hybrid | 8+ YOE
Summary
Leads technical strategy for agent-driven cluster lifecycle management, provisioning, and scalability across cloud providers and datacenters. Requires deep expertise in distributed systems, Kubernetes, IaC tools such as Terraform, and a systems language (Rust, Go, or Python); 8+ years of experience preferred.
About the role
Key Responsibilities
- Own the technical strategy and roadmap for agent-driven cluster lifecycle management: provisioning, updates, and decommissioning
- Partner across teams to ensure new compute capacity is ingested on time
- Align with partner teams on physical build-out and leverage cloud solutions to deliver high-bandwidth inter-cluster connectivity
- Collaborate with security owners to ensure clusters are provisioned secure-by-default
- Define and drive strategy on cluster scalability, homogeneity and fault tolerance
- Work closely with cloud providers and internal research, inference and product teams to shape long-term compute, data, and infrastructure strategy
- Establish and evolve operational-excellence practices: incident response, postmortem culture and on-call health
- Support the growth of engineers around you through technical mentorship and coaching
Minimum Qualifications
- Deep expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure)
- Strong proficiency in at least one systems language (Rust, Go, or Python) and in IaC with Terraform
- Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
- Ability to build alignment across senior stakeholders and communicate effectively at all levels
Preferred Qualifications
- 8+ years of software engineering experience, including time as a technical lead setting direction for a team
- Experience operating compute infrastructure at hyperscale (100+ clusters, 10K+ nodes)
- Depth in one or more of: Kubernetes internals, cluster provisioning and management systems, cluster orchestration systems (Mesos, Borg-like)
- Experience with cloud networking: VPC design and peering, Shared VPC/Transit Gateway, Cloud Interconnect/Direct Connect, Cloud NAT, cross-cloud private connectivity, BGP and route control, edge load balancing and DDoS mitigation (Cloud Armor / AWS Shield)
- Experience with cluster and host networking: CNI (Cilium), eBPF, NetworkPolicy, multi-NIC, sFlow, service mesh (Istio/Envoy/Linkerd, mTLS)
- Experience with cluster security: pod security standards and admission control, RBAC and least-privilege IAM, node and container hardening, supply-chain/image provenance
- Deep experience with infrastructure-as-code (Terraform, Atlantis) and workflow orchestration (Temporal, Argo Workflows)
- Skill in quickly understanding systems design tradeoffs and keeping track of rapidly evolving software systems
Skills
Kubernetes, Terraform, Rust, Go, Python, AWS, GCP, Azure, Cilium, eBPF, Istio, Envoy, Temporal, Argo Workflows, BGP