Staff Infrastructure Engineer, Cluster Infrastructure

San Francisco, CA | New York, NY | Seattle, WA | Hybrid | 8+ YOE
Summary

Leads technical strategy for agent-driven cluster lifecycle management, provisioning, and scalability across cloud providers and datacenters. Requires deep expertise in distributed systems, Kubernetes, IaC tools such as Terraform, and at least one systems language (Rust, Go, or Python); 8+ years of experience preferred.

About the role

Key Responsibilities

  • Own the technical strategy and roadmap for agent-driven cluster lifecycle management: provisioning, updates and decommissioning
  • Partner across teams to ensure new compute capacity is ingested on time
  • Align with partner teams on physical build-out and leverage cloud solutions to deliver high-bandwidth inter-cluster connectivity
  • Collaborate with security owners to ensure clusters are provisioned secure-by-default
  • Define and drive strategy on cluster scalability, homogeneity and fault tolerance
  • Work closely with cloud providers and internal research, inference and product teams to shape long-term compute, data, and infrastructure strategy
  • Establish and evolve operational-excellence practices: incident response, postmortem culture and on-call health
  • Support the growth of engineers around you through technical mentorship and coaching

Minimum Qualifications

  • Deep expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure)
  • Strong proficiency in at least one systems language (Rust, Go, or Python) and in IaC with Terraform
  • Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
  • Ability to build alignment across senior stakeholders and communicate effectively at all levels

Preferred Qualifications

  • 8+ years of software engineering experience, including time as a technical lead setting direction for a team
  • Experience operating compute infrastructure at hyperscale (100+ clusters, 10K+ nodes)
  • Depth in one or more of: Kubernetes internals, cluster provisioning and management systems, cluster orchestration systems (Mesos, Borg-like)
  • Experience with cloud networking: VPC design and peering, Shared VPC/Transit Gateway, Cloud Interconnect/Direct Connect, Cloud NAT, cross-cloud private connectivity, BGP and route control, edge load balancing and DDoS mitigation (Cloud Armor / AWS Shield)
  • Experience with cluster and host networking: CNI (Cilium), eBPF, NetworkPolicy, multi-NIC, sFlow, service mesh (Istio/Envoy/Linkerd, mTLS)
  • Experience with cluster security: pod security standards and admission control, RBAC and least-privilege IAM, node and container hardening, supply-chain/image provenance
  • Deep experience with infrastructure-as-code (Terraform, Atlantis) and workflow orchestration (Temporal, Argo Workflows)
  • Skill in quickly understanding systems design tradeoffs and keeping track of rapidly evolving software systems
Skills
Kubernetes, Terraform, Rust, Go, Python, AWS, GCP, Azure, Cilium, eBPF, Istio, Envoy, Temporal, Argo Workflows, BGP