Software Engineer, Agent Infrastructure
230k – 385k | San Francisco, CA; New York, NY | Hybrid
Summary
Builds and scales infrastructure for training and deploying AI agents, including novel container orchestration beyond Kubernetes, FastAPI/gRPC APIs, and Terraform-based systems. Collaborates with researchers on high-scale ML environments and production platforms for OpenAI products.
About the role
Responsibilities
- Push massive compute clusters to their limits as a core contributor to a novel in-house container orchestration platform that scales beyond Kubernetes.
- Develop and maintain FastAPI and gRPC APIs serving as the interface for agentic infrastructure in training and production.
- Use Terraform to stand up and evolve complex infrastructure for research and production.
- Collaborate with research teams to stand up and optimize systems for novel AI training runs and experimental applications.
Requirements
- Deep experience working on large-scale machine learning infrastructure, reasoning about training at scale, identifying bottlenecks, and engineering optimization solutions.
- Ability to build new things from 0 to 1 quickly and scale them 1,000,000x.
- Keen eye for performance and optimization in complex, globally distributed systems.
- Experience with cloud platforms and infrastructure-as-code like Terraform.
- Driven by solving complex, ambiguous problems at the intersection of infrastructure scalability, virtualization efficiency, and agentic capabilities.
- Deep technical expertise in virtualization and containerization technologies (e.g. Kata, Firecracker, gVisor, Sysbox) and passion for optimizing runtime performance.
Skills
Kubernetes, FastAPI, gRPC, Terraform, Kata, Firecracker, gVisor, Sysbox, container orchestration, machine learning infrastructure