Software Engineer, Compute Infra
Designs, builds, and operates massive-scale compute clusters and custom container orchestration platforms for AI training and inference at exascale. Requires deep expertise in virtualization, containerization, systems programming in C++/Rust, and Linux kernel internals.
Responsibilities
- Build and manage massive-scale clusters to host, persist, train, and serve AI workloads with extreme reliability and performance.
- Design, develop, and extend an in-house container orchestration platform that achieves superior scalability, isolation, resource efficiency, and fault-tolerance compared to off-the-shelf solutions.
- Collaborate with research teams to architect and optimize compute clusters specifically for large-scale training runs, inference services, and real-time applications.
- Profile, debug, and resolve complex system-level performance bottlenecks, resource contention, scheduling issues, and reliability problems across the full stack.
- Own end-to-end infrastructure initiatives with first-principles design, rigorous testing, automation, and continuous optimization to support frontier AI compute demands.
Required Qualifications
- Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent).
- Strong proficiency in systems programming languages such as C/C++ and Rust.
- Proven track record of profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering.
- Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale.
- Ability to thrive in a fast-paced, meritocratic environment with full ownership, high standards, and a focus on rigorous execution.
Preferred Qualifications
- Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads.
- Proven track record of operating or designing large-scale AI training/inference clusters (GPU/TPU scale).
- Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute.
- Familiarity with performance tools, tracing, and debugging in production distributed environments.
Compensation and Benefits
Annual Salary Range: $180,000 - $440,000 USD
Base salary is just one part of total rewards, which also include equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.
Senior Infrastructure Engineer
Build analytics infrastructure, observability tooling, and developer platforms to support real-time AI agents for 911 centers. Requires 4+ years of infrastructure/platform/backend experience and comfort across the full stack.
Lead Site Reliability Engineer
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
Senior Developer Experience Engineer
Senior Platform Engineer focused on Developer Experience, building tools, automation, CI/CD systems, and AI tooling to improve developer productivity and workflows. Requires 7+ years of cloud experience, experience with containerization, and proficiency in Ruby, Go, or Python.