Software Engineer, Compute Infra

180k – 440k | Palo Alto, CA | Onsite | Mar 6

Summary

Designs, builds, and operates massive-scale compute clusters and custom container orchestration platforms for AI training and inference at exascale. Requires deep expertise in virtualization, containerization, systems programming in C++/Rust, and Linux kernel internals.

About the role

Responsibilities

Build and manage massive-scale clusters to host, persist, train, and serve AI workloads with extreme reliability and performance.
Design, develop, and extend an in-house container orchestration platform that achieves superior scalability, isolation, resource efficiency, and fault-tolerance compared to off-the-shelf solutions.
Collaborate with research teams to architect and optimize compute clusters specifically for large-scale training runs, inference services, and real-time applications.
Profile, debug, and resolve complex system-level performance bottlenecks, resource contention, scheduling issues, and reliability problems across the full stack.
Own end-to-end infrastructure initiatives with first-principles design, rigorous testing, automation, and continuous optimization to support frontier AI compute demands.

Required Qualifications

Deep expertise in virtualization technologies (KVM, Xen, QEMU) and advanced containerization/sandboxing (Kata, Firecracker, gVisor, Sysbox, or equivalent).
Strong proficiency in systems programming languages such as C/C++ and Rust.
Proven track record of profiling, debugging, and optimizing complex system-level performance issues, with deep knowledge of Linux kernel internals, resource management, scheduling, memory management, and low-level engineering.
Hands-on experience building or significantly enhancing distributed compute platforms, orchestration systems, or high-performance infrastructure at scale.
Ability to thrive in a fast-paced, meritocratic environment with full ownership, high standards, and a focus on rigorous execution.

Preferred Qualifications

Experience in Linux kernel development, hypervisor extensions, or low-level system programming for compute-intensive workloads.
Proven track record of operating or designing large-scale AI training/inference clusters (GPU/TPU scale).
Experience with custom runtimes, isolation techniques, or bespoke platforms for specialized AI compute.
Familiarity with performance tools, tracing, and debugging in production distributed environments.

Compensation and Benefits

Annual Salary Range: $180,000 - $440,000 USD

Base salary is just one part of total rewards, which also include equity; comprehensive medical, vision, and dental coverage; access to a 401(k) retirement plan; short- and long-term disability insurance; life insurance; and various other discounts and perks.

Skills

Kubernetes, KVM, Xen, QEMU, Kata, Firecracker, gVisor, Sysbox, C++, Rust, Linux kernel
