What You’ll Be Working On
Unifying Infrastructure Pillars
- Bare-Metal-as-a-Service (BMaaS): Architect systems that deliver raw GPU throughput via ultra-low-latency InfiniBand/RDMA fabrics for massive-scale training.
- Intelligent IaaS: Design highly optimized, thin virtualization layers using KVM or custom micro-VMs to provide enterprise-grade isolation without the "virtualization tax."
- Elastic CaaS: Build a high-performance container substrate (using Kubernetes or Slurm) that lets AI workloads burst and scale across heterogeneous GPU nodes.
Mastering the I/O Path
- Lead the architectural design of our internal cloud fabric, drawing on experience from top-tier hyperscalers to drive the technical roadmap for SR-IOV, RDMA, and virtualized GPU scheduling.
Advanced R&D Leadership
- Lead workstreams that prototype and productionize novel methods, not yet available in standard cloud distributions, for managing memory, networking, and compute.
Technical Strategy & Documentation
- Draft white papers and RFCs that define the next two years of Crusoe’s compute and networking stack.
High-Level Debugging
- Work alongside Staff and Senior engineers to resolve complex race conditions in the I/O path and optimize kernel-level memory pinning for GPU clusters.
Industry Influence
- Represent Crusoe in open-source communities and industry forums to influence the global direction of cloud-native AI infrastructure.
What You’ll Bring to the Team
- Hyperscale Provenance: 12+ years of experience designing and shipping core infrastructure at a major hyperscaler (e.g., OCI, AWS, Azure, GCP) or a specialized HPC cloud.
- Deep Systems Authority: Authoritative knowledge of the Linux kernel, virtualization internals (KVM, QEMU, Firecracker), and high-performance networking (RoCE v2, InfiniBand).
- Hardware-Software Co-Design: Proven ability to design software that maximizes the performance of NVIDIA/AMD GPUs and high-speed NICs.
- R&D Leadership: Experience leading cross-functional teams through high-ambiguity projects and delivering production-ready, mission-critical systems.
- Industry Contributions: A portfolio of significant contributions to the field, which may include patents, major open-source contributions, or published research in distributed systems.
- Communication Mastery: The rare ability to explain the nuances of memory-mapped I/O to an engineer and the business value of a new fabric architecture to the Board.
Mandatory Education: A Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related analytical field (or equivalent professional experience).
Bonus Points:
- Patent Holder: Possession of patents related to network virtualization, GPU scheduling, or distributed file systems.
- Open Source Leadership: Maintainer status or significant contributions to the Linux Kernel, Kubernetes, or specialized HPC projects.
- AI/ML Workload Expertise: Direct experience optimizing infrastructure for Large Language Model (LLM) training and inference at scale.
Compensation
Compensation Range: $260,000 - $340,000 + Significant Equity & Bonus. Compensation is determined by the applicant's depth of expertise, previous impact at scale, and alignment with our architectural goals.