
Principal Systems Software Engineer

Leads architecture of next-generation AI infrastructure, unifying BMaaS, IaaS, and CaaS with focus on high-performance I/O paths, kernel optimizations, and GPU workloads. Requires 12+ years hyperscale experience, deep Linux/virtualization expertise, and hardware-software co-design skills.

260k – 340k | San Francisco, CA | Sunnyvale, CA | DevOps / SRE | Onsite | 12+ YOE

About the role

What You’ll Be Working On

Unifying Infrastructure Pillars

  • Bare-Metal-as-a-Service (BMaaS): Architect systems that deliver raw GPU throughput via ultra-low-latency InfiniBand/RDMA fabrics for massive-scale training.
  • Intelligent IaaS: Design highly optimized, thin virtualization layers using KVM or custom micro-VMs to provide enterprise-grade isolation without the "virtualization tax."
  • Elastic CaaS: Build a high-performance container substrate (utilizing Kubernetes or Slurm) that allows AI workloads to burst and scale across heterogeneous GPU nodes.

Mastering the I/O Path

  • Lead the architectural design of our internal cloud fabric, drawing on experience from top-tier hyperscalers to drive the technical roadmap for SR-IOV, RDMA, and virtualized GPU scheduling.

Advanced R&D Leadership

  • Lead elite workstreams to prototype and productionize novel methods for managing memory, networking, and compute that don't yet exist in standard cloud distributions.

Technical Strategy & Documentation

  • Draft white papers and RFCs that define the next two years of Crusoe’s compute and networking stack.

High-Level Debugging

  • Work alongside Staff and Senior engineers to resolve complex race conditions in the I/O path and optimize kernel-level memory pinning for GPU clusters.

Industry Influence

  • Represent Crusoe in open-source communities and industry forums to influence the global direction of cloud-native AI infrastructure.

What You’ll Bring to the Team

  • Hyperscale Provenance: 12+ years of experience designing and shipping core infrastructure at a major hyperscaler (e.g., OCI, AWS, Azure, GCP) or a specialized HPC cloud.
  • Deep Systems Authority: Authoritative knowledge of the Linux kernel, virtualization internals (KVM, QEMU, Firecracker), and high-performance networking (RoCE v2, InfiniBand).
  • Hardware-Software Co-Design: Proven ability to design software that maximizes the performance of NVIDIA/AMD GPUs and high-speed NICs.
  • R&D Leadership: Experience leading cross-functional teams through high-ambiguity projects and delivering production-ready, mission-critical systems.
  • Industry Contributions: A portfolio of significant contributions to the field, which may include patents, major open-source contributions, or published research in distributed systems.
  • Communication Mastery: The rare ability to explain the nuances of memory-mapped I/O to an engineer and the business value of a new fabric architecture to the Board.

Mandatory Education: A Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or a related analytical field (or equivalent professional experience).

Bonus Points:

  • Patent Holder: Possession of patents related to network virtualization, GPU scheduling, or distributed file systems.
  • Open Source Leadership: Maintainer status or significant contributions to the Linux Kernel, Kubernetes, or specialized HPC projects.
  • AI/ML Workload Expertise: Direct experience optimizing infrastructure for Large Language Model (LLM) training and inference at scale.

Compensation

Compensation Range: $260,000 - $340,000 + Significant Equity & Bonus. Compensation is determined by the applicant's depth of expertise, previous impact at scale, and alignment with our architectural goals.

Skills

Linux Kernel, KVM, QEMU, Firecracker, InfiniBand, RDMA, RoCE v2, SR-IOV, Kubernetes, Slurm, NVIDIA GPUs, AMD GPUs
