Staff Software Engineer, CAPE
Architects and builds the intelligence layer for GPU fleet management, including the Virtual Pool Service and Capacity Management Intelligence systems. Requires 10+ years of experience in distributed systems, fluency in Go or a similar language, and a Bachelor's degree in Computer Science or Software Engineering.
What You'll Be Working On
- Building the Virtual Pool Service (VP Service), a physical infrastructure classification layer that serves as the single source of truth for every GPU node's state, pool membership, and transition history across Crusoe's fleet
- Designing and implementing Capacity Management Intelligence (CMI), the automation layer that handles priority-descending allocation, forward availability forecasting, and automated node lifecycle transitions — replacing manual spreadsheet workflows with enforced, auditable, event-driven automation
- Collaborating extensively across teams to architect and implement physical infrastructure management systems, availability platforms, and frameworks that meet end-to-end customer use cases
- Championing reliability, scalability, and security of our systems, designing high-performing, highly available cloud architectures optimized for both performance and cost-effectiveness
- Streamlining cloud deployment, configuration management, and operations using Go, gRPC, NATS event streaming, PostgreSQL (CNPG on Kubernetes), and NetBox as the physical source of truth
- Mentoring fellow engineers and actively contributing to team growth in collaboration with engineering managers
What You'll Bring to the Team
- A Bachelor's degree in Computer Science or Software Engineering
- 10+ years of relevant experience building and operating distributed systems at scale
- Proven experience building reliable, scalable, and secure cloud platforms and running them in production
- Strong distributed systems thinking with the ability to reason about consistency, failure modes, event ordering, and correctness invariants
- Fluency in Go, Rust, Java, or C++; Go is our primary language, but strong engineers from other backgrounds ramp quickly
- A collaborative, platform-minded approach to building robust systems and driving adoption across dev and ops teams
- Ownership mentality with comfort owning a system end to end: design, implementation, testing, ops, and iteration
- Good judgment under ambiguity, with the ability to drive open-ended technical decisions to resolution
- Excellent communication and troubleshooting skills across cross-functional teams
Bonus Points
- Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters
- Prior experience with event-driven architectures or message streaming systems (NATS, Kafka, Kinesis)
- Experience with capacity planning, resource scheduling, or fleet management systems
- Background in GPU compute, AI/ML platform infrastructure, or fast-paced startup environments
- A passion for sustainability, clean energy, and building AI infrastructure that scales responsibly
Compensation
Compensation will be paid in the range of $209,000 - $253,000. Restricted Stock Units are included in all offers.