Staff Software Engineer, CAPE

209k – 253k | San Francisco, CA | Sunnyvale, CA | Onsite | 10+ YOE | Apr 15

Summary

Architects and builds the intelligence layer for GPU fleet management, including the Virtual Pool Service and Capacity Management Intelligence systems. Requires 10+ years of experience with distributed systems, fluency in Go or a similar language, and a Bachelor's degree in Computer Science or Software Engineering.

About the role

What You'll Be Working On

Building the Virtual Pool Service (VP Service), a physical infrastructure classification layer that serves as the single source of truth for every GPU node's state, pool membership, and transition history across Crusoe's fleet
Designing and implementing Capacity Management Intelligence (CMI), the automation layer that handles priority-descending allocation, forward availability forecasting, and automated node lifecycle transitions — replacing manual spreadsheet workflows with enforced, auditable, event-driven automation
Collaborating extensively across teams to architect and implement physical infrastructure management systems, availability platforms, and frameworks that meet end-to-end customer use cases
Championing reliability, scalability, and security of our systems, designing high-performing, highly available cloud architectures optimized for both performance and cost-effectiveness
Streamlining cloud deployment, configuration management, and operations using Go, gRPC, NATS event streaming, PostgreSQL (CNPG on Kubernetes), and Netbox as the physical source of truth
Mentoring fellow engineers and actively contributing to team growth in collaboration with engineering managers

What You'll Bring to the Team

A Bachelor's degree in Computer Science or Software Engineering
10+ years of relevant experience building and operating distributed systems at scale
Proven experience building reliable, scalable, and secure cloud platforms and running them in production
Strong distributed systems thinking with the ability to reason about consistency, failure modes, event ordering, and correctness invariants
Fluency in Go, Rust, Java, or C++; Go is our primary language, but strong engineers from other backgrounds ramp quickly
A collaborative, platform-minded approach to building robust systems and driving adoption across dev and ops teams
Ownership mentality with comfort owning a system end to end: design, implementation, testing, ops, and iteration
Good judgment under ambiguity, with the ability to drive open-ended technical decisions to resolution
Excellent communication and troubleshooting skills across cross-functional teams

Bonus Points

Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters
Prior experience with event-driven architectures or message streaming systems (NATS, Kafka, Kinesis)
Experience with capacity planning, resource scheduling, or fleet management systems
Background in GPU compute, AI/ML platform infrastructure, or fast-paced startup environments
A passion for sustainability, clean energy, and building AI infrastructure that scales responsibly

Compensation

Compensation will be paid in the range of $209,000 to $253,000. Restricted Stock Units are included in all offers.

Skills

Go, Kubernetes, gRPC, NATS, PostgreSQL, distributed systems, Rust, Java, C++, Netbox
