
Staff Software Engineer, CAPE

209k – 253k | San Francisco, CA | Sunnyvale, CA | Onsite | 10+ YOE
Summary

Architects and builds the intelligence layer for GPU fleet management, including the Virtual Pool Service and Capacity Management Intelligence systems. Requires 10+ years of experience in distributed systems, fluency in Go or a similar language, and a Bachelor's degree in CS.

About the role

What You'll Be Working On

  • Building the Virtual Pool Service (VP Service), a physical infrastructure classification layer that serves as the single source of truth for every GPU node's state, pool membership, and transition history across Crusoe's fleet
  • Designing and implementing Capacity Management Intelligence (CMI), the automation layer that handles priority-descending allocation, forward availability forecasting, and automated node lifecycle transitions — replacing manual spreadsheet workflows with enforced, auditable, event-driven automation
  • Collaborating extensively across teams to architect and implement physical infrastructure management systems, availability platforms, and frameworks that meet end-to-end customer use cases
  • Championing the reliability, scalability, and security of our systems by designing highly available cloud architectures optimized for both performance and cost-effectiveness
  • Streamlining cloud deployment, configuration management, and operations using Go, gRPC, NATS event streaming, PostgreSQL (CNPG on Kubernetes), and Netbox as the physical source of truth
  • Mentoring fellow engineers and actively contributing to team growth in collaboration with engineering managers

What You'll Bring to the Team

  • A Bachelor's degree in Computer Science or Software Engineering
  • 10+ years of relevant experience building and operating distributed systems at scale
  • Proven experience building reliable, scalable, and secure cloud platforms and running them in production
  • Strong distributed systems thinking with the ability to reason about consistency, failure modes, event ordering, and correctness invariants
  • Fluency in Go, Rust, Java, or C++; Go is our primary language, but strong engineers from other backgrounds ramp quickly
  • A collaborative, platform-minded approach to building robust systems and driving adoption across dev and ops teams
  • Ownership mentality with comfort owning a system end to end: design, implementation, testing, ops, and iteration
  • Good judgment under ambiguity, with the ability to drive open-ended technical decisions to resolution
  • Excellent communication and troubleshooting skills across cross-functional teams

Bonus Points

  • Hands-on experience deploying, managing, and troubleshooting Kubernetes clusters
  • Prior experience with event-driven architectures or message streaming systems (NATS, Kafka, Kinesis)
  • Experience with capacity planning, resource scheduling, or fleet management systems
  • Background in GPU compute, AI/ML platform infrastructure, or fast-paced startup environments
  • A passion for sustainability, clean energy, and building AI infrastructure that scales responsibly

Compensation

Compensation will be paid in the range of $209,000 - $253,000. Restricted Stock Units are included in all offers.

Skills
Go, Kubernetes, gRPC, NATS, PostgreSQL, distributed systems, Rust, Java, C++, Netbox