Staff Infrastructure Engineer

208k – 253k | San Francisco, CA | Sunnyvale, CA | Onsite
Summary

The Staff Infrastructure Engineer manages cloud infrastructure operations, develops automation for server provisioning, scales deployments, troubleshoots GPU hardware, and leads the transition to Kubernetes. The role requires strong Linux, hardware, and Kubernetes expertise.

About the role

What You’ll Be Doing

  • Manage and maintain day-to-day operations of Crusoe’s cloud infrastructure.
  • Develop automation tools to streamline server provisioning and reduce SLA times.
  • Scale infrastructure to support mass deployments (80-100 servers simultaneously).
  • Troubleshoot hardware issues, especially with GPUs, and liaise with vendors.
  • Transition Crusoe’s environment to Kubernetes and containerized workflows.

What You’ll Bring to the Team

  • Solid hardware experience and GPU troubleshooting expertise.
  • Strong Linux background.
  • Knowledge of PXE booting and server provisioning (bare metal).
  • Experience with BMC/IPMI, BIOS, and enterprise-grade server management.
  • Kubernetes proficiency (admin or developer).
  • Familiarity with containerization technologies (Docker preferred).
  • Experience with version control systems (GitLab).

Nice to haves:

  • Experience with MAAS.
  • Proficiency in Python or Golang (preferred language).
  • Kubernetes administration and deployment experience.
  • Experience with Ansible and Terraform.

Compensation

$208,000 - $253,000 + Bonus. Restricted Stock Units are included in all offers.

Skills
Kubernetes, Linux, Docker, Ansible, Terraform, Python, Golang, GitLab, IPMI, PXE