Principal Production Engineer

Owns reliability, scalability, and observability of cloud infrastructure, including compute, storage, and networking, at massive scale. Drives SLOs, incident response, and tooling, and mentors engineers; requires 15+ years of experience with data centers and internet-scale operations.

$261k – $326k | San Francisco, CA | Sunnyvale, CA | DevOps / SRE | Onsite | 15+ YOE

About the role

What You'll Be Working On

  • Own the reliability and scalability of Crusoe's cloud infrastructure — compute, storage, and networking — defining SLOs, leading incident response, and driving systemic improvements that reduce toil and raise the bar across the platform
  • Build and mature the observability and tooling layer — from network fabric telemetry and storage health to control plane instrumentation and on-call tooling — so the team can detect, diagnose, and resolve issues faster than customers notice them
  • Drive platform reliability improvements across the full cloud stack, partnering closely with software, hardware, and network engineering teams to influence architecture decisions early, before they become operational debt
  • Act as a trusted advisor to senior leadership, bringing perspective on observability trends and advocating for the right long-term technology investments
  • Set the technical standards for how Crusoe's production engineering organization builds, operates, and scales — defining on-call culture, incident frameworks, and reliability practices that grow with the company
  • Mentor senior and staff engineers, elevate the team's collective technical depth, and be the person others seek out when the problem is genuinely hard

What You'll Bring to the Team

  • 15+ years of experience in infrastructure, networking, or production engineering — with meaningful time at companies operating at internet scale (cloud providers, CDNs, large-scale social/media platforms, or similar)
  • Strong systems fundamentals: Linux, distributed systems, storage, compute scheduling — you understand the full stack from hardware up
  • Hands-on data center experience: you've done physical infra, understand power and thermal constraints, and can reason about reliability at the facility level, not just the server level
  • The ability to write code — not necessarily full-time, but enough to automate what shouldn't be manual, instrument what isn't observable, and build tooling your team will actually use
  • Excellent analytical and problem-solving skills, including the ability to synthesize ambiguous customer and system signals into clear problem statements and designs
  • Strong incident command: you lead calmly under pressure, communicate clearly during outages, and run blameless retrospectives that actually improve systems

Bonus Points:

  • Deep networking expertise: BGP, OSPF, ECMP, load balancing, and low-latency network design in production — you can debug a routing issue and design a fabric, sometimes in the same incident
  • Experience with HPC infrastructure: GPU cluster operations, job schedulers (Slurm, Kubernetes), high-bandwidth interconnects (InfiniBand, RoCE)
  • Prior principal or staff IC role where you influenced org-level technical strategy, not just project-level execution
  • Exposure to sustainability-focused or energy-constrained compute environments

Benefits

  • Industry-competitive pay
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Compensation

Compensation will be paid in the range of $261,000 - $326,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Skills

Linux, Kubernetes, Slurm, BGP, OSPF, ECMP, InfiniBand, RoCE, Distributed Systems, GPU Clusters
