Senior Staff Engineer, Cloud Site Operations

179k – 218k | San Francisco, CA | Sunnyvale, CA | Onsite | 10+ YOE | Apr 6

Summary

Leads technical architecture for data center operations, overseeing global ticket queues, fleet supportability, power topology, resilience planning, and hardware failure escalations for AI infrastructure. Requires 10+ years in data center ops or HPC with deep NVIDIA GPU expertise.

About the role

What You'll Be Working On

Operational Governance & Metrics

Oversee the technical health of our global ticket queue
Partner with internal teams to develop real-time dashboards and track the KPIs/SLAs (MTTR, fleet availability, sparing accuracy) that measure our operational maturity

Fleet Supportability & Tooling

Partner with the Fleet Engineering team to define the software access, diagnostic hooks, and physical tooling required for maximum repair efficiency
Act as the primary advocate for "serviceability" within the white space

Power Topology Strategy

Lead the initiative to map end-to-end "Power Strings," from main distribution down to cabinet PDUs
Lead the Build vs. Buy analysis to determine whether we develop internal mapping tools or procure a third-party solution

Operational Resilience

Architect the framework for our Business Continuity (BCP) and Disaster Recovery (DR) plans
Define the technical protocols for hardware recovery and site-level failovers to ensure minimal disruption to our AI Cloud customers

Technical Advisory & Documentation

Provide expert guidance and architectural "sign-off" to the internal Documentation Committee
Ensure all break-fix SOPs and technical playbooks are accurate, safe, and optimized for global scale

Advanced Escalation & Mentorship

Serve as the final technical authority for systemic or complex hardware failures
Mentor senior technicians and site leads, elevating the collective technical IQ of the global operations team

What You'll Bring to the Team

Technical Mastery

10+ years in Data Center Operations, Systems Engineering, or HPC hardware
Expert-level understanding of x86/GPU server architecture and electrical distribution

The "Supportability" Mindset

Proven experience in hardware maintenance at scale
Ability to translate field challenges into technical requirements for Engineering and Fleet teams to minimize downtime

Hardware Expertise

Deep familiarity with high-density AI infrastructure, including current NVIDIA H200 and Blackwell (GB200) systems
Ability to architect support strategies for the transition to GB300 and Rubin platforms

Data-Driven Leadership

Expert proficiency in defining operational KPIs and building dashboards (e.g., Tableau, Grafana) to drive "Operational Maturity"

Strategic Decision Making

Experience performing Build vs. Buy analyses for technical tools and infrastructure software

Communication

Exceptional ability to distill complex technical risks, ticket-queue trends, and infrastructure hurdles into clear, actionable strategies for senior leadership

Benefits

Competitive compensation
Restricted Stock Units
Paid time off & paid holidays
Comprehensive health, dental & vision insurance
Employer HSA contributions
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)
Cell phone stipend
401(k) Retirement plan with company match up to 4% of salary
Volunteer time off

Compensation Range

Compensation will be paid in the range of $179,000 to $218,000, plus bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.

Skills

NVIDIA H200, Blackwell (GB200), GB300, Rubin, x86, GPU, Tableau, Grafana, Data Center Operations, HPC hardware, Power Distribution, PDUs, Kubernetes, NVIDIA GPUs
