Principal Operations Engineer, Hardware

150k – 250k · United States · DevOps / SRE · Remote · 10+ YOE · Jun 9

Summary

Principal technical authority for the operational hardware fleet across hyperscale AI data centers. Lead site assessments, audits, and root cause investigations, and drive operational readiness for GPU/server infrastructure at scale.

About the role

Responsibilities

Serve as the most senior technical authority for the operational hardware fleet across the hyperscale AI data center portfolio
Lead site assessments and operational audits
Drive technical readiness of teams ahead of site activation
Review hardware platforms and integration designs from an operational lens
Feed operational learnings back into hardware engineering, deployment, and supply chain organizations
Act as a force multiplier across site hardware leads, deployment teams, and reliability engineers
Serve as connective tissue between hardware operations, hardware engineering, network, facilities, and customer-facing teams
Diagnose hardware issues on the floor, lead fleet-wide root cause investigations, and hold vendors accountable on RMA processes
Author, approve, and execute high-risk methods of procedure (MOPs) and change records in live production environments
Lead root cause analysis on significant hardware events and drive corrective actions to closure
Travel extensively across the fleet (50-75%)

Requirements

10+ years of hands-on experience operating mission-critical hardware infrastructure, with at least 5 years as the senior technical voice on a site, campus, or fleet
Deep working command of GPU systems, server platforms, storage infrastructure, firmware lifecycle management, and hardware diagnostics
Demonstrated ability to author, approve, and execute high-risk MOPs and change records in live production environments
Track record of leading root cause analysis on significant hardware events and driving corrective actions to closure
Track record of holding OEMs, ODMs, service vendors, and deployment partners accountable
Strong written communication skills for operational health assessments, RCAs, procedure reviews, and design review feedback
Comfort operating as the senior technical voice across operations, hardware engineering, network, facilities, supply chain, and customer-facing teams
Willingness to travel extensively across the fleet (50-75%)

Preferred Qualifications

Bachelor's degree in Computer Engineering, Electrical Engineering, Computer Science, or related field
Hyperscale or large-scale compute operational experience supporting thousands of servers and accelerator systems
Direct experience operating modern GPU platforms at production scale
Strong working knowledge of Linux administration, hardware management tooling, and production troubleshooting workflows
Experience supporting liquid-cooled compute infrastructure
Experience operating across multiple sites or as part of a global fleet operations function
Experience standing up new sites from deployment handover through steady-state
Experience contributing operational requirements into hardware platform decisions, reference architectures, or productized data center builds
Scripting and automation experience in support of fleet-scale hardware operations

Compensation & Benefits

Base salary range: $150,000 - $250,000 per year
Competitive total compensation package (salary + equity)
Retirement or pension plan
Health, dental, and vision insurance
Generous PTO policy

Skills

GPU systems
Server platforms
Storage infrastructure
Firmware lifecycle management
Hardware diagnostics
Linux administration
Hardware management tooling
Root cause analysis
MOPs
Liquid cooling
