
Principal Operations Engineer, Hardware

150k – 250k | United States | DevOps / SRE | Remote | 10+ YOE
Summary

Principal technical authority for the operational hardware fleet across hyperscale AI data centers. Leads site assessments, audits, and root cause investigations, and drives operational readiness for GPU/server infrastructure at scale.

About the role

Responsibilities

  • Serve as the most senior technical authority for the operational hardware fleet across the hyperscale AI data center portfolio
  • Lead site assessments and operational audits
  • Drive technical readiness of teams ahead of site activation
  • Review hardware platforms and integration designs from an operational lens
  • Feed operational learnings back into hardware engineering, deployment, and supply chain organizations
  • Act as a force multiplier across site hardware leads, deployment teams, and reliability engineers
  • Serve as connective tissue between hardware operations, hardware engineering, network, facilities, and customer-facing teams
  • Diagnose hardware issues on the floor, lead fleet-wide root cause investigations, and hold vendors accountable on RMA processes
  • Author, approve, and execute high-risk MOPs and change records in live production environments
  • Lead root cause analysis on significant hardware events and drive corrective actions to closure
  • Travel extensively across the fleet (50-75%)

Requirements

  • 10+ years of hands-on experience operating mission-critical hardware infrastructure, with at least 5 years as the senior technical voice on a site, campus, or fleet
  • Deep working command of GPU systems, server platforms, storage infrastructure, firmware lifecycle management, and hardware diagnostics
  • Demonstrated ability to author, approve, and execute high-risk MOPs and change records in live production environments
  • Track record of leading root cause analysis on significant hardware events and driving corrective actions to closure
  • Track record of holding OEMs, ODMs, service vendors, and deployment partners accountable
  • Strong written communication skills for operational health assessments, RCAs, procedure reviews, and design review feedback
  • Comfort operating as the senior technical voice across operations, hardware engineering, network, facilities, supply chain, and customer-facing teams
  • Willingness to travel extensively across the fleet (50-75%)

Preferred Qualifications

  • Bachelor's degree in Computer Engineering, Electrical Engineering, Computer Science, or related field
  • Hyperscale or large-scale compute operational experience supporting thousands of servers and accelerator systems
  • Direct experience operating modern GPU platforms at production scale
  • Strong working knowledge of Linux administration, hardware management tooling, and production troubleshooting workflows
  • Experience supporting liquid-cooled compute infrastructure
  • Experience operating across multiple sites or as part of a global fleet operations function
  • Experience standing up new sites from deployment handover through steady-state
  • Experience contributing operational requirements into hardware platform decisions, reference architectures, or productized data center builds
  • Scripting and automation experience in support of fleet-scale hardware operations

Compensation & Benefits

  • Base salary range: $150,000 - $250,000 per year
  • Competitive total compensation package (salary + equity)
  • Retirement or pension plan
  • Health, dental, and vision insurance
  • Generous PTO policy
Skills
GPU systems, Server platforms, Storage infrastructure, Firmware lifecycle management, Hardware diagnostics, Linux administration, Hardware management tooling, Root cause analysis, MOPs, Liquid cooling