Principal Operations Engineer, Hardware
Principal technical authority for the operational hardware fleet across hyperscale AI data centers. Lead site assessments, audits, and root cause investigations, and drive operational readiness for GPU and server infrastructure at scale.
Responsibilities
- Serve as the most senior technical authority for the operational hardware fleet across the hyperscale AI data center portfolio
- Lead site assessments and operational audits
- Drive technical readiness of teams ahead of site activation
- Review hardware platforms and integration designs from an operational lens
- Feed operational learnings back into hardware engineering, deployment, and supply chain organizations
- Act as a force multiplier across site hardware leads, deployment teams, and reliability engineers
- Serve as connective tissue between hardware operations, hardware engineering, network, facilities, and customer-facing teams
- Diagnose hardware issues on the floor, lead fleet-wide root cause investigations, and hold vendors accountable for return merchandise authorization (RMA) processes
- Author, approve, and execute high-risk methods of procedure (MOPs) and change records in live production environments
- Lead root cause analysis on significant hardware events and drive corrective actions to closure
- Travel extensively across the fleet (50-75%)
Requirements
- 10+ years of hands-on experience operating mission-critical hardware infrastructure, with at least 5 years as the senior technical voice on a site, campus, or fleet
- Deep working command of GPU systems, server platforms, storage infrastructure, firmware lifecycle management, and hardware diagnostics
- Demonstrated ability to author, approve, and execute high-risk MOPs and change records in live production environments
- Track record of leading root cause analysis on significant hardware events and driving corrective actions to closure
- Track record of holding OEMs, ODMs, service vendors, and deployment partners accountable
- Strong written communication skills for operational health assessments, RCAs, procedure reviews, and design review feedback
- Comfort operating as the senior technical voice across operations, hardware engineering, network, facilities, supply chain, and customer-facing teams
- Willingness to travel extensively across the fleet (50-75%)
Preferred Qualifications
- Bachelor's degree in Computer Engineering, Electrical Engineering, Computer Science, or related field
- Hyperscale or large-scale compute operational experience supporting thousands of servers and accelerator systems
- Direct experience operating modern GPU platforms at production scale
- Strong working knowledge of Linux administration, hardware management tooling, and production troubleshooting workflows
- Experience supporting liquid-cooled compute infrastructure
- Experience operating across multiple sites or as part of a global fleet operations function
- Experience standing up new sites from deployment handover through steady-state
- Experience contributing operational requirements into hardware platform decisions, reference architectures, or productized data center builds
- Scripting and automation experience in support of fleet-scale hardware operations
Compensation & Benefits
- Base salary range: $150,000 - $250,000 per year
- Competitive total compensation package (salary + equity)
- Retirement or pension plan
- Health, dental, and vision insurance
- Generous PTO policy