Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate

86k – 228k · United States · Hardware Engineering · Remote · 8+ YOE · Jun 20

Summary

Senior on-site hardware operations lead responsible for server, GPU, storage, and rack infrastructure reliability at OpenAI's AI campuses. Requires 8+ years of large-scale datacenter hardware experience and strong troubleshooting and cross-functional leadership skills.

About the role

Key Responsibilities

Serve as OpenAI’s senior on-site hardware operations lead for server, GPU, storage, and rack-level infrastructure
Drive technical triage and resolution of complex hardware failures impacting production systems
Partner with Fleet Health Engineering to investigate recurring hardware issues, identify failure patterns, and improve fleet reliability
Lead root cause analysis (RCA) efforts for critical hardware incidents and develop corrective and preventive action plans
Collaborate with Oracle operations teams and OEM vendors to coordinate repairs, replacements, upgrades, and hardware lifecycle activities
Establish and continuously improve hardware maintenance procedures, operational runbooks, and troubleshooting standards
Analyze hardware failure trends and operational metrics to identify reliability risks and improvement opportunities
Support new hardware introductions, validation activities, and production readiness reviews
Coordinate spare parts strategy and inventory planning with supply chain and operations teams
Partner with Hardware Engineering, Manufacturing, and Infrastructure teams to provide field feedback that improves future platform designs
Develop scalable operational standards and best practices that can be deployed across future Stargate campuses
Mentor technicians and partner teams on advanced troubleshooting methodologies and hardware operational excellence

Qualifications

8+ years of experience supporting large-scale datacenter hardware infrastructure, including time in a senior technician, sustaining engineering, or hardware operations leadership role
Deep expertise with server platforms, GPU systems, storage infrastructure, rack integration, and datacenter hardware architecture
Strong experience diagnosing complex hardware failures and leading repair efforts in production environments
Experience conducting root cause analysis and driving long-term corrective actions
Strong understanding of hardware reliability engineering principles and fleet-health management
Proven ability to partner effectively across engineering, operations, manufacturing, and vendor organizations
Comfortable operating independently in high-priority production environments with significant operational responsibility
Excellent written and verbal communication skills with the ability to influence technical and operational decisions
Experience developing operational processes, maintenance standards, and technical documentation
Ability to travel occasionally to support new campus deployments and operational readiness activities

Preferred Qualifications

Experience supporting large-scale GPU clusters or AI/ML infrastructure environments
Familiarity with fleet health systems, telemetry platforms, and hardware monitoring tools
Experience with failure analysis methodologies such as FRACAS, RCCA, 5-Why, Fishbone, or FMEA
Knowledge of Linux system administration and hardware validation workflows
Experience supporting hyperscale datacenter operations or HPC environments
Familiarity with server manufacturing, rack integration, or NPI-to-sustaining transitions
Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent experience
Experience applying Environmental Health and Safety (EHS) practices in mission-critical datacenter environments

Skills

Server Platforms, GPU Systems, Storage Infrastructure, Rack Integration, Datacenter Hardware Architecture, Root Cause Analysis, Hardware Reliability Engineering, Fleet Health Management, Linux System Administration, Hardware Validation
