Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate

86k – 228k · United States · Hardware Engineering · Remote · 8+ YOE

Summary

Senior on-site hardware operations lead responsible for server, GPU, storage, and rack infrastructure reliability at OpenAI's AI campuses. Requires 8+ years of large-scale datacenter hardware experience and strong troubleshooting and cross-functional leadership skills.

About the role

Key Responsibilities

  • Serve as OpenAI’s senior on-site hardware operations lead for server, GPU, storage, and rack-level infrastructure
  • Drive technical triage and resolution of complex hardware failures impacting production systems
  • Partner with Fleet Health Engineering to investigate recurring hardware issues, identify failure patterns, and improve fleet reliability
  • Lead root cause analysis (RCA) efforts for critical hardware incidents and develop corrective and preventive action plans
  • Collaborate with Oracle operations teams and OEM vendors to coordinate repairs, replacements, upgrades, and hardware lifecycle activities
  • Establish and continuously improve hardware maintenance procedures, operational runbooks, and troubleshooting standards
  • Analyze hardware failure trends and operational metrics to identify reliability risks and improvement opportunities
  • Support new hardware introductions, validation activities, and production readiness reviews
  • Coordinate spare parts strategy and inventory planning with supply chain and operations teams
  • Partner with Hardware Engineering, Manufacturing, and Infrastructure teams to provide field feedback that improves future platform designs
  • Develop scalable operational standards and best practices that can be deployed across future Stargate campuses
  • Mentor technicians and partner teams on advanced troubleshooting methodologies and hardware operational excellence

Qualifications

  • 8+ years of experience supporting large-scale datacenter hardware infrastructure, including time in a senior technician, sustaining engineering, or hardware operations leadership role
  • Deep expertise with server platforms, GPU systems, storage infrastructure, rack integration, and datacenter hardware architecture
  • Strong experience diagnosing complex hardware failures and leading repair efforts in production environments
  • Experience conducting root cause analysis and driving long-term corrective actions
  • Strong understanding of hardware reliability engineering principles and fleet-health management
  • Proven ability to partner effectively across engineering, operations, manufacturing, and vendor organizations
  • Comfortable operating independently in high-priority production environments with significant operational responsibility
  • Excellent written and verbal communication skills with the ability to influence technical and operational decisions
  • Experience developing operational processes, maintenance standards, and technical documentation
  • Ability to travel occasionally to support new campus deployments and operational readiness activities

Preferred Qualifications

  • Experience supporting large-scale GPU clusters or AI/ML infrastructure environments
  • Familiarity with fleet health systems, telemetry platforms, and hardware monitoring tools
  • Experience with failure analysis methodologies such as FRACAS, RCCA, 5-Why, Fishbone, or FMEA
  • Knowledge of Linux system administration and hardware validation workflows
  • Experience supporting hyperscale datacenter operations or HPC environments
  • Familiarity with server manufacturing, rack integration, or NPI-to-sustaining transitions
  • Industry certifications such as CompTIA Server+, OEM hardware certifications, or equivalent experience
  • Experience applying Environmental Health and Safety (EHS) practices in mission-critical datacenter environments

Skills

  • Server Platforms
  • GPU Systems
  • Storage Infrastructure
  • Rack Integration
  • Datacenter Hardware Architecture
  • Root Cause Analysis
  • Hardware Reliability Engineering
  • Fleet Health Management
  • Linux System Administration
  • Hardware Validation