Senior Staff Engineer, Cloud Site Operations
Leads technical architecture for data center operations, overseeing global ticket queues, fleet supportability, power topology, resilience planning, and hardware failure escalations for AI infrastructure. Requires 10+ years in data center ops or HPC with deep NVIDIA GPU expertise.
What You'll Be Working On
Operational Governance & Metrics
- Oversee the technical health of our global ticket queue
- Partner with internal teams to develop real-time dashboards and track the KPIs/SLAs (MTTR, fleet availability, sparing accuracy) that measure our operational maturity
Fleet Supportability & Tooling
- Partner with the Fleet Engineering team to define the software access, diagnostic hooks, and physical tooling required for maximum repair efficiency
- Act as the primary advocate for "serviceability" within the white space
Power Topology Strategy
- Lead the initiative to map end-to-end "Power Strings," from main distribution down to cabinet PDUs
- Lead the Build vs. Buy analysis to determine whether we develop internal mapping tools or procure a third-party solution
Operational Resilience
- Architect the framework for our Business Continuity (BCP) and Disaster Recovery (DR) plans
- Define the technical protocols for hardware recovery and site-level failovers to ensure minimal disruption to our AI Cloud customers
Technical Advisory & Documentation
- Provide expert guidance and architectural "sign-off" to the internal Documentation Committee
- Ensure all break-fix SOPs and technical playbooks are accurate, safe, and optimized for global scale
Advanced Escalation & Mentorship
- Serve as the final technical authority for systemic or complex hardware failures
- Mentor senior technicians and site leads, elevating the collective technical IQ of the global operations team
What You'll Bring to the Team
Technical Mastery
- 10+ years in Data Center Operations, Systems Engineering, or HPC hardware
- Expert-level understanding of x86/GPU server architecture and electrical distribution
The "Supportability" Mindset
- Proven experience in hardware maintenance at scale
- Ability to translate field challenges into technical requirements for Engineering and Fleet teams to minimize downtime
Hardware Expertise
- Deep familiarity with high-density AI infrastructure, including current NVIDIA H200 and Blackwell (GB200) systems
- Ability to architect support strategies for the transition to GB300 and Rubin platforms
Data-Driven Leadership
- Expert proficiency in defining operational KPIs and building dashboards (e.g., Tableau, Grafana) to drive "Operational Maturity"
Strategic Decision Making
- Experience performing Build vs. Buy analyses for technical tools and infrastructure software
Communication
- Exceptional ability to distill complex technical risks, ticket-queue trends, and infrastructure hurdles into clear, actionable strategies for senior leadership
Benefits
- Competitive compensation
- Restricted Stock Units
- Paid time off & paid holidays
- Comprehensive health, dental & vision insurance
- Employer contributions to HSA
- Paid parental leave
- Paid life insurance, short-term and long-term disability
- Professional development & tuition reimbursement
- Mental health & wellness support
- Commuter benefits (parking & transit)
- Cell phone stipend
- 401(k) retirement plan with company match up to 4% of salary
- Volunteer time off
Compensation Range
Compensation will be paid in the range of $179,000 - $218,000 + Bonus. Restricted Stock Units are included in all offers. Compensation is determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.