Engineering Manager, Fleet Reliability

Leads the Fleet Reliability team, which manages, provisions, and maintains a scaling GPU fleet with 24/7 coverage. Drives automation and SRE practices such as incident management and observability, and hires and develops the team. Requires 7+ years of infrastructure/SRE experience, including 2+ years leading.

United States · Engineering Management · Remote · 7+ YOE

About the role

Responsibilities

  • Build and lead the Fleet Reliability team: hire, develop, retain
  • Own 24/7 coverage for node provisioning, validation, and triage
  • Drive the automation roadmap: event-driven remediation, self-healing, observability
  • Define and enforce the SLAs that keep production GPUs serving traffic
  • Set the culture: how the team keeps score, how they communicate, how they grow

Requirements

  • 7+ years in infrastructure, software, or SRE, with 2+ years leading
  • Ran a fleet reliability or hardware ops team in production
  • Built SRE fundamentals into a team from scratch: incident management, postmortems, observability, change management
  • Pushed teams toward automation over toil
  • Player-coach
  • Process-oriented without being bureaucratic
  • Allergic to toil. Every recurring problem is an automation opportunity
  • Carry the pager yourself before asking your team to

Skills

SRE, Incident Management, Postmortems, Observability, Change Management, Automation, GPU Provisioning, Node Validation, Self-Healing Systems, Event-Driven Remediation, SLAs, Infrastructure, Hardware Operations

Engineering Manager - Core Infra

Own the Core Infra team end-to-end: set priorities, raise execution quality, and turn ICs into a self-organizing team. Requires 8+ years building production software (4+ in startups), experience managing platform engineers, and deep familiarity with high-throughput and AI-powered systems.

190k – 250k · New York, NY · Engineering Management · Hybrid · 8+ YOE · SLIs · SLOs

Senior Engineering Manager, AI Product

Lead engineering execution and people management for Thunderbolt, an open-source AI product. Manage senior engineers, contribute technically, and drive production-ready practices for enterprise-grade, privacy-first AI deployments.

215k – 240k · United States · Engineering Management · Remote · 15+ YOE · Tauri · CI/CD

Senior Engineering Manager, AI Product

Lead engineering execution, people management, and operating practices for an open-source AI product moving from R&D to production. Manage senior engineers, contribute technically, and establish scalable engineering practices for enterprise-ready deployment.

215k – 240k · United States · Engineering Management · Remote · 15+ YOE · Web · Tauri

Manager, Applied AI Engineering

Lead and grow a team of Applied AI Engineers advising Enterprise Tech customers on Claude API deployments, architecture, evaluations, and advanced LLM patterns while partnering with Sales, Product, and Engineering.

300k – 405k · San Francisco, CA +1 · Engineering Management · Hybrid · 7+ YOE · LLMs · Python

Manager, Platform Engineering

Hands-on Platform Engineering Manager leading a team to build standardized Kubernetes deployment, self-service tooling, auditable infrastructure, and CI/CD pipelines on AWS.

United States · Engineering Management · Remote · 7+ YOE · AWS · Okta