# Engineering Manager, Fleet Reliability
**Company:** [Fal](https://hotfix.jobs/companies/fal)
**Location:** Remote
**Experience:** 7+ years
**Skills:** SRE, Incident Management, Postmortems, Observability, Change Management, Automation, GPU Provisioning, Node Validation, Self-Healing Systems, Event-Driven Remediation, SLAs, Infrastructure, Hardware Operations
**Posted:** 2026-05-14
> Leads the Fleet Reliability team, which provisions, manages, and maintains a scaling GPU fleet with 24/7 coverage. Drives automation and SRE practices such as incident management and observability, and hires and develops the team. Requires 7+ years of infrastructure, software, or SRE experience, with 2+ years leading.
## Job Description
## Responsibilities
- Build and lead the Fleet Reliability team: hire, develop, retain
- Own 24/7 coverage for node provisioning, validation, and triage
- Drive the automation roadmap: event-driven remediation, self-healing, observability
- Define and enforce the SLAs that keep production GPUs serving traffic
- Set the culture: how the team keeps score, how they communicate, how they grow

## Requirements
- 7+ years in infrastructure, software, or SRE, with 2+ years leading
- Ran a fleet reliability or hardware ops team in production
- Built SRE fundamentals into a team from scratch: incident management, postmortems, observability, change management
- Pushed teams toward automation over toil
- Player-coach
- Process-oriented without being bureaucratic
- Allergic to toil. Every recurring problem is an automation opportunity
- Carry the pager yourself before asking your team to

**Apply:** https://hotfix.jobs/jobs/engineering-manager-fleet-reliability-at-fal-fca7defc-7181-4123-ba06-850bcb2f3603
**Canonical:** https://hotfix.jobs/jobs/engineering-manager-fleet-reliability-at-fal-fca7defc-7181-4123-ba06-850bcb2f3603