# Operations Engineer, Fleet Reliability
**Company:** [Fal](https://hotfix.jobs/companies/fal)
**Location:** Remote
**Skills:** Linux, GPU, NVLink, NCCL, InfiniBand, Grafana, Prometheus, Bash, Python, Go
**Posted:** 2026-05-14
> Hands-on operations engineer maintaining GPU clusters (B300, H200, H100), troubleshooting hardware and software issues, monitoring fleet health, and automating runbooks. Requires Linux administration, GPU debugging, experience with observability tools, and comfort with on-call.
## Job Description
## Responsibilities
- Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
- Troubleshoot hardware and software issues across compute, network, and storage
- Monitor fleet health, take remediation action, and push fixes upstream when needed
- Write the runbooks. Improve the ones that exist. Delete the ones that don't work

## Requirements
- Administered Linux systems in the critical path
- Troubleshooted GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
- Worked with observability systems like Grafana and Prometheus
- Scripted your way out of repetitive work (Bash, Python, Go, whatever)
- Curious. You don't accept "it's flaky" as a root cause
- Comfortable with ambiguity. The runbook doesn't exist yet for half of what you'll do
- On-call doesn't scare you
- You'd rather automate a problem than fix it twice
**Apply:** https://hotfix.jobs/jobs/operations-engineer-fleet-reliability-at-fal-40e46748-5f94-4d69-8aa1-5e4f8e81c1eb
**Canonical:** https://hotfix.jobs/jobs/operations-engineer-fleet-reliability-at-fal-40e46748-5f94-4d69-8aa1-5e4f8e81c1eb