# Operations Engineer, HPC Networking
**Company:** [Fal](https://hotfix.jobs/companies/fal)
**Location:** Remote
**Skills:** InfiniBand, Ethernet, NCCL, Subnet Manager, HCAs, Switch Firmware, RoCE, Spectrum-X, Bash, Python, Go, GPU Cluster Networking
**Posted:** 2026-05-14
> Hands-on operations engineer maintaining and troubleshooting high-performance InfiniBand and Ethernet networking fabrics in large-scale GPU clusters. Requires production experience with subnet management, full-stack debugging, fabric bring-up, and scripting.
## Job Description
## Responsibilities
- Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
- Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
- Support fabric bring-up alongside DC ops and customer-facing teams.
- Run maintenance and upgrades on switches and control plane components.
- Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
- Improve the tooling and runbooks so the next incident resolves faster.

## Requirements
- Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
- Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
- Brought up new fabrics from cable pull through validation.
- Scripted your way through repetitive operational work (**Bash**, **Python**, **Go**).

## Nice to Have
- Ethernet RoCE, Spectrum-X, or large-scale GPU cluster networking.

**Apply:** https://hotfix.jobs/jobs/operations-engineer-hpc-networking-at-fal-753975fd-6ada-4513-9160-8f3b456619a5
**Canonical:** https://hotfix.jobs/jobs/operations-engineer-hpc-networking-at-fal-753975fd-6ada-4513-9160-8f3b456619a5