# Infrastructure Engineer (Mid/Senior/Staff)
**Company:** [Fal](https://hotfix.jobs/companies/fal)
**Location:** San Francisco, CA
**Salary:** $180K-$250K
**Skills:** Python, Linux, Ansible, Terraform, NVIDIA GPU, CUDA, Kubernetes, NVMe, NFS, SELinux, Prometheus, Grafana, DCGM, LVM, RAID
**Posted:** 2026-04-30
> Builds and maintains Python-based tooling for managing large-scale GPU server fleets, including provisioning, health monitoring, AI-driven recovery, Linux tuning, and security hardening. Requires 3+ years managing server fleets at scale with strong Python and Linux expertise.
## Job Description
## Key Responsibilities
- Build and maintain a Python fleet-tracking system that manages the full server lifecycle, including contracting and procurement, target use, pricing, availability, health, and RMAs
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
- Use AI extensively to build tools and automate alerting and recovery
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
- Develop a suite of automated error detection and recovery processes
- Work with partners to solve technical issues

## Requirements
- 3+ years of experience managing bare-metal and cloud-based server fleets at scale (100+ nodes)
- Strong software engineering skills in **Python**; you write production tooling, not scripts
- Deep **Linux** systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
- Strong experience with configuration management and infrastructure-as-code: **Ansible**, **Terraform**, cloud-init
- Solid understanding of storage technologies: **LVM**, **RAID**, **NVMe**, **NFS**, Lustre or GPFS, and Linux I/O stack tuning
- Familiarity with hardware diagnostics and failure modes (**GPUs**, **NVMe**, NICs, memory)
- Experience building internal tools or dashboards for infrastructure visibility
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement

## Nice to Have
- Familiarity with network configuration and diagnostics (**VLAN**, **VXLAN**, **ECMP**, **BGP**, tcpdump)
- Experience with **NVIDIA GPU** infrastructure: driver management, health monitoring, **DCGM**, **NVLink/NVSwitch** diagnostics, **RDMA**, **InfiniBand/RoCEv2**
- Experience with **AMD GPUs**
- Experience with bare-metal and VM provisioning (**PXE/iPXE**, **Kickstart**, libvirt, **QEMU/KVM**)
- Experience with compliance frameworks relevant to cloud providers (**SOC 2**, **ISO 27001**)

## Compensation
**$180,000-$250,000** plus equity and benefits
**Apply:** https://hotfix.jobs/jobs/infrastructure-engineer-mid-senior-staff-at-fal-a1910672-607c-449a-88c8-f88532a787ce
**Canonical:** https://hotfix.jobs/jobs/infrastructure-engineer-mid-senior-staff-at-fal-a1910672-607c-449a-88c8-f88532a787ce