Fal DevOps / SRE Jobs

Operations Engineer, HPC Networking

Hands-on operations engineer maintaining and troubleshooting high-performance InfiniBand and Ethernet networking fabrics in large-scale GPU clusters. Requires production experience with subnet management, full-stack debugging, fabric bring-up, and scripting.

United States | DevOps / SRE | Remote | Go | NCCL

Operations Engineer, Fleet Reliability

Hands-on operations engineer maintaining GPU clusters (B300, H200, H100), troubleshooting hardware/software issues, monitoring fleet health, and automating runbooks. Requires Linux administration, GPU debugging, experience with observability tools, and comfort with on-call work.

United States | DevOps / SRE | Remote | Go | GPU

Site Reliability Engineer (Mid/Senior/Staff)

Owns and operates Kubernetes infrastructure at scale, builds CI/CD pipelines, automates incident resolution with AI, defines SLOs, and drives reliability through chaos engineering. Requires 5+ years managing production systems, Kubernetes expertise, and monitoring tools like Prometheus/Grafana.

180k – 250k | San Francisco, CA | DevOps / SRE | Hybrid | 5+ YOE | Go | eBPF

Infrastructure Engineer (Mid/Senior/Staff)

Builds and maintains Python-based tooling for managing large-scale GPU server fleets, including provisioning, health monitoring, AI-driven recovery, Linux tuning, and security hardening. Requires 3+ years managing server fleets at scale with strong Python and Linux expertise.

180k – 250k | San Francisco, CA | DevOps / SRE | Hybrid | NFS | LVM

Senior/Staff Virtualization Engineer

Builds custom compute environments including bare metal/virtual machines with GPU passthrough, dedicated Kubernetes clusters, and overlay networking for customer AI workloads. Requires 5+ years in Linux virtualization, strong networking, Kubernetes on bare metal, and NVIDIA GPU expertise.

180k – 250k | San Francisco, CA | DevOps / SRE | On-site | 5+ YOE | BGP | VFIO

5 jobs
