# Site Reliability Engineer (Mid/Senior/Staff)
**Company:** [Fal](https://hotfix.jobs/companies/fal)
**Location:** San Francisco, CA
**Salary:** $180K-$250K
**Experience:** 5+ years
**Skills:** Kubernetes, Terraform, Ansible, Prometheus, Grafana, CI/CD, GitOps, FluxCD, Argo CD, Python, Go, Linux Networking, eBPF, Cilium, Datadog
**Posted:** 2026-04-30
> Owns and operates Kubernetes infrastructure at scale, builds CI/CD pipelines, automates incident resolution with AI, defines SLOs, and drives reliability through chaos engineering. Requires 5+ years managing production systems, Kubernetes expertise, and monitoring tools like Prometheus/Grafana.
## Job Description
## Key Responsibilities

- Own and operate our Kubernetes infrastructure: cluster lifecycle, upgrades, networking, and multi-tenant isolation for customer workloads
- Build and maintain CI/CD pipelines and deployment infrastructure
- Use AI extensively to automate the analysis and resolution of production issues and to improve software development speed, reliability, and maintainability
- Build dashboards, alerting, and anomaly detection across our systems
- Define and enforce SLOs and build out incident response processes
- Manage and improve our networking, load balancing, and service mesh configurations
- Drive reliability improvements across the stack through automation, runbooks, and chaos engineering

## Requirements

- 5+ years of experience managing critical production systems and software development workflows
- Strong production experience setting up and operating Kubernetes at scale, using infrastructure-as-code (Terraform, Ansible)
- Deep knowledge of Linux networking, container networking (CNI plugins, VXLAN, BGP), and DNS
- Experience building CI/CD systems and GitOps workflows (FluxCD, Argo CD)
- Proficiency in Python and either Go or Bash for tooling and automation
- Strong experience with logging, monitoring, and alerting (Prometheus, Grafana, Loki, Thanos, VictoriaMetrics, Datadog)
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement

## Nice to have

- Experience managing GPU and AI/ML workloads
- Experience with kernel-based monitoring and routing (eBPF, XDP)
- Experience with security tooling (Falco, Coroot, SIEM)
- Experience with bare metal Kubernetes networking (Calico, Cilium, MetalLB)
- Experience with distributed storage systems (Ceph, Longhorn, etc.)

## Compensation

$180,000-$250,000 plus equity and benefits (the range spans all three levels: Mid, Senior, and Staff)
**Apply:** https://hotfix.jobs/jobs/site-reliability-engineer-mid-senior-staff-at-fal-537f7e7b-f169-4712-b094-28b2b537b28c
**Canonical:** https://hotfix.jobs/jobs/site-reliability-engineer-mid-senior-staff-at-fal-537f7e7b-f169-4712-b094-28b2b537b28c