# Site Reliability Engineer
**Company:** [Blaxel](https://hotfix.jobs/companies/blaxel)
**Location:** San Francisco, CA
**Salary:** $175K-$250K
**Experience:** 3+ years
**Skills:** Kubernetes, Prometheus, Grafana, Datadog, Terraform, Pulumi, Go, Rust, Python, AWS, GCP, Linux, CI/CD, GitHub Actions, GitLab CI
**Posted:** 2026-03-03
> Builds and operates reliable, scalable AI infrastructure, including observability, SLOs, incident response, automation, and performance tuning for ultra-low-latency serverless compute. Requires 3+ years of SRE/DevOps experience with cloud platforms, Kubernetes, programming (Go/Rust/Python), and observability tools.
## Job Description
## Responsibilities

- Architect, operate, and continuously improve the core infrastructure powering our 25ms cold-start compute engine.
- Build and evolve our observability stack (metrics, traces, logs), ensuring we detect issues before users do.
- Define, monitor, and drive SLOs/SLIs across key system surfaces to maintain world-class reliability.
- Lead incident response with rigor: root cause analysis, post-mortems, and driving systemic fixes.
- Design and implement self-healing, automated operational systems to eliminate toil and scale ops.
- Work across compute, networking, storage, and sandboxed execution layers to tune performance under extreme workloads.
- Build automation and tooling, often with AI agents, to streamline operations, debugging, capacity planning, and failure prediction.
- Stress-test and push our systems to the edge: load testing, chaos engineering, and performance benchmarking.
- Own security best practices at the infrastructure layer, from sandboxed compute to network isolation.
- Partner with platform engineers to ensure reliability is designed into new features from day one.

## Requirements

**Required skills:**
- 3+ years in SRE, DevOps, or infrastructure engineering roles
- Strong proficiency in at least one programming language such as **Go**, **Rust**, or **Python**
- Hands-on experience with a major cloud provider (**AWS**, **GCP**)
- Solid knowledge of **Linux** systems, networking fundamentals, and distributed systems
- Experience with bare-metal servers and datacenter operations (PXE/iPXE provisioning, IPMI/BMC, RAID/NVMe, SR-IOV, high-throughput networking)
- Experience with **Kubernetes** or similar orchestrators
- Familiarity with observability stacks (**Prometheus**, **Grafana**, **ELK**, **Datadog**)
- Experience building and maintaining **CI/CD** pipelines (**GitHub Actions**, **GitLab CI**, **Jenkins**)
- Strong debugging, problem-solving, and incident-management skills

**Preferred:**
- Experience with infrastructure-as-code tools such as **Terraform** or **Pulumi**
- Knowledge of service mesh or API gateway technologies
- Exposure to chaos engineering or resiliency-testing frameworks
- Background in security best practices for cloud environments
- Prior experience in high-growth or high-availability environments

**Bonus:**
- Experience with serverless compute systems
- Sandboxed execution environments
- Ultra-low-latency runtime engineering
- Distributed key-value stores and databases
- Chaos engineering
- **Rust**, **Go**, or systems-level programming
- Deep experience with generative AI infrastructure

**Apply:** https://hotfix.jobs/jobs/site-reliability-engineer-at-blaxel-d04240e3-4b33-406e-acc5-592de1a1fa94
**Canonical:** https://hotfix.jobs/jobs/site-reliability-engineer-at-blaxel-d04240e3-4b33-406e-acc5-592de1a1fa94