Site Reliability Engineer

Builds and operates reliable, scalable AI infrastructure including observability, SLOs, incident response, automation, and performance tuning for ultra-low-latency serverless compute. Requires 3+ years SRE/DevOps experience with cloud, Kubernetes, programming (Go/Rust/Python), and observability tools.

175k – 250k · San Francisco, CA · DevOps / SRE · Onsite · 3+ YOE

About the role

Responsibilities

Architect, operate, and continuously improve the core infrastructure powering our 25 ms cold-start compute engine.
Build and evolve our observability stack (metrics, traces, logs), ensuring we detect issues before users do.
Define, monitor, and drive SLOs/SLIs across key system surfaces to maintain world-class reliability.
Lead incident response with rigor: root cause analysis, post-mortems, and driving systemic fixes.
Design and implement self-healing, automated operational systems to eliminate toil and scale ops.
Work across compute, networking, storage, and sandboxed execution layers to tune performance under extreme workloads.
Build automation and tooling—often with AI agents—to streamline operations, debugging, capacity planning, and failure prediction.
Stress-test and push our systems to the edge: load testing, chaos engineering, and performance benchmarking.
Own security best practices at the infrastructure layer, from sandboxed compute to network isolation.
Partner with platform engineers to ensure reliability is designed into new features from day one.

Requirements

Required skills:

3+ years in SRE, DevOps, or infrastructure engineering roles
Strong proficiency in at least one programming language such as Go, Rust, or Python
Hands-on experience with a major cloud provider (AWS, GCP)
Solid knowledge of Linux systems, networking fundamentals, and distributed systems
Experience with bare-metal servers and datacenter operations (PXE/iPXE provisioning, IPMI/BMC, RAID/NVMe, SR-IOV, high-throughput networking)
Experience with Kubernetes or similar orchestrators
Familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog)
Experience building and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
Strong debugging, problem-solving, and incident-management skills

Preferred:

Experience with infrastructure-as-code tools such as Terraform or Pulumi
Knowledge of service mesh or API gateway technologies
Exposure to chaos engineering or resiliency-testing frameworks
Background in security best practices for cloud environments
Prior experience in high-growth or high-availability environments

Bonus:

Experience with serverless compute systems
Sandboxed execution environments
Ultra-low-latency runtime engineering
Distributed key-value stores and databases
Chaos engineering
Rust, Go, or systems-level programming
Deep experience with generative AI infrastructure

Skills

Kubernetes, Prometheus, Grafana, Datadog, Terraform, Pulumi, Go, Rust, Python, AWS, GCP, Linux, CI/CD, GitHub Actions, GitLab CI
