Site Reliability Engineer

Builds and operates reliable, scalable AI infrastructure including observability, SLOs, incident response, automation, and performance tuning for ultra-low-latency serverless compute. Requires 3+ years SRE/DevOps experience with cloud, Kubernetes, programming (Go/Rust/Python), and observability tools.

175k – 250k | San Francisco, CA | DevOps / SRE | Onsite | 3+ YOE

About the role

Responsibilities

  • Architect, operate, and continuously improve the core infrastructure powering our 25 ms cold-start compute engine.
  • Build and evolve our observability stack (metrics, traces, logs), ensuring we detect issues before users do.
  • Define, monitor, and drive SLOs/SLIs across key system surfaces to maintain world-class reliability.
  • Lead incident response with rigor: root cause analysis, post-mortems, and driving systemic fixes.
  • Design and implement self-healing, automated operational systems to eliminate toil and scale ops.
  • Work across compute, networking, storage, and sandboxed execution layers to tune performance under extreme workloads.
  • Build automation and tooling, often with AI agents, to streamline operations, debugging, capacity planning, and failure prediction.
  • Stress-test and push our systems to the edge: load testing, chaos engineering, and performance benchmarking.
  • Own security best practices at the infrastructure layer, from sandboxed compute to network isolation.
  • Partner with platform engineers to ensure reliability is designed into new features from day one.

Requirements

Required skills:

  • 3+ years in SRE, DevOps, or infrastructure engineering roles
  • Strong proficiency in at least one programming language such as Go, Rust, or Python
  • Hands-on experience with a major cloud provider (AWS, GCP)
  • Solid knowledge of Linux systems, networking fundamentals, and distributed systems
  • Experience with bare-metal servers and datacenter operations (PXE/iPXE provisioning, IPMI/BMC, RAID/NVMe, SR-IOV, high-throughput networking)
  • Experience with Kubernetes or similar orchestrators
  • Familiarity with observability stacks (Prometheus, Grafana, ELK, Datadog)
  • Experience building and maintaining CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Strong debugging, problem-solving, and incident-management skills

Preferred:

  • Experience with infrastructure-as-code tools such as Terraform or Pulumi
  • Knowledge of service mesh or API gateway technologies
  • Exposure to chaos engineering or resiliency-testing frameworks
  • Background in security best practices for cloud environments
  • Prior experience in high-growth or high-availability environments

Bonus:

  • Experience with serverless compute systems
  • Sandboxed execution environments
  • Ultra-low-latency runtime engineering
  • Distributed key-value stores and databases
  • Chaos engineering
  • Rust, Go, or systems-level programming
  • Deep experience with generative AI infrastructure

Skills

Kubernetes, Prometheus, Grafana, Datadog, Terraform, Pulumi, Go, Rust, Python, AWS, GCP, Linux, CI/CD, GitHub Actions, GitLab CI
