Site Reliability Engineer

100k – 300k | Emeryville, CA | DevOps / SRE | Hybrid | Jan 29

Summary

Owns digital infrastructure for AI research, managing compute access, auto-scaling, resource visibility, and reproducibility using Kubernetes and observability tools. Requires systems intuition, operational rigor, and pragmatism for experimental workloads.

About the role

Responsibilities

Own digital infrastructure powering research, including compute resources from third parties, container registries, and dashboards.
Ensure resources are reliable, accessible, and easy to share efficiently.
Provide compute access and visibility into resource utilization and cluster health.
Enable auto-scaling of compute resources based on demand.
Manage access to ensure the right people have appropriate permissions.
Drive deterministic deployments and reproducible research environments.
Automate operational processes for efficiency.

Current stack: Ansible, Kubernetes, Docker, Tailscale, Python, Grafana, Prometheus, Talos Linux.

Qualifications

Ownership: Comfortable being accountable for cluster health and capacity.
Systems Intuition: Understand schedulers, containers, networking, storage, and hardware interactions; reason about failure modes.
Operational Rigor: Value observability, reproducibility, and clear boundaries; leave behind systems that others can understand.
Pragmatism: Support experimental workloads without rigid production constraints.

Skills

Kubernetes, Docker, Ansible, Prometheus, Grafana, Python, Tailscale, Talos Linux

Similar roles at this salary range

Openly

Jun 22

Site Reliability Engineer II (Remote, US)

DevOps/SRE II building and maintaining infrastructure for an insurance platform using GCP, Kubernetes, and Terraform. Focus on automation, monitoring, incident response, and security best practices.

115k – 173k | United States | DevOps / SRE | Remote | 2+ YOE | Go | CI/CD

Jun 18

Site Reliability Engineer II

Operate and scale a cloud-native CTV advertising platform on AWS and Kubernetes. Focus on reliability, GitOps workflows, infrastructure automation, observability, and incident response.

114k – 235k | San Francisco, CA | DevOps / SRE | Remote | 4+ YOE | AWS | EKS

CommandLink

Jun 17

Senior Network Engineer

Senior Network Engineer building and supporting carrier interconnects, private circuits, NNIs, and cloud connectivity for a managed network services provider. Requires hands-on service provider experience with Layer 2/3 protocols and direct carrier coordination.

120k – 160k | United States | DevOps / SRE | Remote | 5+ YOE | BGP | VRF

Nuro

Jun 12

Software Reliability Engineer

Build and operate resilient systems for Nuro's autonomous vehicle fleet. Design pipelines, automation, and tools to improve reliability and reduce operational toil. Join on-call rotation and lead investigations.

109k – 163k | Mountain View, CA | DevOps / SRE | On-site | Go | C++

Kraken

Jun 11

Site Reliability Engineer - AI Agents

Design, build, and operate reliable infrastructure for AI agent workflows and model serving on AWS and Kubernetes. Build platform APIs, SDKs, and self-service tooling while ensuring observability and incident response for production AI systems.

96k – 192k | United States | DevOps / SRE | Remote | 5+ YOE | AWS | Bash