Site Reliability Engineer

100k – 300k | Emeryville, CA | DevOps / SRE | Hybrid
Summary

Owns digital infrastructure for AI research, managing compute access, auto-scaling, resource visibility, and reproducibility using Kubernetes and observability tools. Requires systems intuition, operational rigor, and pragmatism for experimental workloads.

About the role

Responsibilities

  • Own the digital infrastructure powering research, including third-party compute resources, container registries, and dashboards.
  • Ensure resources are reliable, accessible, and easy to share efficiently.
  • Provide compute access and visibility into resource utilization and cluster health.
  • Enable auto-scaling of compute resources based on demand.
  • Manage access so that the right people have the appropriate permissions.
  • Drive deterministic deployments and reproducible research environments.
  • Automate operational processes for efficiency.

Current stack: Ansible, Kubernetes, Docker, Tailscale, Python, Grafana, Prometheus, Talos Linux.

Qualifications

  • Ownership: Comfortable being accountable for cluster health and capacity.
  • Systems Intuition: Understand schedulers, containers, networking, storage, and hardware interactions; reason about failure modes.
  • Operational Rigor: Value observability, reproducibility, and clear boundaries; leave behind systems that others can understand.
  • Pragmatism: Support experimental workloads without rigid production constraints.

Skills

Kubernetes, Docker, Ansible, Prometheus, Grafana, Python, Tailscale, Talos Linux