Staff Site Reliability Engineer - Observability

194k – 267k | San Francisco, CA | Hybrid | 5+ YOE
Summary

Builds and scales an observability platform on GCP using Terraform, with Python and Go for automation. Requires 5+ years with GKE, 3+ years in SRE, and expertise in Splunk/Grafana dashboards, Kubernetes, and distributed systems.

About the role

Key Responsibilities

  • Automated Infrastructure: Design, build, and maintain scalable observability infrastructure using tools like Terraform.
  • GCP Observability Engineering: Optimize collection, processing, and storage of observability data for high reliability and low latency in Splunk and Grafana services.
  • Incident Response: Participate in on-call rotations and lead post-incident reviews to drive systemic improvements.
  • Automation: Eliminate toil by automating deployment and scaling of observability agents and collectors.

Required Skills & Experience

  • GKE: 5+ years scaling and managing observability on Google Cloud Platform.
  • Visualization: Expertise in Splunk or Grafana dashboards correlating data across sources.
  • SRE Mindset: 3+ years in SRE, DevOps, or Systems Engineering with high-availability systems.
  • Programming Proficiency: Strong skills in Python and Go for building tools and automating workflows.
  • Distributed Systems: Deep understanding of Linux internals, networking (TCP/IP, DNS, Load Balancing), and container orchestration (Kubernetes/GKE).
  • Problem Solving: Data-driven debugging of complex performance bottlenecks.

Bonus Skills

  • Telemetry Standards: Experience with OpenTelemetry (OTel) and Vector.
  • Grafana Loki: Experience migrating from Splunk to Grafana Loki.
  • Other Cloud Platforms: Experience managing observability tools in AWS.

Compensation

Annual base salary range: $194,000 - $267,000 USD (San Francisco Bay Area), plus equity, bonus, and benefits.

Skills
Google Cloud | GKE | Terraform | Splunk | Grafana | Kubernetes | Python | Go | Linux | OpenTelemetry
Similar roles at this salary range
Crusoe

Staff Software Engineer, Developer Experience

Staff-level engineer building developer tools, infrastructure, and automation to accelerate Crusoe engineering productivity. Requires Go, Kubernetes, CI/CD, and strong DevOps/SRE experience.

209k – 253k | San Francisco, CA +1 | DevOps / SRE | On-site | Go | Git
Aurelian

Staff Infrastructure Engineer

Build infrastructure, observability, and developer tooling for a realtime AI platform serving 911 centers. Requires 6+ years infrastructure/platform/backend experience and comfort across the full stack.

180k – 240k | Seattle, WA | DevOps / SRE | On-site | Logging | ClickHouse
Stuut

Lead Site Reliability Engineer

Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.

200k – 275k | San Francisco, CA | DevOps / SRE | On-site | AWS | EKS
Huntress

Senior Developer Experience Engineer

Senior Platform Engineer focused on Developer Experience building tools, automation, CI/CD systems, and AI tooling to improve developer productivity and workflows. Requires 7+ years cloud experience, containerization, and proficiency in Ruby, Go, or Python.

160k – 190k | United States | DevOps / SRE | Remote | Go | Ruby
Crusoe

Staff Network Engineer, Operations

Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.

195k – 235k | San Francisco, CA | DevOps / SRE | On-site | BGP | QoS