Senior Platform Reliability Engineer

182k – 250k · San Francisco, CA · New York, NY · Seattle, WA · DevOps / SRE · Hybrid · 6+ YOE

Summary

Senior Platform Reliability Engineer establishing reliability standards, observability, and incident response practices across engineering teams. Requires 6+ years operating production systems at scale with AWS, Kubernetes, Terraform, and modern observability tooling.

About the role

What You'll Work On

You’ll help us establish and scale reliability as a discipline at Grow by:

Defining Reliability Standards

  • Establishing frameworks for SLOs/SLAs, error budgets, and operational readiness
  • Helping teams understand what to measure and why it matters

Improving Observability & Measurement

  • Identifying gaps in metrics, logging, and tracing
  • Ensuring services are measurable, debuggable, and aligned with reliability goals

Evolving Incident Response

  • Developing and improving incident response practices, from detection to post-incident learning
  • Helping teams build sustainable on-call and escalation patterns

Enabling Self-Service Reliability

  • Partnering with the platform team to build tooling and abstractions (e.g., service scorecards, dashboards, templates, golden paths)
  • Making it easy for teams to adopt and stay compliant with reliability standards

Driving Adoption Across Teams

  • Working cross-functionally to educate, influence, and guide engineering teams
  • Scaling reliability practices through clear standards, strong communication, and developer-friendly systems

Who You Are

  • 6+ years of experience operating production systems at scale and improving their reliability
  • Hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform
  • Experience defining or working with SLOs/SLAs, error budgets, and improving reliability through measurement and iteration
  • Experience with modern observability tooling (Datadog) and building actionable monitoring systems across metrics, logs, and traces
  • Ability to zoom out, identify patterns across teams and services, and design solutions that scale beyond a single system
  • Focus on outcomes over output, with deep care for making systems measurably more reliable
  • Strong communicator and influencer who can drive change across teams without direct authority
  • Self-directed and comfortable defining problems, proposing solutions, and executing independently
  • Collaborative team player who communicates with empathy and enjoys mentoring and learning from others

Bonus Points

  • Helped introduce or scale reliability practices in a growing organization
  • Built internal tooling or platforms used by multiple teams
  • Experience designing service-level scorecards or compliance/reporting systems
  • Worked with both SaaS (e.g., Datadog) and self-managed observability stacks
  • Previous experience as a product engineer, bringing empathy for developer experience
  • Experience with database reliability and performance (PostgreSQL)

Skills

AWS · Kubernetes · EKS · Terraform · SLOs · SLAs · Error Budgets · Datadog · Observability · PostgreSQL