Senior Platform Reliability Engineer

182k – 250k · San Francisco, CA · New York, NY · Seattle, WA · DevOps / SRE · Hybrid · 6+ YOE · Jun 16

Summary

Senior Platform Reliability Engineer establishing reliability standards, observability, and incident response practices across engineering teams. Requires 6+ years operating production systems at scale with AWS, Kubernetes, Terraform, and modern observability tooling.

About the role

What You'll Work On

You’ll help us establish and scale reliability as a discipline at Grow by:

Defining Reliability Standards

Establishing frameworks for SLOs/SLAs, error budgets, and operational readiness
Helping teams understand what to measure and why it matters

Improving Observability & Measurement

Identifying gaps in metrics, logging, and tracing
Ensuring services are measurable, debuggable, and aligned with reliability goals

Evolving Incident Response

Developing and improving incident response practices, from detection to post-incident learning
Helping teams build sustainable on-call and escalation patterns

Enabling Self-Service Reliability

Partnering with the platform team to build tooling and abstractions (e.g., service scorecards, dashboards, templates, golden paths)
Making it easy for teams to adopt and stay compliant with reliability standards

Driving Adoption Across Teams

Working cross-functionally to educate, influence, and guide engineering teams
Scaling reliability practices through clear standards, strong communication, and developer-friendly systems

Who You Are

6+ years of experience operating production systems at scale and improving their reliability
Hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform
Experience defining or working with SLOs/SLAs, error budgets, and improving reliability through measurement and iteration
Experience with modern observability tooling (Datadog) and building actionable monitoring systems across metrics, logs, and traces
Ability to zoom out, identify patterns across teams and services, and design solutions that scale beyond a single system
Focus on outcomes over output, with deep care for improving real-world reliability
Strong communicator and influencer who can drive change across teams without direct authority
Self-directed and comfortable defining problems, proposing solutions, and executing independently
Collaborative team player who communicates with empathy and enjoys mentoring and learning from others

Bonus Points

Helped introduce or scale reliability practices in a growing organization
Built internal tooling or platforms used by multiple teams
Experience designing service-level scorecards or compliance/reporting systems
Worked with both SaaS (e.g., Datadog) and self-managed observability stacks
Previously worked as a product engineer, bringing empathy for developer experience
Experience with database reliability and performance (PostgreSQL)

Skills

AWS · Kubernetes · EKS · Terraform · SLOs · SLAs · Error Budgets · Datadog · Observability · PostgreSQL
