Senior Platform Reliability Engineer
Senior Platform Reliability Engineer establishing reliability standards, observability, and incident response practices across engineering teams. Requires 6+ years operating production systems at scale with AWS, Kubernetes, Terraform, and modern observability tooling.
What You'll Work On
You’ll help us establish and scale reliability as a discipline at Grow by:
Defining Reliability Standards
- Establishing frameworks for SLOs/SLAs, error budgets, and operational readiness
- Helping teams understand what to measure and why it matters
Improving Observability & Measurement
- Identifying gaps in metrics, logging, and tracing
- Ensuring services are measurable, debuggable, and aligned with reliability goals
Evolving Incident Response
- Developing and improving incident response practices, from detection to post-incident learning
- Helping teams build sustainable on-call and escalation patterns
Enabling Self-Service Reliability
- Partnering with the platform team to build tooling and abstractions (e.g., service scorecards, dashboards, templates, golden paths)
- Making it easy for teams to adopt and stay compliant with reliability standards
Driving Adoption Across Teams
- Working cross-functionally to educate, influence, and guide engineering teams
- Scaling reliability practices through clear standards, strong communication, and developer-friendly systems
Who You Are
- 6+ years of experience operating production systems at scale and improving their reliability
- Hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform
- Experience defining or working with SLOs/SLAs, error budgets, and improving reliability through measurement and iteration
- Experience with modern observability tooling (e.g., Datadog) and building actionable monitoring systems across metrics, logs, and traces
- Ability to zoom out, identify patterns across teams and services, and design solutions that scale beyond a single system
- Focus on outcomes over output, with a deep commitment to real improvements in reliability
- Strong communicator and influencer who can drive change across teams without direct authority
- Self-directed and comfortable defining problems, proposing solutions, and executing independently
- Collaborative team player who communicates with empathy and enjoys mentoring and learning from others
Bonus Points
- Helped introduce or scale reliability practices in a growing organization
- Built internal tooling or platforms used by multiple teams
- Experience designing service-level scorecards or compliance/reporting systems
- Worked with both SaaS (e.g., Datadog) and self-managed observability stacks
- Previously worked as a product engineer, bringing empathy for developer experience
- Experience with database reliability and performance (e.g., PostgreSQL)