Senior Site Reliability Engineer (SRE)

The Senior SRE builds and maintains scalable infrastructure, mentors engineering teams on observability best practices (SLIs/SLOs), handles incident response, and automates common developer tasks. Requires 5+ years of experience with open source observability tools, including hands-on work with Prometheus and OpenTelemetry; services run on Kubernetes.

Somerville, MA | DevOps / SRE | Hybrid | 5+ YOE

About the role

Key Responsibilities

Mentor engineering teams and evangelize observability best practices, SLIs/SLOs, and a culture of reliability.
Help architect our systems for growth and scale.
Implement internal tools to automate common developer tasks.
Perform incident response and debug production issues across the entire stack.
Design, build, and maintain the core infrastructure used by all of Tulip’s engineering teams.
Work to automate detection and resolution of recurring issues.

Skills Required

5+ years of experience working with open source observability tools (e.g., the LGTM stack).
Hands-on experience instrumenting distributed systems using OpenTelemetry and managing metrics pipelines with Prometheus at scale.
Experience working with time-series data, ideally using PromQL.
Ability to pick up new languages and frameworks with ease. We currently run Go and TypeScript services on Kubernetes.

About You

Experience building and maintaining stable infrastructure at scale.
Can reason about systems — their edge cases, failure modes, and life cycles.
Excited about setting the technical agenda and coming up with novel, broad ideas.
Can debug complex issues across the entire stack.
Opinionated about the tools and frameworks that work best.
Enjoys building for other engineers as much as, if not more than, building for a customer.
Knows what a good SLA looks like, and can teach others how to spot one.
Communicates as well as they code. Understands the value of discussion and works best in a team that champions clear and frequent communication.

Skills

Prometheus, OpenTelemetry, Kubernetes, Go, TypeScript, PromQL, LGTM Stack, Grafana, Loki, Tempo
