Senior Site Reliability Engineer (SRE)

The Senior SRE builds and maintains scalable infrastructure, mentors teams on observability best practices (SLIs/SLOs), handles incident response, and builds automation tooling for engineering teams. Requires 5+ years with open source observability tools such as Prometheus and OpenTelemetry, plus the ability to work with services running on Kubernetes.

Somerville, MA · DevOps / SRE · Hybrid · 5+ YOE

About the role

Key Responsibilities

  • Mentor engineers and evangelize observability best practices, SLIs/SLOs, and reliability culture across engineering teams.
  • Help architect our systems for growth and scale.
  • Implement internal tools to automate common developer tasks.
  • Perform incident response and debug production issues across the entire stack.
  • Design, build, and maintain the core infrastructure used by all of Tulip’s engineering teams.
  • Work to automate detection and resolution of recurring issues.

Skills Required

  • 5+ years of experience working with open source observability tools (e.g., the LGTM stack).
  • Hands-on experience instrumenting distributed systems using OpenTelemetry and managing metrics pipelines with Prometheus at scale.
  • Experience working with time-series data, ideally using PromQL.
  • Ability to pick up new languages and frameworks with ease. We currently run Go and TypeScript services on Kubernetes.

About You

  • Experience building and maintaining stable infrastructure at scale.
  • Can reason about systems — their edge cases, failure modes, and life cycles.
  • Excited about setting the technical agenda and coming up with novel, broad ideas.
  • Can debug complex issues across the entire stack.
  • Opinionated about the tools and frameworks that work best.
  • Enjoys building for other engineers as much as, if not more than, building for a customer.
  • Knows what a good SLA looks like, and can teach others how to spot one.
  • Can communicate as well as you can code. Understands the value of discussion and works best in a team that champions clear and frequent communication.

Skills

Prometheus · OpenTelemetry · Kubernetes · Go · TypeScript · PromQL · LGTM Stack · Grafana · Loki · Tempo
