
Lead Site Reliability Engineer

200k – 275k | San Francisco, CA | Onsite | 7+ YOE
Summary

Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.

About the role

What You’ll Do

  • Set the Reliability Strategy: define the long-term vision for site reliability, including SLOs/SLIs, error budgets, availability targets, and operational standards.
  • Build & Scale Reliable Infrastructure: architect and maintain resilient, scalable cloud infrastructure across AWS and Kubernetes, ensuring systems are secure, fault-tolerant, and cost-effective.
  • Own Observability & Monitoring: design and evolve monitoring, alerting, and logging systems that provide clear, actionable signals across services and environments.
  • Lead Incident Response & Postmortems: own incident management practices, lead major incident response, and drive blameless postmortems that result in meaningful system improvements.
  • Improve System Resilience: identify reliability risks and lead efforts around redundancy, failover, capacity planning, and graceful degradation.
  • Optimize CI/CD & Deployment Reliability: partner with engineering teams to ensure deployments are safe, observable, and reversible; improve rollout strategies and reduce operational risk.
  • Partner with Product & Engineering Teams: collaborate early in the development lifecycle to influence system design, scalability, and reliability tradeoffs.
  • Reduce Toil & Improve Developer Experience: automate operational tasks, improve runbooks, and build tooling that reduces manual work and accelerates safe execution.
  • Drive Root Cause Resolution: guide teams through deep debugging of reliability issues, ensuring fixes address underlying causes rather than symptoms.
  • Influence Reliability Culture: promote reliability-first thinking, strong operational hygiene, and shared ownership of production systems across engineering.
  • Mentor & Level Up the Team: coach engineers on reliability principles, incident handling, infrastructure design, and operational best practices.

You Might Be a Fit If You…

  • Have 7+ years of experience in site reliability engineering, infrastructure engineering, or backend software engineering.
  • Have designed and operated highly available, production-grade systems supporting rapid product iteration.
  • Are fluent in Python and/or TypeScript, and comfortable building automation and tooling to support reliability goals.
  • Have deep experience with AWS, Kubernetes (EKS), Docker, and cloud-native architectures.
  • Have implemented and evolved observability stacks (metrics, logs, traces) and know how to create high-signal alerting.
  • Understand how to design, measure, and enforce SLOs, SLIs, and error budgets.
  • Have supported systems built with modern stacks such as FastAPI, Vue.js, PostgreSQL (RDS), and event-driven architectures.
  • Have improved reliability and operational maturity in environments using CI/CD pipelines, infrastructure as code, and modern deployment workflows.
  • Can balance reliability, velocity, and cost, making pragmatic tradeoffs that serve customers and the business.
  • Enjoy collaborating across Product, Backend, Frontend, and Infrastructure teams to improve system health.
  • Thrive in a role that blends deep technical execution, system design, and leadership influence in a fast-moving environment.

Compensation

  • Top-of-market salary and equity package
  • Benefits (for U.S.-based full-time employees): Medical, dental & vision insurance coverage for you; 401(k) & Match; Equity; Flexible PTO; Parental Leave
Skills
Python, TypeScript, AWS, Kubernetes, EKS, Docker, FastAPI, Vue.js, PostgreSQL, RDS, CI/CD, Infrastructure as Code, Observability, SLOs, SLIs