Lead Site Reliability Engineer
200k – 275k · San Francisco, CA · Onsite · 7+ YOE
Summary
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
About the role
What You’ll Do
- Set the Reliability Strategy: define the long-term vision for site reliability, including SLOs/SLIs, error budgets, availability targets, and operational standards.
- Build & Scale Reliable Infrastructure: architect and maintain resilient, scalable cloud infrastructure across AWS and Kubernetes, ensuring systems are secure, fault-tolerant, and cost-effective.
- Own Observability & Monitoring: design and evolve monitoring, alerting, and logging systems that provide clear, actionable signals across services and environments.
- Lead Incident Response & Postmortems: own incident management practices, lead major incident response, and drive blameless postmortems that result in meaningful system improvements.
- Improve System Resilience: identify reliability risks and lead efforts around redundancy, failover, capacity planning, and graceful degradation.
- Optimize CI/CD & Deployment Reliability: partner with engineering teams to ensure deployments are safe, observable, and reversible; improve rollout strategies and reduce operational risk.
- Partner with Product & Engineering Teams: collaborate early in the development lifecycle to influence system design, scalability, and reliability tradeoffs.
- Reduce Toil & Improve Developer Experience: automate operational tasks, improve runbooks, and build tooling that reduces manual work and accelerates safe execution.
- Drive Root Cause Resolution: guide teams through deep debugging of reliability issues, ensuring fixes address underlying causes rather than symptoms.
- Influence Reliability Culture: promote reliability-first thinking, strong operational hygiene, and shared ownership of production systems across engineering.
- Mentor & Level Up the Team: coach engineers on reliability principles, incident handling, infrastructure design, and operational best practices.
You Might Be a Fit If You…
- Have 7+ years of experience in site reliability engineering, infrastructure engineering, or backend software engineering.
- Have designed and operated highly available, production-grade systems supporting rapid product iteration.
- Are fluent in Python and/or TypeScript, and comfortable building automation and tooling to support reliability goals.
- Have deep experience with AWS, Kubernetes (EKS), Docker, and cloud-native architectures.
- Have implemented and evolved observability stacks (metrics, logs, traces) and know how to create high-signal alerting.
- Understand how to design, measure, and enforce SLOs, SLIs, and error budgets.
- Have supported systems built with modern stacks such as FastAPI, Vue.js, PostgreSQL (RDS), and event-driven architectures.
- Have improved reliability and operational maturity in environments using CI/CD pipelines, infrastructure as code, and modern deployment workflows.
- Can balance reliability, velocity, and cost, making pragmatic tradeoffs that serve customers and the business.
- Enjoy collaborating across Product, Backend, Frontend, and Infrastructure teams to improve system health.
- Thrive in a role that blends deep technical execution, system design, and leadership influence in a fast-moving environment.
Compensation
- Top-of-market salary and equity package
- Benefits (for U.S.-based full-time employees): Medical, dental & vision insurance coverage for you; 401(k) & Match; Equity; Flexible PTO; Parental Leave
Skills
Python, TypeScript, AWS, Kubernetes, EKS, Docker, FastAPI, Vue.js, PostgreSQL, RDS, CI/CD, Infrastructure as Code, Observability, SLOs, SLIs