Team Lead, Site Reliability Engineering - Storage Layer Service

Leads a team of SREs for MongoDB's Storage Layer Services, defining SLOs, capacity plans, and roadmaps for multi-tenant distributed storage systems underpinning Atlas. Requires 10+ years in distributed systems and 2+ years managing teams, with expertise in Kubernetes and IaC tools.

151k – 297k | Boston, MA | Charlotte, NC | New York, NY +3 more | DevOps / SRE | Hybrid | 10+ YOE

About the role

Responsibilities

Build and lead a team of 6-8 engineers, fostering a positive culture, handling career growth and performance conversations, and proactively removing blockers
Define and drive a clear technical vision and comprehensive roadmap for our multi-tenant distributed storage systems, balancing long-term strategic infrastructure goals with immediate engineering needs
Contribute through hands-on technical work, such as leading architectural design reviews, reviewing PRs, and stepping in to guide the team through complex operational challenges
Act as the primary liaison for the Storage Layer Services SRE team, collaborating closely with other engineering leaders to ensure platform alignment and manage stakeholder expectations

Requirements

10+ years of experience working on software and operating distributed systems, with 2+ years managing engineering teams
Customer-focused mindset, treating internal developers as your primary users
Value efficiency in processes and operations, and have a track record of optimizing team workflows
Prefer automation over manual processes, fostering a culture of building software solutions to eliminate toil
Deep technical familiarity with Kubernetes ecosystems, containerization technologies, and modern IaC tooling (e.g., Terraform, Crossplane, or Operators)
Experience operating or supporting stateful storage or database systems at scale, and comfort with durability, consistency, and recovery trade-offs
Excel at translating complex business and engineering requirements into actionable, phased technical roadmaps
High level of empathy, responsibility, ownership, and accountability
Excellent verbal and written technical communication skills

Nice-to-Haves

Leading major architectural shifts, such as moving from legacy storage stacks to new multi-tenant storage architectures, including planning and executing large-scale data and workload migrations with tight availability and durability requirements
Managing and scaling infrastructure across multi-cloud environments (AWS, GCP, or Azure)
Designing secure, multi-tenant runtime environments at scale

Skills

Kubernetes, Terraform, Crossplane, AWS, GCP, Azure, Distributed Systems, Containerization, IaC, Storage Systems, MongoDB, Operators
