Team Lead, Site Reliability Engineering - Storage Layer Service

Leads a team of SREs for MongoDB's Storage Layer Services, defining SLOs, capacity plans, and roadmaps for multi-tenant distributed storage systems underpinning Atlas. Requires 10+ years in distributed systems and 2+ years managing teams, with expertise in Kubernetes and IaC tools.

151k – 297k | Boston, MA | Charlotte, NC | New York, NY (+3 more) | DevOps / SRE | Hybrid | 10+ YOE

About the role

Responsibilities

  • Build and lead a team of 6-8 engineers, fostering a positive culture, handling career growth and performance conversations, and proactively removing blockers
  • Define and drive a clear technical vision and comprehensive roadmap for our multi-tenant distributed storage systems, balancing long-term strategic infrastructure goals with immediate engineering needs
  • Contribute through hands-on technical work, such as leading architectural design reviews, reviewing PRs, and stepping in to guide the team through complex operational challenges
  • Act as the primary liaison for the Storage Layer Services SRE team, collaborating closely with other engineering leaders to ensure platform alignment and manage stakeholder expectations

Requirements

  • 10+ years of experience working on software and operating distributed systems, with 2+ years managing engineering teams
  • Customer-focused mindset, treating internal developers as your primary users
  • Value efficiency in processes and operations, and have a track record of optimizing team workflows
  • Prefer automation over manual processes, fostering a culture of building software solutions to eliminate toil
  • Deep technical familiarity with Kubernetes ecosystems, containerization technologies, and modern IaC tooling (e.g., Terraform, Crossplane, or Operators)
  • Have operated or supported stateful storage or database systems at scale, and are comfortable with durability, consistency, and recovery trade-offs
  • Excel at translating complex business and engineering requirements into actionable, phased technical roadmaps
  • High level of empathy, responsibility, ownership, and accountability
  • Excellent verbal and written technical communication skills

Nice-to-Haves

  • Leading major architectural shifts, such as moving from legacy storage stacks to new multi-tenant storage architectures, including planning and executing large-scale data and workload migrations with tight availability and durability requirements
  • Managing and scaling infrastructure across multi-cloud environments (AWS, GCP, or Azure)
  • Designing secure, multi-tenant runtime environments at scale

Skills

Kubernetes, Terraform, Crossplane, AWS, GCP, Azure, Distributed Systems, Containerization, IaC, Storage Systems, MongoDB, Operators
