Senior Manager, Site Reliability Engineering (Federal)
207k – 285k | Washington, DC | Hybrid
Summary
Lead and mentor multiple SRE teams overseeing Edge networking, Kubernetes platform, CI/CD, observability, and automation tooling for Okta’s high-scale SaaS infrastructure on AWS.
About the role
What you'll be doing
- Manage a team of SREs supporting our various workloads operating in private sector environments.
- Drive the microservice journey, DevOps maturity, and workload reliability in tandem with architects and teams across the organization.
- Accelerate the velocity of SRE and product engineering by developing powerful tooling, intuitive self-service capabilities, and robust self-healing patterns.
- Lead, mentor, and grow a high-performing team of engineers and managers across platform, infrastructure, and shared services domains.
- Perform engineering design evaluations and ensure the completion of projects within resource, budget, and scheduling constraints.
- Improve SDLC processes for cloud infrastructure as code, including the maturity of CI/CD pipelines and change and release management.
- Manage service and business expectations and prioritize resource allocation.
- Maintain a deep knowledge of industry best practices, evolving trends, and technologies.
What you’ll bring to the role
- 3+ years of experience in technical leadership & people management.
- Extensive experience using Agile and DevOps methodologies to build product infrastructure and shared services at scale.
- Experience running large-scale infrastructure platforms supporting a SaaS/cloud service in a public cloud, preferably AWS. Experience supporting a multi-cloud environment is a plus.
- Strong expertise in cloud-native architectures, containerization (Kubernetes), IaC (Terraform), and CI/CD pipelines.
- Strong background and hands-on experience in software development, PaaS, and automation.
- Deep experience building and operating observability platforms and monitoring tools (Grafana, Splunk, APM, etc.) in a large-scale environment.
- Effective verbal and written communication and interpersonal skills.
- Degree in Computer Science or a related field, or equivalent experience.
Additional requirements
- This position requires the ability to access federal environments and/or protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; 22 CFR 120.15) upon hire.
Skills
Kubernetes, Terraform, AWS, CI/CD, Grafana, Splunk, DevOps, Observability, Infrastructure as Code, Agile