Senior Manager, Site Reliability Engineering (Federal)

207k – 285k | Washington, DC | Hybrid | May 28

Summary

Lead and mentor multiple SRE teams overseeing Edge networking, Kubernetes platform, CI/CD, observability, and automation tooling for Okta’s high-scale SaaS infrastructure on AWS.

About the role

What you'll be doing

Manage a team of SREs supporting our various workloads operating in private sector environments.
Drive the microservice journey, DevOps maturity, and workload reliability in tandem with architects and teams across the organization.
Accelerate the velocity of SRE and product engineering by developing powerful tooling, intuitive self-service capabilities, and robust self-healing patterns.
Lead, mentor, and grow a high-performing team of engineers and managers across platform, infrastructure, and shared services domains.
Perform engineering design evaluations and ensure the completion of projects within resource, budget, and scheduling constraints.
Improve SDLC processes for cloud Infrastructure as Code, including the maturity of CI/CD pipelines and change and release management.
Manage service and business expectations and prioritize resource allocation.
Maintain a deep knowledge of industry best practices, evolving trends, and technologies.

What you’ll bring to the role

3+ years of experience in technical leadership & people management.
Extensive experience using Agile and DevOps methodologies to build product infrastructure and shared services at scale.
Experience running large-scale infrastructure platforms supporting a SaaS/cloud service in a public cloud, preferably AWS. Experience supporting a multi-cloud environment is a plus.
Strong expertise in cloud-native architectures, containerization (Kubernetes), IaC (Terraform), and CI/CD pipelines.
Strong background and hands-on experience in software development, PaaS, and automation.
Deep experience building and operating observability platforms and monitoring tools (Grafana, Splunk, APM, etc.) in a large-scale environment.
Effective verbal and written communication and interpersonal skills.
Degree in Computer Science or a related field, or equivalent experience.

Additional requirements

This position requires the ability to access federal environments and/or protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.

Skills

Kubernetes, Terraform, AWS, CI/CD, Grafana, Splunk, DevOps, Observability, Infrastructure as Code, Agile
