Senior Manager, Site Reliability Engineering (Federal)

207k – 285k | Washington, DC | Hybrid
Summary

Lead and mentor multiple SRE teams overseeing Edge networking, Kubernetes platform, CI/CD, observability, and automation tooling for Okta’s high-scale SaaS infrastructure on AWS.

About the role

What you'll be doing

  • Manage a team of SREs supporting our various workloads operating in private sector environments.
  • Drive the microservice journey, DevOps maturity, and workload reliability in tandem with architects and teams across the organization.
  • Accelerate the velocity of SRE and product engineering by developing powerful tooling, intuitive self-service capabilities, and robust self-healing patterns.
  • Lead, mentor, and grow a high-performing team of engineers and managers across platform, infrastructure, and shared services domains.
  • Perform engineering design evaluations and ensure the completion of projects within resource, budget, and scheduling constraints.
  • Improve SDLC processes for cloud infrastructure as code, including the maturity of CI/CD pipelines and change and release management.
  • Manage service and business expectations and prioritize resource allocation.
  • Maintain a deep knowledge of industry best practices, evolving trends, and technologies.

What you’ll bring to the role

  • 3+ years of experience in technical leadership & people management.
  • Extensive experience using Agile and DevOps methodologies to build product infrastructure and shared services at scale.
  • Experience running large-scale infrastructure platforms supporting a SaaS/cloud service in a public cloud, preferably AWS. Experience supporting a multi-cloud environment is a plus.
  • Strong expertise in cloud-native architectures, containerization (Kubernetes), IaC (Terraform), and CI/CD pipelines.
  • Strong background and hands-on experience in software development, PaaS, and automation.
  • Deep experience building and operating observability platforms and monitoring tools (Grafana, Splunk, APM, etc.) in a large-scale environment.
  • Effective verbal and written communication and interpersonal skills.
  • Degree in Computer Science or a related field, or equivalent experience.

Additional requirements

  • This position requires the ability to access federal environments and/or protected federal data. As a condition of employment for this position, the successful candidate must be able to submit documentation establishing U.S. Person status (e.g., a U.S. Citizen, National, Lawful Permanent Resident, Refugee, or Asylee; see 22 CFR 120.15) upon hire.
Skills
Kubernetes, Terraform, AWS, CI/CD, Grafana, Splunk, DevOps, Observability, Infrastructure as Code, Agile