Staff Site Reliability Engineer, Kubernetes w/ active TS/SCI

188k – 235k · Washington, DC · Hybrid · 5+ YOE · Feb 10

Summary

Senior SRE role focused on Kubernetes-orchestrated cloud infrastructure for high-stakes national security environments. Covers reliability, incident response, automation, and scalability. Requires an active TS/SCI clearance and 5+ years of Kubernetes experience.

About the role

What You’ll Do

Infrastructure Excellence: Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability.
Incident Management: Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions.
Aggressive Automation: Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows.
Scalability: Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery.

What You’ll Bring

Core Requirements

Clearance & Citizenship: Active TS/SCI clearance.
Federal Compliance: Deep familiarity with FedRAMP and DoD IL6 compliance standards.
Education: B.S. in Computer Science or equivalent professional experience.

Technical Expertise

Kubernetes Mastery: 5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts.
Systems & Scripting: Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby.
Cloud Infrastructure: Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation).
Production Support: Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments.
Networking: Solid understanding of networking concepts and IP protocols; experience with multi-cloud environments is a significant plus.

Skills

Kubernetes, Helm, AWS, Terraform, Docker, Linux, Python, Go, Bash, FedRAMP
