Staff Site Reliability Engineer, Kubernetes w/ active TS/SCI
Senior SRE focused on Kubernetes-orchestrated cloud infrastructure for high-stakes national security environments. Manages reliability, incidents, automation, and scalability. Requires an active TS/SCI clearance and 5+ years of Kubernetes experience.
What You’ll Do
- Infrastructure Excellence: Design, deploy, and monitor Okta’s production infrastructure to ensure peak performance and reliability.
- Incident Management: Serve as a frontline responder to production incidents, performing deep-dive troubleshooting and implementing permanent preventive solutions.
- Aggressive Automation: Eliminate manual toil by developing automation scripts, evolving monitoring tools, and documenting technical workflows.
- Scalability: Support a highly available, large-scale environment as part of an on-call rotation, ensuring "Always On" service delivery.
What You’ll Bring
Core Requirements
- Clearance & Citizenship: Active TS/SCI clearance.
- Federal Compliance: Deep familiarity with FedRAMP and DoD IL6 compliance standards.
- Education: B.S. in Computer Science or equivalent professional experience.
Technical Expertise
- Kubernetes Mastery: 5+ years of experience building and operating workloads orchestrated by Kubernetes, including expert-level debugging of Helm values and charts.
- Systems & Scripting: Strong Linux systems administration background with proficiency in Go, Python, Bash, or Ruby.
- Cloud Infrastructure: Expertise in AWS services (EC2, ECS, KMS, CloudWatch) and Infrastructure as Code (Terraform or CloudFormation).
- Production Support: Experience managing Docker containers and web applications (Java/Apache/Tomcat) in high-traffic live environments.
- Networking: Solid understanding of networking concepts and IP protocols; experience with multi-cloud environments is a significant plus.
Lead Site Reliability Engineer
Lead SRE driving reliability strategy, infrastructure architecture, observability, and incident response for a B2B fintech platform on AWS and Kubernetes. Requires 7+ years building production-grade distributed systems.
Senior Developer Experience Engineer
Senior Platform Engineer focused on Developer Experience, building tools, automation, CI/CD systems, and AI tooling to improve developer productivity and workflows. Requires 7+ years of cloud experience, containerization experience, and proficiency in Ruby, Go, or Python.
Staff Network Engineer, Operations
Staff-level network operations engineer responsible for production reliability, incident response, and operational excellence across Crusoe's global edge, backbone, data center, and GPU cluster networks supporting AI workloads.