Staff Site Reliability Engineer, Core IDaaS (Active TS/SCI Required)
Leads SRE for Core IDaaS in federal air-gapped environments, designing AWS infrastructure with Terraform/Helm/Go, managing incidents, and ensuring compliance (FedRAMP/IL6). Requires 6+ years of SRE experience, an active TS/SCI clearance, and a bachelor’s degree.
What You’ll Do
- Cloud & Air-Gapped Infrastructure: Design and deliver AWS-based projects, primarily writing Terraform, Helm, and Go, and adapting deployments for secure federal air-gapped environments.
- Incident Management: Respond to and remediate production incidents, performing deep-dive troubleshooting, driving rapid response, and implementing permanent preventive solutions.
- Engineering Standards: Drive high-quality code and operational rigor through design reviews, code reviews, attention to detail, and deep technical expertise.
- Technical Leadership: Mentor junior engineers and collaborate with cross-functional teams to deliver secure, enterprise-grade solutions.
What You’ll Bring
Core Qualifications
- Security Clearance: Active U.S. TS/SCI clearance.
- Compliance Expertise: Proven experience navigating Federal and DoD compliance frameworks, specifically FedRAMP and Impact Level 6 (IL6).
- Domain Authority: Deep expertise in architecting, deploying, and optimizing software within federal air-gapped environments.
- Education: Bachelor’s degree in Computer Science or a related technical field (Master’s degree preferred).
Technical Expertise
- Site Reliability Engineering: 6+ years of professional experience running production cloud workloads at scale. 4+ years of experience developing and troubleshooting web services on Kubernetes or similar orchestration layers.
- Broad Database Knowledge: Significant hands-on experience with both relational and non-relational datastores.
- Interactive and Batch Workloads: Demonstrated success delivering and maintaining both customer-facing interactive workloads and large-scale batch processing, ideally using data warehouse products such as Snowflake, Redshift, or Databricks.
- Security & Engineering: Strong foundational knowledge of network security (authentication/authorization) and a commitment to rigorous software engineering best practices.
- Industry Experience: Prior experience supporting or building mission-critical Enterprise SaaS platforms.