Senior Production Engineer, Operational Excellence
The Senior Production Engineer ensures the reliability, scalability, and performance of the GPU cloud infrastructure powering AI workloads, and drives observability, incident response, automation, and operational improvements in large-scale distributed systems.
Responsibilities
- Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
- Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
- Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
- Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
- Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
- Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
- Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
Requirements
- 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
- Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
- Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
- Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
- Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
- Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
- Experience with monitoring and observability tools such as Prometheus and Grafana
- Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
- Scripting or programming experience with languages such as Go, Python, C, or C++
- Strong communication skills and the ability to collaborate across engineering teams
- Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
Nice-to-Haves
- Experience working with Kubernetes or container orchestration platforms at scale
- Exposure to change management processes, operational readiness reviews, or structured root cause analysis
- Experience designing self-healing systems, automated remediation, or event-driven operational tooling
- Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
Compensation
- Base salary range: $172,000 – $209,000 + Bonus
- Restricted Stock Units included