Senior Production Engineer, Operational Excellence

172k – 209k | San Francisco, CA | Sunnyvale, CA | Onsite | 5+ YOE | May 5

Summary

The Senior Production Engineer ensures the reliability, scalability, and performance of the GPU cloud infrastructure that powers AI workloads. The role drives observability, incident response, automation, and operational improvements across large-scale distributed systems.

About the role

Responsibilities

Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization

Requirements

5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
Experience with monitoring and observability tools such as Prometheus and Grafana
Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
Scripting or programming experience with languages such as Go, Python, C, or C++
Strong communication skills and the ability to collaborate across engineering teams
Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments

Nice-to-Haves

Experience working with Kubernetes or container orchestration platforms at scale
Exposure to change management processes, operational readiness reviews, or structured root cause analysis
Experience designing self-healing systems, automated remediation, or event-driven operational tooling
Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments

Compensation

Base salary range: $172,000 – $209,000 + Bonus
Restricted Stock Units included

Skills

Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes, Linux, Terraform, Ansible, Python, Go, AWS, GCP, SRE, GPU workloads, HPC
