
Senior Production Engineer, Operational Excellence

172k – 209k | San Francisco, CA | Sunnyvale, CA | Onsite | 5+ YOE
Summary

Senior Production Engineer ensures reliability, scalability, and performance of GPU cloud infrastructure powering AI workloads. Drives observability, incident response, automation, and operational improvements in large-scale distributed systems.

About the role

Responsibilities

  • Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe’s cloud platform, including establishing, measuring, and improving SLIs and SLOs
  • Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
  • Build, operate, and improve observability across Crusoe’s infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  • Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization

Requirements

  • 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Strong communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments

Nice-to-Haves

  • Experience working with Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self-healing systems, automated remediation, or event-driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments

Compensation

  • Base salary range: $172,000 – $209,000 + Bonus
  • Restricted Stock Units included

Skills

Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes, Linux, Terraform, Ansible, Python, Go, AWS, GCP, SRE, GPU workloads, HPC