
Operations Engineer, Fleet Reliability

Hands-on operations engineer maintaining GPU clusters (B300, H200, H100), troubleshooting hardware/software issues, monitoring fleet health, and automating runbooks. Requires Linux admin, GPU debugging, observability tools experience, and on-call comfort.

United States · DevOps / SRE · Remote

About the role

Responsibilities

  • Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
  • Troubleshoot hardware and software issues across compute, network, and storage
  • Monitor fleet health, take remediation action, and push fixes upstream when needed
  • Write the runbooks. Improve the ones that exist. Delete the ones that don't work

Requirements

  • Administered Linux systems in the critical path
  • Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
  • Worked with observability systems like Grafana and Prometheus
  • Scripted your way out of repetitive work (Bash, Python, Go, whatever)
  • Curious. You don't accept "it's flaky" as a root cause
  • Comfortable with ambiguity. The runbook doesn't exist yet for half of what you'll do
  • On-call doesn't scare you
  • You'd rather automate a problem than fix it twice

Skills

Linux, GPU, NVLink, NCCL, InfiniBand, Grafana, Prometheus, Bash, Python, Go
