Operations Engineer, Fleet Reliability

Hands-on operations engineer maintaining GPU clusters (B300, H200, H100), troubleshooting hardware/software issues, monitoring fleet health, and automating runbooks. Requires Linux admin, GPU debugging, observability tools experience, and on-call comfort.

United States | DevOps / SRE | Remote


About the role

Responsibilities

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
Troubleshoot hardware and software issues across compute, network, and storage
Monitor fleet health, take remediation action, and push fixes upstream when needed
Write the runbooks. Improve the ones that exist. Delete the ones that don't work

Requirements

Administered Linux systems in the critical path before
Troubleshot GPU node issues: NVLink, NCCL, InfiniBand (IB), driver and firmware bugs
Worked with observability systems such as Grafana and Prometheus
Scripted your way out of repetitive work (Bash, Python, Go, whatever)
Curious. You don't accept "it's flaky" as a root cause
Comfortable with ambiguity. The runbook doesn't exist yet for half of what you'll do
On-call doesn't scare you
You'd rather automate a problem than fix it twice

Skills

Linux, GPU, NVLink, NCCL, InfiniBand, Grafana, Prometheus, Bash, Python, Go
