OS / K8s Systems Engineer

165k – 330k · San Francisco, CA · New York, NY · DevOps / SRE · Hybrid · May 1

Summary

Build automation and systems to provision and orchestrate GPU hardware into scalable Kubernetes clusters. Requires deep Linux expertise, provisioning experience, and strong programming skills in Python or Go.

About the role

Responsibilities

Own the end-to-end automation of cluster bring-up and lifecycle management.
Build and maintain OS images, provisioning systems, and configuration pipelines.
Deploy and operate cluster orchestration platforms (Kubernetes, Slurm, or similar).
Design systems for reproducibility across sites and hardware generations.
Automate upgrades, rollouts, and failure recovery.
Optimize system performance, including GPU utilization and networking.
Partner with hardware and network teams to validate and improve system behavior.

Requirements

Experience building and operating automated infrastructure systems.
Strong programming skills (Python, Go, or similar).
Deep familiarity with Linux systems, including boot processes, drivers, and performance.
Experience with provisioning systems (PXE, imaging, configuration management).
Experience with Kubernetes.
Strong debugging skills across system layers (hardware → OS → network).
Experience working with GPU or high-performance workloads is a plus.

Skills

Kubernetes, Linux, Python, Go, PXE, GPU, Provisioning, Configuration Management, Slurm, Debugging
