OS / K8s Systems Engineer

165k – 330k · San Francisco, CA · New York, NY · DevOps / SRE · Hybrid

Summary

Build automation and systems to provision and orchestrate GPU hardware into scalable Kubernetes clusters. Requires deep Linux expertise, provisioning experience, and strong programming skills in Python or Go.

About the role

Responsibilities

  • Own the end-to-end automation of cluster bring-up and lifecycle management.
  • Build and maintain OS images, provisioning systems, and configuration pipelines.
  • Deploy and operate cluster orchestration platforms (Kubernetes, Slurm, or similar).
  • Design systems for reproducibility across sites and hardware generations.
  • Automate upgrades, rollouts, and failure recovery.
  • Optimize system performance, including GPU utilization and networking.
  • Partner with hardware and network teams to validate and improve system behavior.

Requirements

  • Experience building and operating automated infrastructure systems.
  • Strong programming skills (Python, Go, or similar).
  • Deep familiarity with Linux systems, including boot processes, drivers, and performance.
  • Experience with provisioning systems (PXE, imaging, configuration management).
  • Experience with Kubernetes.
  • Strong debugging skills across system layers (hardware → OS → network).
  • Experience working with GPU or high-performance workloads is a plus.

Skills

Kubernetes, Linux, Python, Go, PXE, GPU, Provisioning, Configuration Management, Slurm, Debugging