
Staff Engineer, Distributed Storage and HPC & AI Infrastructure

250k – 300k | San Francisco, CA | Onsite | 8+ YOE
Summary

Design and operate multi-petabyte distributed storage systems for large-scale AI training and inference, integrating parallel filesystems and building Kubernetes-native storage platforms.

About the role

Responsibilities

  • Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
  • Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
  • Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
  • Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
  • Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
  • Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
  • Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Performance: Storage optimization for GPU workloads, RDMA/InfiniBand networking, and parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice-to-Haves

  • GPUDirect Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)
Skills
WekaFS, Ceph, Lustre, Kubernetes, Go, Python, Terraform, Ansible, Helm, ArgoCD, Prometheus, Grafana, RDMA, InfiniBand, NVMe-oF