Manager, HPC Storage Engineer

150k – 240k · United States · Remote · 3+ YOE · Jan 28

Summary

Leads an engineering team that builds and operates large-scale distributed storage infrastructure for AI/HPC workloads, including SAN, NFS, VAST Data, and parallel filesystems. Requires 3+ years managing teams and 8+ years in storage systems, with deep expertise in high-performance data paths.

About the role

Responsibilities

Own Distributed Storage Architecture: Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.
Build the Storage Engineering Team: Manage and grow a team of storage and systems engineers. Set clear ownership, technical direction, and operational standards across regions.
High-Performance Shared Filesystems: Design and operate large-scale SAN and NFS deployments, including performance-sensitive shared storage for GPU clusters.
Advanced Filesystems & Platforms: Lead the deployment and operation of VAST Data, along with Lustre or similar parallel filesystems used in HPC and AI environments.
End-to-End Performance Ownership: Drive performance optimization from NAND and NVMe media through controllers, networking, and client access patterns.
Next-Generation Storage Technologies: Evaluate and deploy cutting-edge capabilities such as NFS over RDMA, GPU Direct Storage (GDS), and low-latency data paths for accelerated workloads.
Reliability & Scale: Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.
Automation & Observability: Build automation for provisioning, expansion, upgrades, and monitoring. Ensure deep observability into throughput, latency, and error characteristics.
Cross-Functional Collaboration: Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.
Vendor & Partner Management: Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.

Requirements

Engineering Leadership Experience: 3+ years managing storage, systems, or infrastructure engineering teams in production environments.
Distributed Storage Expertise: 8+ years designing and operating large-scale storage systems, including SAN and NFS architectures at multi-petabyte scale.
VAST Data Experience: Hands-on experience deploying, operating, or deeply integrating VAST Data in production environments is required.
Parallel Filesystems: Experience with Lustre or comparable HPC filesystems (e.g., GPFS, BeeGFS) supporting high-concurrency workloads.
Low-Level Storage Knowledge: Deep understanding of NAND, NVMe, PCIe, storage controllers, and performance characteristics across the stack.
High-Performance Data Paths: Proven experience with NFS over RDMA, RDMA-capable transports, or similar technologies. Familiarity with GPU Direct Storage strongly preferred.
Linux Systems Expertise: Strong Linux internals knowledge, including filesystems, I/O scheduling, memory management, and tuning for performance workloads.
Operational Excellence: Experience running 24/7 storage platforms with strong incident response, change management, and post-mortem discipline.
Communication & Leadership: Ability to clearly communicate complex technical tradeoffs and lead teams through high-stakes infrastructure decisions.

Preferred Qualifications

Experience supporting AI training pipelines, large-scale model checkpointing, and dataset streaming workloads.
Familiarity with RDMA fabrics and close collaboration with datacenter networking teams.
Experience designing storage systems for multi-tenant isolation and secure data access.
Background in hyperscale, HPC, or AI-focused infrastructure environments.
Experience building internal storage platforms or abstractions consumed by product teams.

Compensation

Competitive base pay: $150,000 - $240,000 USD
Meaningful equity (stock options)
Generous medical, dental & vision plans (100% coverage for employees)
Flexible PTO

Skills

VAST Data, Lustre, SAN, NFS, NVMe, NAND, RDMA, GPU Direct Storage, Linux, HPC filesystems
