Engineer, Supercomputing & Distributed Systems
San Francisco, CA | DevOps / SRE | Onsite
Summary
Builds and operates supercomputing infrastructure for AI research, including 1000+ GPU Kubernetes clusters, distributed data pipelines processing petabytes, and fault-tolerant training systems. Requires strong distributed systems intuition and experience with Python, PyTorch, and large-scale infrastructure.
About the role
Responsibilities
Distributed data systems
- Design multi-stage pipelines that turn petabytes of raw data into clean, annotated datasets
- Run classification models on billions of images
- Deploy and combine LLMs to caption massive multimedia data
GPU infrastructure
- Manage distributed training and inference on 1000+ GPU Kubernetes clusters
- Solve orchestration and scaling for large-scale GPU job processing
- Scale workloads and research between clusters in multiple datacenters
Distributed training
- Profile and optimize dataloaders streaming thousands of images per second
- Profile and debug InfiniBand networking on huge training runs
- Build fault-tolerance systems for large-scale pretraining
- Collaborate with researchers on evolving RL infrastructure
Applied ML pipelines
- Find clean scenes in millions of videos using distributed shot-boundary detection
- Customize and train models that filter billions of images by answering questions like "is this a screenshot?"
- Build the systems that bridge raw cluster capacity and research output
Requirements
- Strong intuition for distributed systems and a solid mental model of how systems interact under different conditions
- Comfortable working heavily with Python, Kubernetes, PyTorch, and data tools such as DuckDB and Arrow
Strong candidates may have experience with:
- Python, PyArrow, DuckDB, SQL, massive relational databases, PyTorch, Pandas, NumPy
- Kubernetes
- Designing and implementing large-scale ETL systems
- Fundamental knowledge of containerization, operating systems, file systems, and networking
- Distributed systems design
- Distributed training systems (NCCL, InfiniBand, RDMA)
- Streaming and event processing systems (Kafka, Pulsar, or similar)
- PyTorch internals, custom dataloaders, and training infrastructure
Skills
Python, Kubernetes, PyTorch, DuckDB, PyArrow, InfiniBand, NCCL, RDMA, SQL, ETL, Arrow, Pandas, NumPy, Kafka, Pulsar