Engineer, Supercomputing & Distributed Systems

San Francisco, CA | DevOps / SRE | Onsite
Summary

Builds and operates supercomputing infrastructure for AI research, including 1000+ GPU Kubernetes clusters, distributed data pipelines that process petabytes of data, and fault-tolerant training systems. Requires strong distributed systems intuition and experience with Python, PyTorch, and large-scale infrastructure.

About the role

Responsibilities

Distributed data systems

  • Design multi-stage pipelines that turn petabytes of raw data into clean, annotated datasets
  • Run classification models on billions of images
  • Deploy and combine LLMs to caption massive multimedia data

GPU infrastructure

  • Manage distributed training and inference on 1000+ GPU Kubernetes clusters
  • Solve orchestration and scaling for large-scale GPU job processing
  • Scale workloads and research between clusters in multiple datacenters

Distributed training

  • Profile and optimize dataloaders streaming thousands of images per second
  • Profile and debug InfiniBand networking on huge training runs
  • Build fault-tolerance systems for large-scale pretraining
  • Collaborate with researchers on evolving RL infrastructure

Applied ML pipelines

  • Find clean scenes in millions of videos using distributed shot-boundary detection
  • Customize and train models that filter billions of images by answering questions like "is this a screenshot?"
  • Build the systems that bridge raw cluster capacity and research output

Requirements

  • Strong intuition for distributed systems and a solid mental model of how systems interact under different conditions
  • Heavy day-to-day work with Python, Kubernetes, PyTorch, and data tools such as DuckDB and Arrow

Strong candidates may have experience with:

  • Python, PyArrow, DuckDB, SQL, massive relational databases, PyTorch, Pandas, NumPy
  • Kubernetes
  • Designing and implementing large-scale ETL systems
  • Fundamental knowledge of containerization, operating systems, file systems, and networking
  • Distributed systems design
  • Distributed training systems (NCCL, InfiniBand, RDMA)
  • Streaming and event processing systems (Kafka, Pulsar, or similar)
  • PyTorch internals, custom dataloaders, and training infrastructure

Skills

Python, Kubernetes, PyTorch, DuckDB, PyArrow, InfiniBand, NCCL, RDMA, SQL, ETL, Arrow, Pandas, NumPy, Kafka, Pulsar