Responsibilities

Design multi-stage pipelines that turn petabytes of raw data into clean, annotated datasets
Run classification models on billions of images
Deploy and combine LLMs to caption large volumes of multimedia data

Find clean scenes in millions of videos using distributed shot-boundary detection
Customize and train models that filter billions of images by answering questions like "is this a screenshot?"
Build the systems that bridge raw cluster capacity and research output

Requirements

Strong intuition for distributed systems and a solid mental model of how systems interact under different conditions
Comfort working heavily with Python, Kubernetes, PyTorch, and data tools like DuckDB and Arrow

Strong candidates may have experience with:

Python, PyArrow, DuckDB, SQL, massive relational databases, PyTorch, Pandas, NumPy
Kubernetes
Designing and implementing large-scale ETL systems
Fundamental knowledge of containerization, operating systems, file systems, and networking
Distributed systems design
Distributed training systems (NCCL, InfiniBand, RDMA)
Streaming and event processing systems (Kafka, Pulsar, or similar)
PyTorch internals, custom dataloaders, and training infrastructure