What You’ll Work On
- Scalable data loaders for training runs across thousands of GPUs
- Efficient storage and retrieval systems for petabyte-scale datasets
- Multi-cloud object storage abstraction
- Large-scale data migrations across storage systems and providers
- Debugging and resolving performance bottlenecks in distributed data loading
Technical Focus
- Python, PyTorch DataLoader internals
- Object storage (e.g. S3, Azure Blob, GCS)
- Parquet for metadata
- Video: ffmpeg, PyAV, codec fundamentals
What We’re Looking For
- Built and operated data pipelines at petabyte scale
- Optimized data loading for distributed training workloads
- Worked with petabyte-scale video and image datasets
- Written processing jobs operating on millions of files
- Debugged distributed system bottlenecks across large fleets of machines
Nice to Have
- Experience with streaming dataset formats (e.g. WebDataset)
- Video codec internals and frame-accurate seeking
- Distributed systems experience
- Slurm and Kubernetes for job orchestration
- Experience with object storage performance tuning across providers
Base Annual Salary (SF-based role): $180,000–$300,000 USD + Equity