Responsibilities

Serve Models at Scale: Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost across heterogeneous GPU fleets through batching, scheduling, KV cache management, and autoscaling.
Move the Data: Build large-scale data pipelines (Ray Data, Spark, or equivalents) that ingest, transform, and curate the datasets behind training and evaluation.
Debug the Undebuggable: Chase down failure modes under production traffic (stragglers, head-of-line blocking, silent data corruption, GPU memory fragmentation) and write postmortems. Define SLOs, build observability, and own the on-call rotation.
Partner Across the Stack: Work directly with researchers and ML engineers to productionize experimental workloads.

Qualifications

5+ years building and operating distributed systems in production.
Deep experience with at least one large-scale data or compute framework (Ray, Spark, Flink, Beam, Dask).
Strong fluency in Python and at least one systems language (Go, Rust, C++).
Working knowledge of the GPU/accelerator stack: CUDA fundamentals, NCCL, mixed precision, memory layout.
Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
Track record of owning hard production incidents end-to-end.

Preferred: Hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI) or modern lakehouse formats (Iceberg, Delta, Hudi), or open-source contributions to relevant projects.