Distributed Systems Engineer, Data & Inference Platform
San Francisco, CA | Fremont, CA | Palo Alto, CA | Berkeley, CA | ML Engineering | Hybrid | 5+ YOE
Summary
Builds and operates distributed inference systems for LLMs at scale, along with large-scale data pipelines for training and evaluation. Requires 5+ years of experience with production distributed systems, working knowledge of the GPU stack, and experience with frameworks such as Ray or Spark.
About the role
Responsibilities
- Serve Models at Scale: Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost across heterogeneous GPU fleets. This covers batching, scheduling, KV cache management, and autoscaling.
- Move the Data: Build large-scale data pipelines (Ray Data, Spark, or equivalents) that ingest, transform, and curate the datasets behind training and evaluation.
- Debug the Undebuggable: Chase down failure modes under production traffic (stragglers, head-of-line blocking, silent data corruption, GPU memory fragmentation) and write postmortems. Define SLOs, build observability, and own the on-call rotation.
- Partner Across the Stack: Work directly with researchers and ML engineers to productionize experimental workloads.
Qualifications
- 5+ years building and operating distributed systems in production.
- Deep experience with at least one large-scale data or compute framework (Ray, Spark, Flink, Beam, Dask).
- Strong fluency in Python and at least one systems language (Go, Rust, C++).
- Working knowledge of the GPU/accelerator stack: CUDA fundamentals, NCCL, mixed precision, memory layout.
- Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
- Track record of owning hard production incidents end-to-end.
Bonus
- Hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI), modern lakehouse formats (Iceberg, Delta, Hudi), or open-source contributions to relevant projects.
Benefits
- Flexible work: In-person collaboration in the Bay Area.
- Adaption Passport: Annual travel stipend.
- Lunch Stipend: Weekly meal allowance.
- Well-Being: Comprehensive medical benefits and generous paid time off.
Skills
Ray, Spark, Kubernetes, Python, Go, Rust, C++, CUDA, NCCL, vLLM, TensorRT-LLM