
Distributed Systems Engineer, Data & Inference Platform

San Francisco, CA | Fremont, CA | Palo Alto, CA | Berkeley, CA | ML Engineering | Hybrid | 5+ YOE

Summary

Builds and operates distributed inference systems for LLMs at scale and large-scale data pipelines for training/evaluation. Requires 5+ years in production distributed systems, GPU expertise, and frameworks like Ray/Spark.

About the role

Responsibilities

  • Serve Models at Scale: Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost across heterogeneous GPU fleets. Scope includes batching, scheduling, KV cache management, and autoscaling.
  • Move the Data: Build large-scale data pipelines (Ray Data, Spark, or equivalents) that ingest, transform, and curate the datasets behind training and evaluation.
  • Debug the Undebuggable: Chase down failure modes under production traffic (stragglers, head-of-line blocking, silent data corruption, GPU memory fragmentation) and write postmortems. Define SLOs, build observability, and own the on-call rotation.
  • Partner Across the Stack: Work directly with researchers and ML engineers to productionize experimental workloads.

Qualifications

  • 5+ years building and operating distributed systems in production.
  • Deep experience with at least one large-scale data or compute framework (Ray, Spark, Flink, Beam, Dask).
  • Strong fluency in Python and at least one systems language (Go, Rust, C++).
  • Working knowledge of the GPU/accelerator stack: CUDA fundamentals, NCCL, mixed precision, memory layout.
  • Experience operating Kubernetes-based infrastructure, including custom operators or schedulers.
  • Track record of owning hard production incidents end-to-end.

Bonus

  • Hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI), modern lakehouse formats (Iceberg, Delta, Hudi), or open-source contributions to relevant projects.

Benefits

  • Flexible work: In-person collaboration in the Bay Area.
  • Adaption Passport: Annual travel stipend.
  • Lunch Stipend: Weekly meal allowance.
  • Well-Being: Comprehensive medical benefits and generous paid time off.

Skills

Ray, Spark, Kubernetes, Python, Go, Rust, C++, CUDA, NCCL, vLLM, TensorRT-LLM