Responsibilities
Advance inference efficiency end-to-end
- Design and prototype algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference.
- Implement and maintain changes in high-performance inference engines (e.g., SGLang- or vLLM-style systems, speculative decoding such as ATLAS, quantization).
- Profile and optimize performance across GPU, networking, and memory layers.
Unify inference with RL / post-training
- Design and operate RL and post-training pipelines (e.g., RLHF, RLAIF, GRPO, DPO-style methods, reward modeling).
- Optimize RL workloads with inference-aware techniques like async rollouts and speculative decoding.
- Train, evaluate, and iterate on frontier models.
- Co-design algorithms and infrastructure to identify bottlenecks.
- Run ablations and scale-up experiments.
Own critical systems at production scale
- Profile, debug, and optimize under real workloads.
- Drive roadmap items requiring engine modifications.
- Establish metrics, benchmarks, and experimentation frameworks.
Provide technical leadership (Staff level)
- Set technical direction for cross-team efforts.
- Mentor engineers and researchers.
Requirements
Deep expertise in one or more areas with breadth to work across the stack:
- Bias toward implementation and shipping.
- Expertise in: large-scale inference systems (SGLang, vLLM), RL/post-training for LLMs (GRPO, RLHF), model architecture, distributed systems/HPC for ML.
- Strong Python coding skills, with experience in performance profiling and optimization.
- Research foundation with a demonstrated track record (papers, open-source contributions, or production systems).
Minimum qualifications
- 3+ years of experience in ML systems, model training/inference, or equivalent.
- Advanced degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- Experience owning complex technical projects end-to-end.
Compensation
US base salary range: $200,000 - $280,000 + equity + benefits.