Requirements
- Strong expertise in at least one of the following areas, with an interest in growing across the others: large-scale inference systems (SGLang, vLLM, FasterTransformer, TensorRT), GPU performance, and distributed serving; RL/post-training for LLMs (GRPO, RLHF/RLAIF, DPO); Transformer architectures; distributed systems/HPC for ML.
- Comfortable working across the stack, from algorithms to engines: strong Python coding; profiling and optimizing GPU, networking, and memory performance; implementing production-grade features.
- Solid research foundation: track record in ML systems/RL/large-scale training (papers, open-source, production); ability to read papers and implement changes.
- Full-stack problem-solving: identify bottlenecks, collaborate across teams.
Minimum qualifications:
- 3+ years of experience in ML systems, large-scale model training/inference, or equivalent.
- Advanced degree in CS, EE, or related field, or equivalent experience.
- Experience owning complex technical projects end-to-end.
Responsibilities
- Advance inference efficiency: Design and prototype algorithms, architectures, and scheduling strategies; implement them in inference engines and related components (SGLang/vLLM, ATLAS, quantization); profile and optimize GPU, networking, and memory usage.
- Unify inference with RL/post-training: Design and operate RL pipelines (RLHF, RLAIF, GRPO, DPO); optimize them with inference-aware techniques (async rollouts, speculative decoding); train and evaluate frontier models; co-design algorithms and infrastructure; run ablations.
- Own production systems: Profile, debug, and optimize services; drive engine modifications (kernels, scheduling, APIs); establish metrics and benchmarks.
- Technical leadership (Staff level): Set direction for cross-team efforts; mentor engineers/researchers.
Compensation
US base salary: $200,000 - $280,000 + equity + benefits.