ML Model Serving Engineer
175k – 280k | San Francisco, CA | New York, NY | Bellevue, WA | ML Engineering | Onsite
Summary
Optimizes and extends ML model serving infrastructure for LLM, speech, and vision models, focusing on high-throughput, low-latency inference using frameworks like vLLM and SGLang. Requires deep PyTorch expertise, systems programming experience, and performance engineering skills for reliable production deployment.
About the role
Responsibilities
- Speed up our serving layer, which hosts a variety of LLM, speech, and vision models.
- Partner with ML infrastructure and training engineers to build a fast, cost-effective, accurate, and reliable serving layer to power a new consumer product category.
- Modify and extend LLM serving frameworks like vLLM and SGLang to take advantage of the latest techniques in high-performance model serving.
- Work with the training team to identify opportunities to produce faster models without sacrificing quality.
- Use techniques like in-flight batching, caching, and custom kernels to speed up inference.
- Find ways to reduce model initialization times without sacrificing quality.
Required Qualifications
- Expertise in a differentiable array computing framework, preferably PyTorch.
- Expertise in optimizing machine learning models for reliable serving at high throughput and low latency.
- Significant systems programming experience (e.g., working on high-performance server systems; as comfortable with the internals of vLLM as with a complex PyTorch codebase).
- Significant performance engineering experience (e.g., bottleneck analysis in high-scale server systems or profiling low-level systems code).
- Always up to date on the latest techniques for model serving optimization.
Preferred Qualifications
- Familiarity with high-performance LLM serving (e.g., experience with vLLM and SGLang deployment and internals).
- Experience with a public cloud platform such as GCP, AWS, or Azure.
- Experience deploying and scaling inference workloads in the cloud using Kubernetes, Ray, etc.
- Track record of leading complex multi-month projects without assistance.
Benefits
- 401(k) max employer match: 3.5% of compensation
- 100% employer-paid health, vision, and dental benefits for you and your dependents
- Unlimited PTO and sick time
- Flexible spending account with employer matching up to $1,650/year (medical FSA)
- Guardian Employee Assistance Program (EAP)
- Competitive stock options
Skills
PyTorch, vLLM, SGLang, Kubernetes, Ray, GCP, AWS, Azure, LLM serving, performance optimization