ML Model Serving Engineer

175k – 280k | San Francisco, CA | New York, NY | Bellevue, WA | ML Engineering | Onsite | Mar 14

Summary

Optimizes and extends ML model serving infrastructure for LLMs, speech, and vision models, focusing on high-throughput, low-latency inference using frameworks like vLLM and SGLang. Requires deep PyTorch expertise, systems programming, and performance engineering for reliable production deployment.

About the role

Responsibilities

Turbocharge our serving layer, which hosts a variety of LLM, speech, and vision models.
Partner with ML infrastructure and training engineers to build a fast, cost-effective, accurate, and reliable serving layer to power a new consumer product category.
Modify and extend LLM serving frameworks like vLLM and SGLang to take advantage of the latest techniques in high-performance model serving.
Work with the training team to identify opportunities to produce faster models without sacrificing quality.
Use techniques like in-flight batching, caching, and custom kernels to speed up inference.
Find ways to reduce model initialization times without sacrificing quality.

Required Qualifications

Expert in a differentiable array computing framework, preferably PyTorch.
Expert in optimizing machine learning models for serving reliably at high throughput, with low latency.
Significant systems programming experience (e.g., working on high-performance server systems; as comfortable with the internals of vLLM as with a complex PyTorch codebase).
Significant performance engineering experience (e.g., bottleneck analysis in high-scale server systems or profiling low-level systems code).
Always up to date on the latest techniques for model serving optimization.

Preferred Qualifications

Familiarity with high-performance LLM serving (e.g., experience with vLLM and SGLang deployment and internals).
Experience with a public cloud platform such as GCP, AWS, or Azure.
Experience deploying and scaling inference workloads in the cloud using Kubernetes, Ray, etc.
Track record of leading complex multi-month projects without assistance.

Benefits

401(k) max employer match: 3.5% of compensation
100% employer-paid health, vision, and dental benefits for you and your dependents
Unlimited PTO and sick time
Flexible spending account with employer matching up to $1,650/year (medical FSA)
Guardian Employee Assistance Program (EAP)
Competitive stock options

Skills

PyTorch, vLLM, SGLang, Kubernetes, Ray, GCP, AWS, Azure, LLM serving, performance optimization
