Member of Technical Staff — Model Optimization and Inference

200k – 300k | Seattle, WA | Onsite | Entry level

Summary

Early-career engineer optimizing inference for real-time multimodal AI avatars. Focus on KV cache strategies, serving frameworks, quantization, and latency reduction for LLMs and diffusion models.

About the role

What You’ll Do

  • Contribute to end-to-end inference optimization across our model stack — LLMs, audio models, and diffusion-based components
  • Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention
  • Work with inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and extend them for our specific workloads
  • Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks
  • Build internal tooling that makes optimization work faster and more rigorous — profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly
  • Accelerate diffusion model inference — consistency models, step distillation, caching strategies, and custom kernel optimizations
  • Apply quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
  • Work closely with research and infrastructure to ensure new models ship with optimized serving from day one

What We’re Looking For

  • BS, MS, or PhD in CS, ML, or a related field — completed or in the final stretch
  • Strong fundamentals in LLM inference or ML systems — KV caching, memory layout, attention kernels, batching, or serving — picked up through coursework, research, internships, or open-source
  • Exposure to inference serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — even at a research or hobby level
  • Strong Python and PyTorch skills; familiarity with CUDA or Triton is a significant plus
  • A systematic approach to profiling and optimization — you measure first, then optimize
  • Curiosity about diffusion inference, speculative decoding, quantization, or other inference-time acceleration techniques

Bonus Points

  • Internship or research experience with LLM inference, ML systems, or model serving
  • Contributions to open-source inference frameworks (vLLM, SGLang, TensorRT-LLM, etc.)
  • CUDA / Triton kernel work, even at a research or hobby scale
  • Publications or research projects in MLSys, model compression, or inference optimization
  • Familiarity with multimodal or streaming inference architectures
  • Experience with hard latency SLAs in any real-time system

Compensation

$200,000 – $300,000 base salary, plus meaningful equity.

Benefits

  • Health: HSA plan with ~$2,000 in annual company contributions
  • Time off: 15 days of PTO plus public holidays, and we close the office for a full week at year-end
  • Food: Lunch, drinks, and snacks on us every workday
  • Commuter benefits
  • 401(k): In the works

Skills

Python, PyTorch, vLLM, SGLang, TensorRT-LLM, CUDA, Triton, KV caching, Quantization, Model optimization