AI Inference Engineer - Model Optimization & Deployment

242k – 290k · Foster City, CA · San Diego, CA · Seattle, WA · Hybrid

Summary

Optimizes and deploys large-scale AI models (LLMs, VLMs) for real-time inference on power-constrained vehicle hardware. Requires expertise in quantization, TensorRT compilation, custom CUDA kernels, and production C++/Python for edge devices.

About the role

Responsibilities

  • Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference workflows, and parameter-efficient fine-tuning (LoRA, QLoRA).
  • Architect and implement model conversion and compilation pipelines using TensorRT and TensorRT-LLM for edge deployment.
  • Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch reference models and compiled edge binaries.
  • Write and optimize custom CUDA kernels and TensorRT Plugins to maximize memory bandwidth and minimize latency on AI accelerators.
  • Write production-level, highly concurrent, and memory-safe C++ and Python code for real-time inference on vehicle SoCs.

Qualifications

  • Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference workflows (INT8, FP8, INT4, BF16/FP16).
  • Proven experience optimizing large-scale models (LLMs, VLMs, or VLAs) using KV-cache optimization (e.g., PagedAttention), speculative decoding, and efficient attention mechanisms (FlashAttention, linear attention).
  • Extensive experience with model conversion/compilation pipelines (TensorRT, TensorRT-LLM) and performing rigorous parity/latency benchmarking.
  • Proficiency in low-level programming for AI accelerators, specifically writing and optimizing custom CUDA kernels and TensorRT Plugins.
  • Production-level C++ (14/17/20) and Python programming skills, with experience writing concurrent, memory-safe, real-time inference code for edge devices.

Bonus Qualifications

  • Experience with distributed training pipelines and model/tensor parallelism (PyTorch Distributed, Ray, DeepSpeed, Megatron-LM) and runtime efficiency optimization for GPU clusters.
  • Familiarity with autonomous driving perception stacks (temporal 3D object detection, BEV, 3D Occupancy Networks) and processing multi-modal sensor streams (Vision, LiDAR, Radar).
  • Understanding of end-to-end autonomous driving paradigms (VLA models, closed-loop simulation validation).

Skills

TensorRT, TensorRT-LLM, CUDA, PyTorch, C++, Python, PTQ, QAT, LoRA, QLoRA, FlashAttention, PagedAttention, DeepSpeed, Ray