AI Inference Engineer - Model Optimization & Deployment

242k – 290k | Foster City, CA | San Diego, CA | Seattle, WA | Hybrid | Apr 11

Summary

Optimizes and deploys large-scale AI models (LLMs, VLMs) for real-time inference on power-constrained vehicle hardware. Requires expertise in quantization, TensorRT compilation, custom CUDA kernels, and production C++/Python for edge devices.

About the role

Responsibilities

Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference workflows, and parameter-efficient fine-tuning (LoRA, QLoRA).
Architect and implement model conversion and compilation pipelines using TensorRT and TensorRT-LLM for edge deployment.
Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch reference models and compiled edge binaries.
Write and optimize custom CUDA kernels and TensorRT Plugins to maximize memory bandwidth and minimize latency on AI accelerators.
Write production-level, highly concurrent, and memory-safe C++ and Python code for real-time inference on vehicle SoCs.

Qualifications

Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference workflows (INT8, FP8, INT4, BF16/FP16).
Proven experience optimizing large-scale models (LLMs, VLMs, or VLAs) using KV-cache optimization (e.g., PagedAttention), speculative decoding, and efficient attention mechanisms (FlashAttention, linear attention).
Extensive experience with model conversion/compilation pipelines (TensorRT, TensorRT-LLM) and performing rigorous parity/latency benchmarking.
Proficiency in low-level programming for AI accelerators, specifically writing and optimizing custom CUDA kernels and TensorRT Plugins.
Production-level C++ (14/17/20) and Python programming skills, with experience writing concurrent, memory-safe, real-time inference code for edge devices.

Bonus Qualifications

Experience with distributed training pipelines and model/tensor parallelism (PyTorch Distributed, Ray, DeepSpeed, Megatron-LM) and runtime efficiency optimization for GPU clusters.
Familiarity with autonomous driving perception stacks (temporal 3D object detection, BEV, 3D Occupancy Networks) and processing multi-modal sensor streams (Vision, LiDAR, Radar).
Understanding of end-to-end autonomous driving paradigms (VLA models, closed-loop simulation validation).

Skills

TensorRT, TensorRT-LLM, CUDA, PyTorch, C++, Python, PTQ, QAT, LoRA, QLoRA, FlashAttention, PagedAttention, DeepSpeed, Ray
