Software Engineer, Inference - Performance Optimization

295k – 555k | San Francisco, CA | Onsite
Summary

Builds models of inference performance across the application, model, and fleet layers, using microbenchmarks to produce cost-to-serve estimates. Analyzes workloads end to end, improves bottleneck-detection tooling, and collaborates with other teams on latency, throughput, and cost optimizations.

About the role

Responsibilities

  • Build and refine performance models that translate microbenchmark results into cost-to-serve estimates.
  • Analyze inference workloads end to end across applications, models, and fleet infrastructure.
  • Enhance tooling to identify bottlenecks across layers for latency and throughput.
  • Partner with other teams to turn performance insights into concrete improvements and project how future changes affect inference.

Requirements

  • Enjoy reasoning from first principles about distributed systems, model inference, and hardware efficiency.
  • Comfortable working across abstraction layers, from application behavior to kernels, accelerators, networking, and fleet scheduling.
  • Deep expertise with performance profiling, benchmarking, analysis, and optimization.
  • Enjoy collaborating with engineering and research teams to improve real production systems.
Skills
performance modeling, profiling, benchmarking, distributed systems, model inference, hardware optimization, kernels, accelerators, networking, fleet scheduling