Software Engineer, Inference - Performance Optimization
Models inference performance across application, model, and fleet layers, using microbenchmarks to build cost-to-serve estimates. Analyzes workloads end to end, enhances bottleneck-detection tools, and collaborates on optimizations for latency, throughput, and cost.
Responsibilities
- Build and refine performance models that translate microbenchmark results into cost-to-serve estimates.
- Analyze inference workloads end to end across applications, models, and fleet infrastructure.
- Enhance tooling that identifies latency and throughput bottlenecks across layers.
- Partner with other teams to turn performance insights into concrete improvements and to project how future changes will affect inference.
Requirements
- Enjoy reasoning from first principles about distributed systems, model inference, and hardware efficiency.
- Are comfortable working across abstraction layers, from application behavior to kernels, accelerators, networking, and fleet scheduling.
- Have deep expertise in performance profiling, benchmarking, analysis, and optimization.
- Enjoy collaborating with engineering and research teams to improve real production systems.
Senior Staff Machine Learning Engineer, Communication & Connectivity
Lead ML architecture and implementation for Airbnb's Messaging & Notifications, building recommendation engines, ranking systems, and LLM-powered experiences while mentoring engineers.
Staff Software Engineer
Founding Staff Applied Agent Engineer to architect and lead Traba's agentic platform, building production LLM/agent systems that integrate with customer WMS/TMS/ERP and drive industrial operations. Requires 7+ years of engineering experience, including 2+ years building production agent systems.
Member of Technical Staff — Model Optimization and Inference
Optimize inference for real-time multimodal AI avatars. Specialize in LLM and diffusion model serving, KV cache strategies, quantization, and low-latency frameworks like vLLM and TensorRT-LLM.