Research Infrastructure Engineer, Training Systems
Builds and maintains infrastructure for large-scale ML model training and experimentation. Designs APIs, improves reliability and performance of training pipelines, and debugs issues across Python, PyTorch, distributed systems, GPUs, networking, and storage.
Responsibilities
- Build and maintain infrastructure for large-scale model training and experimentation.
- Design APIs and interfaces that make complex training workflows easier to express and harder to misuse.
- Improve reliability, debuggability, and performance across training and data pipelines.
- Debug issues spanning Python, PyTorch, distributed systems, GPUs, networking, and storage.
- Write tests, benchmarks, and diagnostics that catch meaningful regressions.
Senior Staff Machine Learning Engineer, Communication & Connectivity
Lead ML architecture and implementation for Airbnb's Messaging & Notifications, building recommendation engines, ranking systems, and LLM-powered experiences while mentoring engineers.
Staff Software Engineer
Founding Staff Applied Agent Engineer who will architect and lead Traba's agentic platform, building production LLM/agent systems that integrate with customer WMS/TMS/ERP systems and drive industrial operations. Requires 7+ years of engineering experience, with 2+ years building production agent systems.
Member of Technical Staff — Model Optimization and Inference
Optimize inference for real-time multimodal AI avatars. Specialize in LLM and diffusion model serving, KV cache strategies, quantization, and low-latency frameworks like vLLM and TensorRT-LLM.