Software Engineer, Workload Enablement
Software Engineer enabling production AI workloads on new hardware platforms through porting, benchmarking, stress testing, and performance optimization. Requires 5+ years of experience in ML systems, performance engineering, distributed systems, or HPC, with hands-on expertise in PyTorch, distributed training, and RDMA/NCCL.
Key Responsibilities
- Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
- Build a suite of benchmarks and stress tests that capture the real end-to-end behavior of our workloads by exercising every part of a system: CPU, GPU, memory subsystem, frontend, scale-up, and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant components.
- Deep-dive into the performance of distributed training/inference:
  - Collective performance and tuning (across NCCL/RCCL and internal libraries)
  - Overlap of compute/communication, kernel-level bottlenecks, memory bandwidth, and scheduling effects
- Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
- Partner with systems and fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
- Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.
Qualifications
- BS in CS/EE (or equivalent practical experience).
- 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
- Strong hands-on experience with:
  - PyTorch and modern LLM training/inference stacks
  - Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
- Experience with RDMA and debugging/optimizing comms libraries (NCCL or RCCL) and their interaction with hardware/network
- Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
- Strong profiling/debugging skills (e.g., Nsight, rocprof, perf, flamegraphs; ability to reason from traces/counters).
Preferred Skills
- Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (not just synthetic loops or microbenchmarks).
- Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
- Experience running and validating workloads in Kubernetes, and bridging “research code” into robust, repeatable infrastructure.
- Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).