Software Engineer, Workload Enablement

293k – 455k | San Francisco, CA | Seattle, WA | Hybrid | 5+ YOE
Summary

Software Engineer enabling production AI workloads on new hardware platforms through porting, benchmarking, stress testing, and performance optimization. Requires 5+ years in ML systems, performance engineering, distributed systems, or HPC, plus hands-on experience with PyTorch, distributed training, and RDMA/NCCL.

About the role

Key Responsibilities

  • Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
  • Build a suite of benchmarks and stress tests that capture the real end-to-end behavior of our workloads by exercising every part of the system: CPU, GPU, memory subsystem, frontend, scale-up and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant components.
  • Deep-dive performance on distributed training/inference:
    • Collective performance and tuning (across NCCL/RCCL and internal libraries)
    • Overlap of compute/communication, kernel-level bottlenecks, memory bandwidth and scheduling effects
  • Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
  • Partner with systems and fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
  • Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.

Qualifications

  • BS in CS/EE (or equivalent practical experience).
  • 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
  • Strong hands-on experience with:
    • PyTorch and modern LLM training/inference stacks
    • Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
    • RDMA, and debugging/optimizing comms libraries (NCCL or RCCL) and their interaction with hardware/network
  • Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
  • Strong profiling/debugging skills (e.g., Nsight, rocprof, perf, flamegraphs; ability to reason from traces/counters).

Preferred Skills

  • Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (not just synthetic loops or microbenchmarks).
  • Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
  • Experience running and validating workloads in Kubernetes, and bridging “research code” into robust, repeatable infrastructure.
  • Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).
Skills
PyTorch, NCCL, RCCL, RDMA, Kubernetes, Python, C++, CUDA, HIP, Nsight, perf, distributed systems, ML systems, HPC, LLM training