Software Engineer, Workload Enablement

293k – 455k | San Francisco, CA | Seattle, WA | Hybrid | 5+ YOE | Mar 28

Summary

Software Engineer enabling production AI workloads on new hardware platforms through porting, benchmarking, stress testing, and performance optimization. Requires 5+ years of experience in ML systems or a related area, with expertise in distributed training, PyTorch, and RDMA/NCCL.

About the role

Key Responsibilities

Port and validate key inference and training workloads on new platforms/SKUs as they arrive; drive correctness, performance, and stability to an internal readiness bar.
Build a suite of benchmarks and stress tests that capture the real end-to-end (E2E) behavior of our workloads by exercising every part of a system: CPU, GPU, memory subsystem, frontend, scale-up, and scale-out networking (including WAN traffic, NVLink, and RDMA collectives), storage, thermals, and any other relevant components.
Deep-dive into distributed training/inference performance:
- Collective performance and tuning (across NCCL/RCCL and internal libraries)
- Overlap of compute/communication, kernel-level bottlenecks, memory bandwidth and scheduling effects
Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs (pass/fail, performance score, regression detection).
Partner with systems and fleet bring-up engineers to ensure the platform is not only stable and performant, but also operationally usable and scalable (containerization, K8s integration, telemetry hooks, failure triage loops).
Work cross-functionally with vendors and internal stakeholders by producing clear bug reports, minimal repros, and prioritized issue lists.

Qualifications

BS in CS/EE (or equivalent practical experience).
5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
Strong hands-on experience with:
- PyTorch and modern LLM training/inference stacks
- Large-scale distributed training concepts (data/model/pipeline parallel, collective comms)
- RDMA, plus debugging/optimizing comms libraries (NCCL or RCCL) and their interaction with hardware/network
Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
Strong profiling/debugging skills (e.g., Nsight, rocprof, perf, flamegraphs; ability to reason from traces/counters).

Preferred Skills

Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior (not just synthetic loops or microbenchmarks).
Familiarity with RDMA networking and transport tuning; understanding of how network topology and congestion impact collectives.
Experience running and validating workloads in Kubernetes, and bridging “research code” into robust, repeatable infrastructure.
Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).

Skills

PyTorch, NCCL, RCCL, RDMA, Kubernetes, Python, C++, CUDA, HIP, Nsight, perf, distributed systems, ML systems, HPC, LLM training
