Senior Virtualization Validation Engineer

173k – 210k | San Francisco, CA | Sunnyvale, CA | Onsite | 5+ YOE | Apr 20

Summary

Validates large-scale multi-node GPU clusters running on QEMU and Cloud Hypervisor, focusing on interconnects such as NVLink and InfiniBand, collective communications (NCCL/RCCL), and performance in virtualized AI/HPC environments. Requires 5+ years of experience, virtualization expertise, and Linux kernel knowledge.

About the role

What You’ll Be Working On

Multi-Node Scaling Validation: Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
Interconnect & Fabric Testing: Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
Hypervisor & GPU Virtualization: Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
Collective Communication Benchmarking: Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
Network Stack Validation: Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
Automated Cluster Orchestration: Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
Performance Bottleneck Analysis: Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

What You’ll Bring to the Team

Education & Experience: 5+ years of experience with a demonstrated ability to perform the role's responsibilities competently and independently, plus a Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.

Virtualization Expertise: Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.

Distributed GPU Ecosystems: Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.

Networking Knowledge: Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.

System Internals: Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).

Automation & Scripting: Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.

Bonus Points:

Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).

Compensation

Compensation will be paid in the range of $172,500 - $210,000. Restricted Stock Units are included in all offers.

Skills

QEMU, Cloud Hypervisor, KVM, NCCL, RCCL, CUDA, ROCm, NVLink, InfiniBand, RoCE, RDMA, SR-IOV, VFIO, IOMMU, PCIe passthrough
