Senior Virtualization Validation Engineer

173k – 210k | San Francisco, CA | Sunnyvale, CA | Onsite | 5+ YOE
Summary

Validates large-scale multi-node GPU clusters running on QEMU and Cloud Hypervisor, focusing on interconnects such as NVLink and InfiniBand, collective communications (NCCL/RCCL), and performance in virtualized AI/HPC environments. Requires 5+ years of experience, virtualization expertise, and Linux kernel knowledge.

About the role

What You’ll Be Working On

  • Multi-Node Scaling Validation: Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
  • Interconnect & Fabric Testing: Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
  • Hypervisor & GPU Virtualization: Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
  • Collective Communication Benchmarking: Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
  • Network Stack Validation: Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
  • Automated Cluster Orchestration: Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
  • Performance Bottleneck Analysis: Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

What You’ll Bring to the Team

Education & Experience: 5+ years of experience with a demonstrated ability to perform the role's responsibilities competently and independently, plus a Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.

Virtualization Expertise: Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.

Distributed GPU Ecosystems: Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.

Networking Knowledge: Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.

System Internals: Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).

Automation & Scripting: Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.

Bonus Points:

  • Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
  • Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
  • Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).

Compensation

The salary range for this role is $172,500 - $210,000. Restricted Stock Units are included in all offers.

Skills
QEMU, Cloud Hypervisor, KVM, NCCL, RCCL, CUDA, ROCm, NVLink, InfiniBand, RoCE, RDMA, SR-IOV, VFIO, IOMMU, PCIe passthrough