# HPC/GPU Cluster Architect
**Company:** [Sfcompute](https://hotfix.jobs/companies/sfcompute)
**Location:** San Francisco, CA
**Salary:** $220K-$300K
**Experience:** 5+ years
**Skills:** GPU, HPC, Kubernetes, Slurm, InfiniBand, RDMA, Linux, PCIe, Infrastructure as Code, KVM
**Posted:** 2026-04-13
> Designs, architects, and scales production GPU/HPC clusters globally. Debugs hardware and software issues, automates operations, and mentors junior engineers. Requires 5+ years of experience and a hybrid presence in San Francisco.
## Job Description
## Responsibilities
- Architect and deploy new GPU/HPC clusters around the world
- Keep clusters running smoothly
- Participate in on-call rotation
- Deploy new environments and fix issues
- Lean into automation for deployments at scale
- Mentor junior engineers and shape team culture

## Requirements
- 5+ years of experience designing, architecting, and scaling HPC or GPU compute clusters in production
- Deep understanding of server hardware: GPUs, NICs, PCIe, memory, thermals, power
- Comfortable debugging performance/reliability across hardware, OS, drivers, networking
- Automate fleet operations (provisioning, monitoring, remediation) with infrastructure-as-code
- Write strong operational documentation and runbooks
- Open to working from the SF office 3-4 days/week and to domestic travel

## Nice to Haves
- Data center operations: power, cooling, colo/vendor engagements
- Strong Linux sysadmin skills: kernel drivers, RDMA tuning, performance analysis
- Schedulers/orchestration: Slurm, Kubernetes
- Virtualization: KVM, QEMU, libvirt
- Telemetry for predicting hardware failures
- High-speed fabrics: InfiniBand, RoCEv2 Ethernet

**Apply:** https://hotfix.jobs/jobs/hpc-gpu-cluster-architect-at-sfcompute-357dc6b1-3fc0-4c6b-9168-bd1081710761