HPC/GPU Cluster Architect

Designs, architects, and scales production GPU/HPC clusters globally. Debugs hardware and software issues, automates operations, and mentors junior engineers. Requires 5+ years of experience and a hybrid presence in San Francisco.

220k – 300k | San Francisco, CA | DevOps / SRE | Hybrid | 5+ YOE

About the role

Responsibilities

Architect and deploy new GPU/HPC clusters around the world
Keep clusters running smoothly
Participate in on-call rotation
Deploy new environments and fix issues
Lean into automation for deployments at scale
Mentor junior engineers and shape team culture

Requirements

5+ years of experience designing, architecting, and scaling HPC or GPU compute clusters in production
Deep understanding of server hardware: GPUs, NICs, PCIe, memory, thermals, power
Comfortable debugging performance and reliability issues across hardware, OS, drivers, and networking
Experience automating fleet operations (provisioning, monitoring, remediation) with infrastructure-as-code
Ability to write strong operational documentation and runbooks
Open to working from the SF office 3-4 days/week and to domestic travel

Nice to Haves

Data center operations: power, cooling, colo/vendor engagements
Strong Linux sysadmin: kernel drivers, RDMA tuning, performance analysis
Schedulers/orchestration: Slurm, Kubernetes
Virtualization: KVM, QEMU, libvirt
Telemetry for predictive hardware failure
High-speed fabrics: InfiniBand, RoCEv2 Ethernet

Skills

GPU, HPC, Kubernetes, Slurm, InfiniBand, RDMA, Linux, PCIe, Infrastructure as Code, KVM
