HPC/GPU Cluster Architect

Designs, architects, and scales production GPU/HPC clusters globally. Debugs hardware/software issues, automates operations, and mentors juniors. Requires 5+ years experience and hybrid SF presence.

220k – 300k · San Francisco, CA · DevOps / SRE · Hybrid · 5+ YOE

About the role

Responsibilities

  • Architect and deploy new GPU/HPC clusters around the world
  • Keep clusters running smoothly
  • Participate in on-call rotation
  • Deploy new environments and fix issues
  • Lean into automation for deployments at scale
  • Mentor junior engineers and shape team culture

Requirements

  • 5+ years experience designing, architecting, and scaling HPC or GPU compute clusters in production
  • Deep understanding of server hardware: GPUs, NICs, PCIe, memory, thermals, power
  • Comfortable debugging performance/reliability across hardware, OS, drivers, networking
  • Experience automating fleet operations (provisioning, monitoring, remediation) with infrastructure-as-code
  • Ability to write strong operational documentation and runbooks
  • Open to working from the SF office 3-4 days per week and to domestic travel

Nice to Haves

  • Data center operations: power, cooling, colo/vendor engagements
  • Strong Linux sysadmin: kernel drivers, RDMA tuning, performance analysis
  • Schedulers/orchestration: Slurm, Kubernetes
  • Virtualization: KVM, QEMU, libvirt
  • Telemetry for predictive hardware failure
  • High-speed fabrics: InfiniBand, RoCEv2 Ethernet

Skills

GPU, HPC, Kubernetes, Slurm, InfiniBand, RDMA, Linux, PCIe, Infrastructure as Code, KVM

Similar roles

Software Engineer, Platform

Builds and owns platform infrastructure from scratch including CI/CD, Terraform, AWS services, monitoring, SSO, and APIs for a scaling sales coaching startup. Requires deep experience in cloud infra, IaC, containers, databases, and observability.

220k – 300k · New York, NY · DevOps / SRE · On-site · AWS · ECS

Engineering Manager - Platform

Leads the platform engineering team in architecting scalable infrastructure and delivering high-impact initiatives such as developer tooling and reliability systems, while mentoring engineers and staying hands-on with code. Requires 3+ years managing platform teams and deep cloud and CI/CD expertise.

220k – 260k · New York, NY +1 · DevOps / SRE · Hybrid · AWS · GCP

AI Infra Engineer

Builds, deploys, and optimizes large-scale Kubernetes and Slurm clusters for AI training and inference. Requires 3+ years in ML infrastructure, expert Kubernetes/Slurm admin, Python/C++, and PyTorch experience.

220k – 405k · San Francisco, CA +1 · DevOps / SRE · On-site · 3+ YOE · C++ · AWS

Infrastructure Software Engineer, Enterprise GenAI

Build and scale enterprise GenAI infrastructure across multi-cloud providers (AWS, Azure, GCP), implementing integrations and architecting systems for regulated industries. Requires 4+ years experience, proficiency in Python/JS/SQL, Kubernetes, and AI technologies like LLMs.

216k – 270k · San Francisco, CA +2 · DevOps / SRE · On-site · 4+ YOE · SQL · AWS

Software Engineer, Platform

Design and build foundational data platforms, cloud infrastructure, and orchestration systems supporting AI/ML products. Requires 3+ years backend experience with Kubernetes, Terraform, Docker, AWS, Temporal, MongoDB, and Postgres.

216k – 270k · San Francisco, CA +1 · DevOps / SRE · On-site · 3+ YOE · AWS · Docker