HPC/GPU Cluster Architect
Designs, architects, and scales production GPU/HPC clusters globally. Debugs hardware and software issues, automates operations, and mentors junior engineers. Requires 5+ years of experience and a hybrid presence in SF.
Builds, deploys, and optimizes large-scale Kubernetes and Slurm clusters for AI training and inference. Requires 3+ years in ML infrastructure, expert-level Kubernetes/Slurm administration, and experience with Python/C++ and PyTorch.
Required:
Required Skills:
Preferred Skills:
Builds and owns platform infrastructure from scratch, including CI/CD, Terraform, AWS services, monitoring, SSO, and APIs, for a scaling sales-coaching startup. Requires deep experience in cloud infrastructure, IaC, containers, databases, and observability.
Leads a platform engineering team that architects scalable infrastructure and delivers high-impact initiatives such as developer tooling and reliability systems, while mentoring engineers and staying hands-on with code. Requires 3+ years managing platform teams and deep cloud and CI/CD expertise.
Builds and scales enterprise GenAI infrastructure across multiple cloud providers (AWS, Azure, GCP), implementing integrations and architecting systems for regulated industries. Requires 4+ years of experience and proficiency in Python/JS/SQL, Kubernetes, and AI technologies such as LLMs.