AI Infra Engineer

Builds, deploys, and optimizes large-scale Kubernetes and Slurm clusters for AI training and inference. Requires 3+ years in ML infrastructure, expert Kubernetes/Slurm admin, Python/C++, and PyTorch experience.

220k – 405k | San Francisco, CA | Palo Alto, CA | DevOps / SRE | Onsite | 3+ YOE

About the role

Responsibilities

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Qualifications

Required:

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (multi-head attention, multi-query and grouped-query attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization

Required Skills:

  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments

Preferred Skills:

  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high-performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

Required Experience:

  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

Skills

Kubernetes, Slurm, Python, C++, PyTorch, AWS, Terraform, Ansible, CUDA, TensorFlow
