# AI Infra Engineer
**Company:** [Perplexity](https://hotfix.jobs/companies/perplexity)
**Location:** San Francisco, CA; Palo Alto, CA
**Salary:** $220K-$405K
**Experience:** 3+ years
**Skills:** Kubernetes, Slurm, Python, C++, PyTorch, AWS, Terraform, Ansible, CUDA, TensorFlow
**Posted:** 2025-07-15
> Builds, deploys, and optimizes large-scale Kubernetes and Slurm clusters for AI training and inference. Requires 3+ years in ML infrastructure, expert-level Kubernetes and Slurm administration, Python and C++ proficiency, and PyTorch experience.
## Job Description
## Responsibilities
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for distributed training of large language models
- Develop robust APIs and orchestration systems for both training pipelines and inference services
- Implement resource scheduling and job management systems across heterogeneous compute environments
- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
- Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
- Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
- Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

## Qualifications
**Required:**
- Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
- Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
- Experience deploying and managing distributed training systems at scale
- Deep understanding of container orchestration and distributed systems architecture
- High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query and Grouped-Query Attention, distributed training strategies)
- Experience managing GPU clusters and optimizing compute resource utilization

**Required Skills:**
- Expert-level Kubernetes administration and YAML configuration management
- Proficiency with Slurm job scheduling, resource management, and cluster configuration
- Python and C++ programming with a focus on systems and infrastructure automation
- Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
- Strong understanding of networking, storage, and compute resource management for ML workloads
- Experience developing APIs and managing distributed systems for both batch and real-time workloads
- Solid debugging and monitoring skills with expertise in observability tools for containerized environments

**Preferred Skills:**
- Experience with Kubernetes operators and custom controllers for ML workloads
- Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
- Familiarity with GPU cluster management and CUDA optimization
- Experience with other ML frameworks such as TensorFlow, or with distributed training libraries
- Background in HPC environments, parallel computing, and high-performance networking
- Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
- Experience with container registries, image optimization, and multi-stage builds for ML workloads

## Required Experience
- Demonstrated experience managing large-scale Kubernetes deployments in production environments
- Proven track record with Slurm cluster administration and HPC workload management
- Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
- Experience supporting both long-running training jobs and high-availability inference services
- Ideally, 3-5 years of relevant experience in ML systems deployment, with a specific focus on cluster orchestration and resource management

**Apply:** https://hotfix.jobs/jobs/ai-infra-engineer-at-perplexity-e8b47a29-b977-44af-827e-10077c17d288
**Canonical:** https://hotfix.jobs/jobs/ai-infra-engineer-at-perplexity-e8b47a29-b977-44af-827e-10077c17d288