AI Infrastructure Engineer - Training Platform

Builds and scales high-performance training platforms for large-scale GPU clusters, architecting orchestration, scheduling, and observability for ML workloads. Requires 5+ years in infrastructure engineering with ML focus, Kubernetes expertise, and systems programming.

216k – 270k | San Francisco, CA | Seattle, WA | New York, NY | DevOps / SRE | Onsite | 5+ YOE

About the role

Responsibilities

Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
Design and implement scheduling primitives to optimize the lifecycle of training jobs.
Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
Work closely with Finance and Procurement teams to drive our capacity planning process.
Participate in our team’s on-call process to ensure the availability of our services.
Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.

Requirements

5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling.
Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
Proven ability to solve complex problems and work independently in fast-moving environments.

Nice to Haves

Experience with distributed training techniques such as DeepSpeed and FSDP.
Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
Experience with PyTorch.
Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning.

Compensation

Base salary range: $216,000 to $270,000 USD (varies by location, skills, and experience).
Equity and comprehensive benefits (health, dental, vision, retirement, PTO, learning stipend).

Skills

Kubernetes, Python, Go, Rust, C++, AWS, GCP, Terraform, Ray, Kueue, DeepSpeed, FSDP, CUDA, NCCL, PyTorch
