AI Infrastructure Engineer - Training Platform

Builds and scales high-performance training platforms for large-scale GPU clusters, architecting orchestration, scheduling, and observability for ML workloads. Requires 5+ years in infrastructure engineering with an ML focus, Kubernetes expertise, and systems programming skills.

216k – 270k | San Francisco, CA | Seattle, WA | New York, NY | DevOps / SRE | Onsite | 5+ YOE

About the role

Responsibilities

  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery.
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs.
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures.
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability.
  • Work closely with Finance and Procurement teams to drive our capacity planning process.
  • Participate in our team’s on-call rotation to ensure the availability of our services.
  • Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.

Requirements

  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes).
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++).
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling.
  • Experience with distributed training infrastructure, such as EFA and InfiniBand networking and topology-aware scheduling.
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput.
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware.
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform).
  • Proven ability to solve complex problems and work independently in fast-moving environments.

Nice to Haves

  • Experience with distributed training frameworks and techniques such as DeepSpeed and FSDP.
  • Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
  • Experience with PyTorch.
  • Familiarity with reinforcement learning and post-training algorithms such as GRPO.

Compensation

  • Base salary range: $216,000–$270,000 USD (varies by location, skills, and experience).
  • Equity and comprehensive benefits (health, dental, vision, retirement, PTO, learning stipend).

Skills

Kubernetes, Python, Go, Rust, C++, AWS, GCP, Terraform, Ray, Kueue, DeepSpeed, FSDP, CUDA, NCCL, PyTorch
