Software Engineer, ML Platform

Builds foundational ML platform infrastructure including model serving pipelines, GPU scheduling systems, and CI/CD for large-scale multimodal AI models. Requires 5+ years in distributed systems with expertise in Python, Kubernetes, and AWS.

188k – 395k · Palo Alto, CA · ML Engineering · Hybrid · 5+ YOE

About the role

What You'll Do

  • Architect end-to-end model serving pipelines and integrate new model architectures from our research team into our core, high-throughput inference engine.
  • Build robust and sophisticated scheduling systems to manage jobs based on cluster availability and user priority, ensuring we optimally leverage thousands of expensive GPU resources.
  • Design and implement dynamic, traffic-based systems for hotswapping models on our GPU workers to maximize fleet efficiency and meet product SLOs.
  • Own the end-to-end CI/CD pipelines, including creating a resilient artifact store to manage all model checkpoints across multiple versions and providers.
  • Develop and maintain user-friendly APIs and interaction patterns that empower our product and research teams to ship groundbreaking features at high velocity.
  • Manage and optimize our complex inference workloads at scale, operating across multiple clusters and hardware providers.

Who You Are

We are looking for a world-class builder with a proven history of creating and managing large-scale, high-performance systems. The following qualifications are non-negotiable:

  • 5+ years of professional engineering experience with deep, hands-on proficiency in Python and complex distributed systems architecture.
  • Extensive, practical experience building and managing systems at scale, specifically with queues, scheduling, traffic-control, and fleet management.
  • Deep expertise in our core infrastructure stack: Linux, Docker, and Kubernetes.
  • Strong experience with Redis, S3-compatible storage, and public cloud platforms (AWS).

What Sets You Apart (Bonus Points)

  • Experience with high-performance, large-scale ML systems (managing >100 GPUs).
  • Deep familiarity with PyTorch and CUDA.
  • Experience with modern networking stacks and GPU interconnects, including RDMA (RoCE, InfiniBand) and NVLink.
  • Familiarity with FFmpeg and multimedia processing pipelines.

Compensation

The base pay range for this role is $187,500 – $395,000 per year.

Skills

Python, Kubernetes, Docker, Linux, Redis, AWS, PyTorch, CUDA, S3, RDMA
