ML Platform Engineer

Austin, TX | ML Engineering | Onsite
Summary

Builds and scales an ML compute platform on Kubernetes, using Argo Workflows and Ray for distributed training, orchestration, and resource governance. Optimizes platform performance, debugs workload issues, and integrates tooling for ML teams at scale. Requires deep Kubernetes, systems, and programming expertise.

About the role

What you will do

  • Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
  • Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution and multi-tenant resource governance: scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
  • Improve end-to-end training throughput and platform efficiency by optimizing data access patterns and caching, and by removing storage, network, and resource-contention bottlenecks
  • Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
  • Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What you will need

  • Strong proficiency in Python or Go; C++ is a plus
  • Track record of designing and building scalable, maintainable systems and services
  • Experience operating production services end-to-end: APIs, reliability practices, observability
  • Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
  • Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
  • Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution

Nice to have

  • Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
  • Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
  • Track record of optimizing resource usage and performance in distributed environments
Skills

Kubernetes, Python, Go, Argo Workflows, Ray, Linux, MLflow, observability, distributed training, GPU scheduling