What you will do

Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance — scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
Improve end-to-end training throughput and platform efficiency by optimizing data access patterns, caching, and removing bottlenecks in storage, network, and resource contention
Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
Evaluate, integrate and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What you will need

Strong proficiency in Python or Go; C++ is a plus
Track record of designing and building scalable, maintainable systems and services
Experience operating production services end-to-end: APIs, reliability practices, observability
Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution

Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
Track record of optimizing resource usage and performance in distributed environments