Senior Machine Learning Systems Engineer

217k – 303k | United States | ML Engineering | Remote | 5+ YOE
Summary

Build large-scale ML experimentation and training orchestration platforms, including agentic AI execution systems, to accelerate Ads ML development at Reddit. Requires 5+ years of infrastructure experience and 2+ years of building production ML platforms.

About the role

What You’ll Do

  • Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows.
  • Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining.
  • Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility.
  • Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency.
  • Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation.
  • Design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows, including multi-agent orchestration, memory/context systems, and scalable workflow infrastructure.

What You Bring

  • 5+ years in infrastructure/platform engineering or large-scale distributed systems.
  • 2+ years of hands-on experience building and operating production ML infrastructure, developer SDKs, platform APIs, or self-service AI tooling.
  • Experience building workflow orchestration systems, developer platforms, or large-scale automation frameworks.
  • Experience with distributed data processing systems such as Spark, Flink, Ray, or equivalent technologies.
  • Experience with modern orchestration and workflow technologies such as Kubeflow, Argo, Airflow, or similar frameworks.
  • Experience building offline ML experimentation platforms, model registries, experiment tracking systems, or training orchestration frameworks.
  • Experience building and operating agentic AI systems, including multi-agent orchestration, autonomous workflows, and agent communication/runtime frameworks (e.g., MCP, A2A, and orchestration systems), is a strong plus.
  • Experience running end-to-end model development and iteration cycles at scale is a plus.
Skills
Spark, Flink, Ray, Kubeflow, Argo, Airflow, ML experimentation platforms, model registries, experiment tracking, training orchestration, agentic AI systems, multi-agent orchestration