Senior Machine Learning Systems Engineer
Build large-scale ML experimentation and training orchestration platforms, including agentic AI execution systems, to accelerate Ads ML development at Reddit. Requires 5+ years of infrastructure experience and 2+ years of experience building production ML platforms.
What You’ll Do
- Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows.
- Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining.
- Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility.
- Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency.
- Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation.
- Design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows, including multi-agent orchestration, memory/context systems, and scalable workflow infrastructure.
What You Bring
- 5+ years in infrastructure/platform engineering or large-scale distributed systems.
- 2+ years of hands-on experience building and operating production ML infrastructure, developer SDKs, platform APIs, or self-service AI tooling.
- Experience building workflow orchestration systems, developer platforms, or large-scale automation frameworks.
- Experience with distributed data processing systems such as Spark, Flink, Ray, or equivalent technologies.
- Experience with modern orchestration and workflow technologies such as Kubeflow, Argo, Airflow, or similar frameworks.
- Experience building offline ML experimentation platforms, model registries, experiment tracking systems, or training orchestration frameworks.
- Experience building and operating agentic AI systems, including multi-agent orchestration, autonomous workflows, and agent communication/runtime frameworks (e.g., MCP, A2A, and similar orchestration systems), is a strong plus.
- Experience running end-to-end model development and iteration cycles at scale is a plus.