Senior Machine Learning Operations Engineer
Build and operate production ML systems and platform components for healthcare technology, partnering with ML and data science teams on model deployment, observability, and reliability.
What you will do
- Help ensure the reliability, performance, functionality, and cost-efficiency of Garner's production ML systems, contributing to SLOs, observability, and on-call responsibilities.
- Build key components of Garner's ML platform, including core infrastructure (such as a feature store, model registry, and CI/CD for models) and standardized service patterns.
- Implement ML-specific CI/CD pipelines: Help transition our deployment process from manual notebook hand-offs to automated, PR-driven CI/CD workflows that include automated data quality checks and statistical model validation prior to deployment.
- Drive down cost and latency through improved architecture, hardware choices, and model optimization as appropriate.
- Contribute to the workflows, standards, and KPIs that support a growing MLOps function, helping teammates and stakeholders quickly assess the health of the team's products and focus on the areas that have issues.
- Help establish drift monitoring: Design and implement automated data drift and concept drift monitoring systems that alert the team when models degrade, laying the groundwork for future Continuous Training (CT) architectures.
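As an illustration of the data drift monitoring described above, here is a minimal sketch in Python. It is not Garner's implementation: the Population Stability Index (PSI) is one common drift statistic chosen here for illustration, the function names are hypothetical, and the 0.2 alert threshold is a widely used rule of thumb assumed for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare two samples of one feature using equal-width bins.

    Bin edges come from the reference (training-time) sample; values in
    the current (production) sample that fall outside that range are
    clamped into the first or last bin.
    """
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_alert(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # assumed rule-of-thumb cutoff, not from the posting
) -> bool:
    """Return True when the PSI between the samples exceeds the threshold."""
    return population_stability_index(reference, current) > threshold


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print("baseline vs. itself:", drift_alert(baseline, baseline))
    print("baseline vs. shifted:", drift_alert(baseline, shifted))
```

In a scheduled job (for example, an Airflow task), a check like this could run per feature and send an alert when it returns True. Concept drift would additionally require comparing predictions against observed outcomes.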
The ideal candidate has
- 5+ years of software engineering experience, with meaningful time spent operating ML or data-intensive systems in production.
- Hands-on experience with the modern ML production stack: model serving (e.g., SageMaker, Triton, or equivalent), feature stores, model registries, and CI/CD for ML.
- Strong infrastructure and platform engineering fundamentals: Kubernetes, containerization, cloud (AWS preferred), Terraform/IaC, observability, and incident response.
- Experience building ML platforms or significant components of one (not just consuming SaaS), with sound judgment around when to build vs. buy.
- Strong collaboration with ML, data, and platform engineers, data scientists, and product engineering teams, with the ability to lead projects and influence technical decisions.
- Healthcare, regulated-data, or other high-stakes production ML experience is a plus but not required.
- A desire to be a part of a high-performing, mission-driven team that operates with intense urgency, a strong sense of individual accountability, and a commitment to authentic feedback.
Technologies we use
Python, Kubernetes, AWS, SageMaker, Terraform, S3, Snowflake, Airflow, Datadog
Machine Learning Engineer - Embedded Insights
Drive ML initiatives from concept to production on the Embedded Insights team. Identify opportunities, build and deploy models using Plaid's financial datasets, and partner with product teams to deliver scalable customer-facing intelligence products.
Machine Learning Engineer
Advance Plaid’s foundation models by developing novel architectures, pretraining objectives, and fine-tuning strategies. Work across the full ML stack from data engineering to production serving and monitoring.
Staff Machine Learning Engineer, Notifications Relevance
Technical leader for Reddit's Notifications Relevance ML systems, driving large-scale recommendation systems spanning retrieval, ranking, budget optimization, and LLM-powered experiences.
Member of Technical Staff — Audio and Voice AI
Design, build, and deploy production-grade voice and audio AI systems including real-time agents and speech-driven workflows for financial operations. Requires 5+ years engineering experience with focus on applied AI/ML or speech systems.
Senior/Staff Software Engineer - Planner Frameworks Pipeline
Build and optimize large-scale simulation and ML training pipelines on Ray and Kubernetes to validate autonomous vehicle behavior. Requires 8+ years experience and strong distributed systems background.