Staff Machine Learning Operations Engineer

298k – 351k | New York, NY | Hybrid | 7+ YOE | Jun 16

Summary

Staff MLOps Engineer responsible for the reliability, performance, and cost-efficiency of production ML systems. Architects the ML platform, including the feature store, model registry, and automated CI/CD pipelines.

About the role

What you will do

Own the reliability, performance, functionality, and cost-efficiency of Garner's production ML systems, including establishing SLOs, observability, and on-call responsibilities.
Architect Garner's ML platform, including the required data infrastructure (feature store, model registry, and CI/CD for models) and standardized service patterns.
Implement ML-specific CI/CD pipelines: Transition our deployment process from manual notebook hand-offs to automated, PR-driven CI/CD workflows that run automated data quality checks and statistical model validation before deployment.
Drive down cost and latency through improved architecture, hardware choices, and model optimization as appropriate.
Lay the foundation for a future Garner MLOps team, including workflows, standards, and KPIs that enable rapid teammate onboarding and help stakeholders and teammates quickly assess the health of the team’s products, so engineers can focus on the areas where issues reside.
Establish Drift Monitoring: Design and implement automated data drift and concept drift monitoring systems that alert the team when models degrade, laying the groundwork for future Continuous Training (CT) architectures.

The ideal candidate has

7+ years of software engineering experience, with significant time spent operating ML or data-intensive systems in production at scale.
Deep experience with the modern ML production stack: model serving (e.g., SageMaker, Triton, or equivalent), feature stores, model registries, and CI/CD for ML.
Strong infrastructure and platform engineering fundamentals: Kubernetes, containerization, cloud (AWS preferred), Terraform/IaC, observability, and incident response.
Experience designing ML platforms or significant components of one (not only consuming SaaS), and the judgment to know when to build vs. buy.
Strong collaboration with ML, data, and platform engineers, data scientists, and product engineering teams, with the ability to set technical direction as the most senior MLOps voice in the org.
Healthcare, regulated-data, or other high-stakes production ML experience is a plus but not required.
A desire to be a part of a high-performing, mission-driven team that operates with intense urgency, a strong sense of individual accountability, and a commitment to authentic feedback.

Technologies we use

Python, Kubernetes, AWS, SageMaker, Terraform, S3, Snowflake, Airflow, Datadog

Skills

Python, Kubernetes, AWS, SageMaker, Terraform, S3, Snowflake, Airflow, Datadog, Triton, Feature Stores, Model Registries, CI/CD for ML, Model Serving, Observability
