Staff Machine Learning Operations Engineer
Staff MLOps Engineer responsible for the reliability, performance, and cost-efficiency of production ML systems. Architect the ML platform, including feature store, model registry, and automated CI/CD pipelines.
What you will do
- Own the reliability, performance, functionality, and cost-efficiency of Garner's production ML systems, including establishing SLOs, observability, and on-call responsibilities.
- Architect Garner's ML platform, including the required data infrastructure (feature store, model registry, and CI/CD for models) and standardized service patterns.
- Implement ML-specific CI/CD pipelines: Transition our deployment process from manual notebook hand-offs to automated, PR-driven CI/CD workflows that include automated data quality checks and statistical model validation prior to deployment.
- Drive down cost and latency through improved architecture, hardware choices, and model optimization as appropriate.
- Lay the foundation for a future Garner MLOps team, including workflows, standards, and KPIs that enable rapid teammate onboarding and help stakeholders and teammates quickly assess the health of the team’s products, so engineers can focus on the areas where issues reside.
- Establish Drift Monitoring: Design and implement automated data drift and concept drift monitoring systems that alert the team when models degrade, laying the groundwork for future Continuous Training (CT) architectures.
The ideal candidate has
- 7+ years of software engineering experience, with significant time spent operating ML or data-intensive systems in production at scale.
- Deep experience with the modern ML production stack: model serving (e.g., SageMaker, Triton, or equivalent), feature stores, model registries, and CI/CD for ML.
- Strong infrastructure and platform engineering fundamentals: Kubernetes, containerization, cloud (AWS preferred), Terraform/IaC, observability, and incident response.
- Experience designing ML platforms or significant components of one (not only consuming SaaS), and the judgment to know when to build vs. buy.
- Strong collaboration with ML, data, and platform engineers, data scientists, and product engineering teams, with the ability to set technical direction as the most senior MLOps voice in the org.
- Healthcare, regulated-data, or other high-stakes production ML experience is a plus but not required.
- A desire to be a part of a high-performing, mission-driven team that operates with intense urgency, a strong sense of individual accountability, and a commitment to authentic feedback.
Technologies we use
Python, Kubernetes, AWS, SageMaker, Terraform, S3, Snowflake, Airflow, Datadog
Research Engineer/Research Scientist, Audio
Research Engineer/Scientist role focused on advancing audio capabilities in large language models, including training speech/audio models, developing codecs, and building conversational AI systems. Requires strong experience in audio ML research and engineering with JAX or PyTorch.
Staff Applied Scientist
Build and own algorithmic systems that evaluate providers, make recommendations, and optimize healthcare outcomes for cost, quality, and access. Requires 3+ years shipping data-driven algorithms to production.
Senior Member of Research Staff, Optimization
Lead optimization research applying large-scale constrained optimization and ML to real-time trading decisions. Requires 5-10+ years of experience, a strong math/ML background, production coding skills, and PhD-level coursework.
Senior Machine Learning Operations Engineer
Build and operate production ML systems and platform components for healthcare technology, partnering with ML and data science teams on model deployment, observability, and reliability.
Member of Research Staff, Optimization
Conduct optimization research and implement large-scale constrained optimization models that drive real-time trading decisions, working across the full research lifecycle from theory to production. Requires PhD-level coursework and strong applied research background in optimization.