What You’ll Do
- Vision and Roadmap: Develop and execute the long-term vision and roadmap for the MLOps team to support ML development and deployment needs across business units. Successfully manage the tension between short-term tactical deliveries and long-term architectural transformation for future growth.
- Team Management: Lead and mentor a team of 6-7+ high-performing engineers. Strategically allocate resources to manage support for existing services while executing key strategic initiatives.
- Cross-Functional Collaboration: Partner with leaders across machine learning, data science, product engineering, and infrastructure to proactively identify pain points, address bottlenecks, and facilitate the deployment of new solutions.
- Foundation Model Readiness: Architect the compute and storage pipelines required for ML Engineers to manage millions of slides and complex derived artifacts without data fragmentation or synchronization latency.
- Inference Modernization: Modernize the AI Product inference stack to support 5-10x growth of AI runs across global deployments.
- System Observability: Collaborate with Site Reliability Engineering (SRE) to establish comprehensive metrics covering compute under-utilization, network bottlenecks, and granular attribution of cost and turnaround time.
- Technology Refresh: Conduct "Build vs. Buy" assessments and lead "Stack Refresh" audits that benchmark our proprietary tools against best-in-class commercial and open-source alternatives to meet our future needs.
What You Bring
Required:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 2-3+ years of experience managing engineering team(s), with a focus on building production-grade frameworks for MLOps or ML Infrastructure.
- Deep technical expertise with ML workloads on Kubernetes, cloud computing platforms (AWS, GCP, Azure), workflow orchestration (Airflow, Kubeflow), and DevOps principles and infrastructure-as-code (Helm, Terraform).
- Proven experience managing petabyte-scale datasets and high-throughput production inference pipelines.
- Strong software engineering skills in complex, multi-language systems and experience with scalable service architecture.
- Experience using AI assistants (e.g. Copilot, Cursor, Claude) across the platform development lifecycle.
Nice-to-Haves:
- Exposure to ML frameworks like PyTorch or scikit-learn.
- Experience with large-scale data processing frameworks (e.g. Spark, Hive, Databricks, Amazon EMR).
- Expertise in MLOps principles, including model lifecycle management, feature stores, model monitoring, and CI/CD for ML.
- Familiarity with security and compliance best practices in ML systems.
Compensation
Annual Pay Range: $181,500 - $278,300
Not Overtime Eligible
Eligible for Equity