Machine Learning Engineer (AI Platform Lead)

140k – 180k · United States · Remote · 5+ YOE

Summary

Build and scale ML compute infrastructure and distributed training pipelines for foundation models. Optimize GPU/CPU efficiency and data throughput for large-scale model training and inference.

About the role

Essential Responsibilities

  • Accountable for Artera’s ML compute infrastructure, including scaling up Artera’s foundation model development by building distributed training infrastructure and developer libraries.
  • Build and evolve the core libraries used by AI scientists to develop, launch, and monitor AI products.
  • Work with model developers to optimize GPU and CPU efficiency and data throughput for large-scale foundation model training and downstream model training runs.
  • Optimize Artera’s ability to store and serve terabytes of digital pathology data efficiently for large-scale training.
  • Ensure that Artera’s observability infrastructure gives a clear picture of where performance can be further optimized across our models.

Experience Requirements

  • 5+ years of industry software engineering experience
  • 4+ years of industry experience using one of PyTorch, TensorFlow, or JAX in Python
  • 3+ years of industry experience building with AWS, Docker, and Kubernetes
  • 1+ years of industry experience optimizing large-scale, high data-throughput, distributed machine learning training pipelines

Desired

  • Experience using ML orchestration frameworks such as Flyte, Ray, Kubeflow, Metaflow, MLflow, Dagster, Argo Workflows, or Prefect
  • Experience using Terraform and SQLAlchemy
  • Experience in multi-node and multi-GPU training
  • Experience deploying and maintaining infrastructure for machine learning training and production inference
  • Familiarity with TorchScript, ONNX Runtime, DeepSpeed, AWS Neuron, or similar approaches to inference optimization

Skills

PyTorch, TensorFlow, JAX, Python, AWS, Docker, Kubernetes, Flyte, Ray, Kubeflow, Metaflow, MLflow, Dagster, Argo Workflows, Prefect

Similar roles at this salary range
Together AI

Systems Research Engineer Intern - GPU Programming

Intern developing and optimizing GPU-accelerated kernels for ML/AI applications. Requires a strong GPU programming background (CUDA/Triton) and knowledge of performance optimization.

121k – 131k · San Francisco, CA · ML Engineering · On-site · Entry level · CUDA · Triton
Together AI

Research Intern, Inference

Research intern on the Inference team building efficient serving systems for large foundation models. Focus on distributed inference, compiler-aware optimization, and novel inference-time strategies.

121k – 131k · San Francisco, CA · ML Engineering · On-site · Entry level · JAX · CUDA
Pinterest

Machine Learning Engineer II, Computer Vision Applied Science

Build and fine-tune vision-centric VLMs and generative models using Pinterest's visual-text datasets. Requires 2+ years of industry computer vision experience and an M.S. or Ph.D.

139k – 286k · San Francisco, CA · ML Engineering · Remote · 2+ YOE · LLMs · RLHF
Mariana Minerals

Staff Machine Learning Engineer

Staff ML Engineer setting technical direction for autonomous mineral refining using reinforcement learning and simulation. Owns modeling, validation, and deployment of control systems on live industrial equipment.

160k – 200k · Ann Arbor, MI · ML Engineering · On-site · 8+ YOE · Simulation · Digital Twins
Mariana Minerals

Machine Learning Engineer

Build and deploy reinforcement learning models to autonomously control mineral refining facilities, optimizing recovery rates, energy use, and uptime in real operating plants.

120k – 160k · Ann Arbor, MI +2 · ML Engineering · On-site · Entry level · Python · Deep Learning