Staff Machine Learning Systems Engineer

230k – 322kUnited StatesRemote8+ YOEMar 18

Summary

Leads development of large-scale ML platforms, focusing on MLOps, graph ML infrastructure, performance optimization, and distributed training pipelines. Requires 8+ years in ML infrastructure with expertise in Python, PyTorch, Kubernetes, Ray, and cloud tools.

About the role

What You’ll Do:

Design end-to-end model lifecycle patterns (MLOps) to boost velocity of development for ML engineers, including data preparation, model management, experiment tracking, and more
Zero-to-one development and support of a graph ML codebase and platform that abstracts away common patterns and enables greater model scalability and iteration
Collaborate with ML engineers on performance tuning, including improving model training time, efficiency, and GPU training costs in a large, distributed ML training environment
Optimize batch data processing within a data warehouse and with tools such as Apache Beam, Apache Spark, Ray Data, and more
Architect pipelines to build and maintain massive graph data structures on the order of billions of nodes and tens of billions of edges

Who You Might Be:

8+ years of experience in ML infrastructure, including model training and model deployments
Hands-on experience with ML optimization, including memory and GPU profiling
Deep experience with cloud-based technologies for supporting an ML platform, including tools like GCP BigQuery, Google Cloud Storage, infrastructure-as-code (Terraform), and more
Hands-on experience administering and integrating MLOps tools for experiment tracking, model serving, and model registries (e.g. MLflow or Wandb)
Proficiency with the common programming languages and frameworks of ML, such as Python, PyTorch, Tensorflow, etc.
Deep experience working with distributed training frameworks, including Ray and Kubernetes
Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
Strong organizational & communication skills
Experience working with graph databases (Neo4j, JanusGraph, TigerGraph) is a big plus
Experience working with graph neural networks (GNNs) and associated graph ML frameworks (PyTorch Geometric, Deep Graph Library) is a big plus

Skills

PythonPyTorchTensorFlowKubernetesRayMLflowWandbTerraformApache SparkApache BeamGCP BigQueryGoogle Cloud StoragePyTorch GeometricDeep Graph LibraryNeo4j

Similar roles at this salary range

All ML Engineering jobs →

Databricks

Jun 8

Staff Software Engineer, AI Runtime

Staff Software Engineer building and scaling Databricks' managed large-scale GPU training platform (AIR). Focus on distributed training performance, scheduling, fault tolerance, and developer experience for thousands of accelerators.

190k – 265kMountain View, CA +1ML EngineeringOn-siteFSDPRoCE

Airbnb

Jun 8

Senior Staff Machine Learning Engineer, Communication & Connectivity

Lead ML architecture and implementation for Airbnb's Messaging & Notifications, building recommendation engines, ranking systems, and LLM-powered experiences while mentoring engineers.

244k – 305kUnited StatesML EngineeringRemotePythonAI Systems

Traba

Jun 8

Staff Software Engineer

Founding Staff Applied Agent Engineer to architect and lead Traba's agentic platform, building production LLM/agent systems that integrate with customer WMS/TMS/ERP and drive industrial operations. Requires 7+ years engineering experience with 2+ years building production agent systems.

240k – 300kNew York, NY +1ML EngineeringOn-siteLLMKafka

Traba

Jun 8

Senior Software Engineer

Founding Senior Applied Agent Engineer building production LLM agent systems that automate supply chain workflows. Requires 5+ years engineering experience with 1+ year shipping LLM/agent features, strong Python/TypeScript skills, and hands-on agent stack experience.

200k – 240kNew York, NY +1ML EngineeringOn-sitePythonNode.js

Cribl

Jun 7

Staff Software Engineer, Cribl AI

Staff-level AI/ML engineer building and productionizing generative AI features across backend and frontend for Cribl's observability platform. Requires 6+ years experience, AI/ML and MLOps background, and TypeScript/JavaScript proficiency.

225k – 265kUnited StatesML EngineeringRemoteLLMsReact

Apply