Software Engineer, ML Infrastructure

160k – 241k · Mountain View, CA · ML Engineering · Onsite · 3+ YOE · Jun 16

Summary

Build and scale the ML infrastructure platform for autonomous vehicle development, focusing on automated resource provisioning, high-performance workload scheduling, and petabyte-scale data processing pipelines.

About the role

Responsibilities

Build and evolve the core ML infrastructure platform that gives researchers and engineers seamless access to compute and data resources
Scale automated Infrastructure-as-Code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments
Design and optimize workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training
Design robust pipelines for extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats
Implement robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features
Contribute to a unified ML platform that abstracts complex cloud infrastructure for end-users

Requirements

3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems
Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane
Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano)
Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation
Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching)
Strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing

Nice-to-Haves

Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities)
Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading
Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure)

Skills

Terraform, Pulumi, Crossplane, Kubernetes, KubeRay, Ray, Slurm, Volcano, Apache Spark, Apache Beam, Feast, Hopsworks, Redis, Infrastructure as Code, Distributed Systems
