Senior Software Engineer, ML Infrastructure

194k – 291k · Mountain View, CA · ML Engineering · Onsite · 4+ YOE

Summary

Build and scale the ML infrastructure platform for autonomous vehicle model development, focusing on automated resource provisioning, high-performance workload scheduling, and petabyte-scale data processing pipelines.

About the role

Responsibilities

  • Build and evolve the core ML infrastructure platform providing researchers and engineers seamless access to compute and data resources
  • Scale automated Infrastructure-as-Code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments
  • Design and optimize workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training
  • Design robust pipelines for extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats
  • Implement robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features
  • Contribute to a unified ML platform that abstracts complex cloud infrastructure for end-users

Requirements

  • 4+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems
  • Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane
  • Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano)
  • Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation
  • Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching)
  • Strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing

Nice-to-Haves

  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities)
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure)

Skills

Terraform, Pulumi, Crossplane, Kubernetes, Ray, Slurm, Apache Spark, Apache Beam, Feast, Hopsworks, Redis, AWS, GCP, Azure