
Software Engineer, ML Infrastructure

160k – 241k | Mountain View, CA | ML Engineering | Onsite | 3+ YOE
Summary

Build and scale an ML infrastructure platform for autonomous vehicle development, focusing on automated resource provisioning, high-performance workload scheduling, and petabyte-scale data processing pipelines.

About the role

Responsibilities

  • Build and evolve the core ML infrastructure platform providing researchers and engineers seamless access to compute and data resources
  • Scale automated Infrastructure-as-Code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments
  • Design and optimize workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training
  • Design robust pipelines for extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats
  • Implement robust feature caching and storage solutions to reduce redundant computations and ensure low-latency access to pre-computed features
  • Contribute to a unified ML platform that abstracts complex cloud infrastructure for end-users

Requirements

  • 3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems
  • Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane
  • Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano)
  • Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation
  • Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching)
  • Strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing

Nice-to-Haves

  • Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities)
  • Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading
  • Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure)
Skills

Terraform, Pulumi, Crossplane, Kubernetes, KubeRay, Ray, Slurm, Volcano, Apache Spark, Apache Beam, Feast, Hopsworks, Redis, Infrastructure as Code, Distributed Systems