Software Engineer, ML Infrastructure
160k – 241k | Mountain View, CA | ML Engineering | Onsite | 3+ YOE
Summary
Build and scale the ML infrastructure platform for autonomous vehicle development, focusing on automated resource provisioning, high-performance workload scheduling, and petabyte-scale data processing pipelines.
About the role
Responsibilities
- Build and evolve the core ML infrastructure platform providing researchers and engineers seamless access to compute and data resources
- Scale automated Infrastructure-as-Code (IaC) pipelines to manage thousands of GPU/CPU nodes across diverse environments
- Design and optimize workload orchestration to maximize hardware utilization, minimize job wait times, and handle massive-scale distributed training
- Design robust pipelines for extraction and transformation of petabyte-scale sensor and telemetry data into ML-ready formats
- Implement feature caching and storage solutions that reduce redundant computation and provide low-latency access to pre-computed features
- Contribute to a unified ML platform that abstracts complex cloud infrastructure for end-users
Requirements
- 3+ years of professional experience in ML Infrastructure, Backend Platform Engineering, or Distributed Systems
- Deep familiarity with modern Infrastructure-as-Code and provisioning tools such as Terraform, Pulumi, or Crossplane
- Hands-on experience building or managing large-scale orchestrators for compute-heavy workloads (e.g., Kubernetes, KubeRay, Ray, Slurm, or Volcano)
- Proficiency in at least one distributed processing framework, such as Apache Spark or Apache Beam, for large-scale data extraction and transformation
- Experience implementing or maintaining feature stores and caching layers (e.g., Feast, Hopsworks, or Redis-based custom caching)
- Strong understanding of distributed systems, networking, and storage bottlenecks in the context of high-performance computing
Nice-to-Haves
- Active contributor to open-source projects in the MLOps or Cloud-Native ecosystem (e.g., CNCF, Ray, or Kubeflow communities)
- Experience with high-performance storage systems (e.g., Lustre, Ceph, or specialized NVMe caching) for ML data loading
- Knowledge of cost-optimization strategies for large-scale GPU clusters in public clouds (AWS, GCP, or Azure)
Skills
Terraform, Pulumi, Crossplane, Kubernetes, KubeRay, Ray, Slurm, Volcano, Apache Spark, Apache Beam, Feast, Hopsworks, Redis, Infrastructure as Code, Distributed Systems