
Machine Learning Infrastructure Engineer

150k – 350k | Redwood City, CA | ML Engineering | Onsite | 4+ YOE
Summary

Designs, builds, and maintains ML training and serving infrastructure, providing support to research teams. Requires 4+ years in ML infrastructure, cloud platforms like Kubernetes and Google Cloud, and GPU experience.

About the role

Responsibilities

  • Provide infrastructure support to our ML research and product teams
  • Build tooling to diagnose cluster issues and hardware failures
  • Monitor deployments, manage experiments, and generally support our research
  • Maximize GPU allocation and utilization for both serving and training

Requirements

  • 4+ years of experience supporting infrastructure in an ML environment
  • Experience developing tools to diagnose ML infrastructure problems and failures
  • Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)
  • Experience working with GPUs

Nice to Have

  • Experience with large GPU clusters and high-performance computing/networking
  • Experience supporting large language model training
  • Experience with ML frameworks such as PyTorch, TensorFlow, or JAX
  • Experience with GPU kernel development

Skills

Kubernetes, Google Cloud, Compute Engine, Cloud Storage, GPUs, PyTorch, TensorFlow, JAX, GPU kernel development, large GPU clusters