Senior Software Engineer - ML Infrastructure

153k – 222k · Sunnyvale, CA · Onsite · 3+ YOE · Apr 21

Summary

Builds distributed ML infrastructure, including GPU training, end-to-end pipelines, and deployment platforms. Requires 3+ years of experience with production ML systems, strong software engineering skills, and familiarity with open-source tools.

About the role

Responsibilities

Design and implement distributed, cloud-based GPU approaches for training and evaluating deep learning models
Build end-to-end machine learning pipelines and integrate them into core product workflows
Encourage change, especially in support of ML engineering best practices, and maintain a high standard of excellence
Collaborate with engineers across the entire company to solve complex data problems at scale

Requirements

Bachelor's degree in Computer Science, Software Engineering, or equivalent
3+ years of professional experience
Experience with building software components to address production, full-stack machine learning challenges
Opinions about building a company-wide platform for ML training, evaluation, and deployment
Knowledge of the open source landscape with judgment on when to choose open source versus build in-house
Excellent analytical and problem-solving skills

Nice to Have

Experience with developing, running, and managing orchestration systems like Airflow and Flyte that non-engineers can use to build data pipelines
Experience with ML modeling frameworks (PyTorch, TensorFlow, etc.) and model serving platforms (TorchServe, TensorFlow Serving, NVIDIA Triton Inference Server, etc.)

Compensation

Base salary range: $153,000 - $222,000 USD annually
Equity, comprehensive health/dental/vision insurance, 401k with employer match, learning/wellness stipends, paid time off

Skills

PyTorch, TensorFlow, Airflow, Flyte, Kubernetes, GPU, Distributed Training, Machine Learning Pipelines, TorchServe, NVIDIA Triton
