Senior Software Engineer — LLM Post-Training Platform

200k – 288k | Bellevue, WA | Onsite | 5+ YOE
Summary

Build and scale Snowflake's Cortex Training LLM post-training platform, handling distributed GPU scheduling, orchestration, and productionizing research for enterprise-scale model adaptation.

About the role

Responsibilities

  • Design and build across the full stack — from the public training APIs and SDK through the control plane to the GPU data plane.
  • Scale the distributed systems that make GPU compute serverless — multi-tenant scheduling, placement, and capacity-aware routing across regional GPU pools, with fault tolerance built in.
  • Drive end-to-end performance at scale — keep the training, inference, and RL loops fast and the data plane responsive under heavy concurrent load, with GPUs kept saturated.
  • Productionize research building blocks — partner with Snowflake Research to turn state-of-the-art training and inference techniques into reliable, composable components customers can run at enterprise scale.

Requirements

  • 5+ years building and shipping production ML systems
  • Strong distributed systems and infrastructure foundation — designing scalable, fault-tolerant services and operating them on Kubernetes in production
  • Familiarity with GPU and LLM infrastructure — e.g., PyTorch, DeepSpeed/FSDP, Ray, CUDA/NCCL, vLLM; able to debug across the data, infrastructure, and GPU layers
  • Demonstrated ability to harden complex systems for reliability, throughput, and cost efficiency
  • BS in Computer Science or a related field (MS/PhD a plus)

Nice-to-Haves

  • Hands-on LLM post-training / modeling experience — the strongest candidates pair deep infra skills with real post-training intuition
Skills
PyTorch, DeepSpeed, FSDP, Ray, CUDA, NCCL, vLLM, Kubernetes, Distributed Systems, LLM Post-Training