Post-Training Research Engineer

Build in-house tooling for post-training custom ML models using advanced techniques like RL and finetuning. Requires deep expertise in transformer training, PyTorch distributed systems, parallelism strategies, GPU performance optimization, and HPC platforms.

200k – 275k | San Francisco, CA | ML Engineering | Hybrid

About the role

Responsibilities

  • Build in-house tooling to support post-training of custom models, including reinforcement learning, supervised finetuning, and in-house research techniques.
  • Train a wide spectrum of model architectures with various techniques efficiently and at scale.
  • Work across the stack: systems-level concepts like Kubernetes, cgroups, storage systems, and networking topologies; PyTorch distributed tensor computation; GPU kernels.

Requirements

  • Deep understanding of modern ML techniques and tools for training transformers.
  • Advanced experience in a tensor/array computation library like PyTorch, TensorFlow, JAX, or similar.
  • Detailed understanding of transformer training parallelism strategies such as data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism.
  • Experience profiling and improving the performance of a distributed GPU program in PyTorch or similar.
  • Ability to perform roofline analysis on a transformer training setup.
  • Willingness to dive into messy problems, work with researchers, derive specifications, and execute.
  • Familiarity with HPC and distributed computing platforms like Slurm, Ray, Kubernetes, Dask.
  • Familiarity with cluster networking technology like InfiniBand, RoCE, GPUDirect.
  • Solid fundamentals in operating systems concepts like processes, files, kernel drivers, containerisation, and networking protocols.
  • Creativity and a willingness to ask difficult questions about approach, assumptions, and tooling choices.

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Generous PTO policy, including a company-wide Winter Break.
  • Paid parental leave.
  • Company-facilitated 401(k).
  • Exposure to a variety of ML startups.

Skills

PyTorch, TensorFlow, JAX, Kubernetes, Slurm, Ray, Dask, InfiniBand, RoCE, GPUDirect
