Post-Training Research Engineer

Build in-house tooling for post-training custom ML models using advanced techniques like RL and finetuning. Requires deep expertise in transformer training, PyTorch distributed systems, parallelism strategies, GPU performance optimization, and HPC platforms.

200k – 275k | San Francisco, CA | ML Engineering | Hybrid

About the role

Responsibilities

Build in-house tooling to support post-training of custom models, including reinforcement learning, supervised finetuning, and in-house research techniques.
Train a wide spectrum of model architectures with various techniques efficiently and at scale.
Work across the stack: systems-level concepts like Kubernetes, cgroups, storage systems, and networking topologies; PyTorch distributed tensor computation; GPU kernels.

Requirements

Deep understanding of modern ML techniques and tools for training transformers.
Advanced experience with a tensor/array computation library like PyTorch, TensorFlow, JAX, or similar.
Detailed understanding of transformer training parallelism strategies like data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism.
Experience profiling and improving the performance of distributed GPU programs in PyTorch or similar.
Ability to perform roofline analysis on a transformer training setup.
Willingness to dive into messy problems, work with researchers, derive specifications, and execute.
Familiarity with HPC and distributed computing platforms like Slurm, Ray, Kubernetes, Dask.
Familiarity with cluster networking technology like InfiniBand, RoCE, GPUDirect.
Solid fundamentals in operating systems concepts like processes, files, kernel drivers, containerisation, and networking protocols.
Sense of creativity and willingness to ask difficult questions about approach, assumptions, and tooling choices.

Benefits

Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Generous PTO policy, including a company-wide winter break.
Paid parental leave.
Company-facilitated 401(k).
Exposure to a variety of ML startups.

Skills

PyTorch, TensorFlow, JAX, Kubernetes, Slurm, Ray, Dask, InfiniBand, RoCE, GPUDirect
