Member of Technical Staff — RL Research

Seeking a new or recent PhD to own RL and post-training for large-scale omni models. You will build and scale the full RL/post-training stack, including rollout, optimization, reward modeling, and evaluation, for real-time audiovisual AI.

250k – 350k | Seattle, WA | ML Engineering | Onsite | Entry level

About the role

What You’ll Own

Build Nuance’s RL/post-training stack from 0→1: rollout generation, policy optimization, reward/reference model serving, data feedback loops, evaluation, checkpointing, observability, and debugging.
Develop and scale post-training methods such as PPO, GRPO, DPO, rejection sampling, RLHF/RLAIF, online RL, and model-based data improvement.
Design the systems abstractions that connect research ideas to production-scale RL runs: trainers, rollout workers, reward models, evaluators, data queues, experience buffers, and checkpoint promotion.
Build evaluation and feedback loops for omni behavior: turn-taking, interruption, timing, emotional response, audiovisual coherence, instruction following, and real-time interaction quality.
Optimize the end-to-end post-training loop across rollout throughput, serving latency, GPU utilization, policy update efficiency, queueing, checkpoint overhead, and research iteration speed.
Evolve the platform as algorithms, model architectures, reward definitions, data sources, and evaluation methods change.

What We’re Looking For

A PhD — completed, or in its final stretch — in ML, RL, or a related field, with research depth shown through publications, a strong lab/advisor, or substantial open-source work.
Solid understanding of RL/post-training methods: policy optimization, reward modeling, preference optimization, rejection sampling, KL control, evaluation, and data feedback loops.
Ability to reason about model behavior and training dynamics: reward hacking, unstable rewards, distribution shift, stale policies, mode collapse, over-optimization, noisy preferences, and evaluation mismatch.
Exposure to RL/post-training pipelines through research, internships, or open-source work, using frameworks such as verl, ms-swift, OpenRLHF, or equivalent, plus familiarity with rollout serving systems such as vLLM.
Strong software engineering fundamentals and the appetite to build real systems, not just prototypes.
Curiosity and adaptability toward new RL algorithms, model architectures, serving systems, evaluation methods, and research ideas.

Bonus Points

Hands-on experience with omni or multimodal post-training for audio-video-language models, especially long-context or real-time interactive systems.
Experience with PPO, GRPO, DPO, online RL, RLHF/RLAIF, reward modeling, preference data, synthetic data generation, or model-based data improvement.
Prior 0→1 experience building post-training systems, RL pipelines, agent training systems, evaluation platforms, or model improvement loops.
Experience with adjacent areas such as distributed pretraining, data infrastructure, inference serving, simulation, human/AI feedback collection, or evaluation infrastructure.
Publications or substantial open-source contributions in RL, post-training, alignment, evaluation, ML systems, or model behavior.

Compensation

$250,000 – $350,000 base salary, plus meaningful equity.

Benefits

HSA plan with ~$2,000 in annual company contributions.
15 days of PTO plus public holidays, and office closure for a full week at year-end.
Lunch, drinks, and snacks provided every workday.
Commuter benefits.
401(k) in progress.

Skills

Reinforcement Learning, PPO, DPO, RLHF, Reward Modeling, vLLM, verl, OpenRLHF, Policy Optimization, Distributed Training
