# Member of Technical Staff — RL Research
**Company:** [Nuance Labs](https://hotfix.jobs/companies/nuance-labs)
**Location:** Seattle, WA
**Salary:** $250K-$350K
**Experience:** 0+ years
**Skills:** Reinforcement Learning, PPO, DPO, RLHF, Reward Modeling, vLLM, verl, OpenRLHF, Policy Optimization, Distributed Training
**Posted:** 2026-06-11
> New/recent PhD to own RL and post-training for large-scale omni models. Build and scale the full RL/post-training stack including rollout, optimization, reward modeling, and evaluation for real-time audiovisual AI.
## Job Description
## What You’ll Own
- Build Nuance’s RL/post-training stack from 0→1: rollout generation, policy optimization, reward/reference model serving, data feedback loops, evaluation, checkpointing, observability, and debugging.
- Develop and scale post-training methods such as PPO, GRPO, DPO, rejection sampling, RLHF/RLAIF, online RL, and model-based data improvement.
- Design the systems abstractions that connect research ideas to production-scale RL runs: trainers, rollout workers, reward models, evaluators, data queues, experience buffers, and checkpoint promotion.
- Build evaluation and feedback loops for omni behavior: turn-taking, interruption, timing, emotional response, audiovisual coherence, instruction following, and real-time interaction quality.
- Optimize the end-to-end post-training loop across rollout throughput, serving latency, GPU utilization, policy update efficiency, queueing, checkpoint overhead, and research iteration speed.
- Evolve the platform as algorithms, model architectures, reward definitions, data sources, and evaluation methods change.

## What We’re Looking For
- A PhD (completed or in its final stretch) in ML, RL, or a related field, with research depth shown through publications, a strong lab/advisor, or substantial open-source work.
- Solid understanding of RL/post-training methods: policy optimization, reward modeling, preference optimization, rejection sampling, KL control, evaluation, and data feedback loops.
- Ability to reason about model behavior and training dynamics: reward hacking, unstable rewards, distribution shift, stale policies, mode collapse, over-optimization, noisy preferences, and evaluation mismatch.
- Exposure to RL/post-training pipelines through research, internships, or open-source work, using frameworks such as verl, ms-swift, OpenRLHF, or equivalent, plus familiarity with rollout serving systems such as vLLM.
- Strong software engineering fundamentals and the appetite to build real systems, not just prototypes.
- Curiosity and adaptability toward new RL algorithms, model architectures, serving systems, evaluation methods, and research ideas.

## Bonus Points
- Hands-on experience with omni or multimodal post-training for audio-video-language models, especially long-context or real-time interactive systems.
- Experience with PPO, GRPO, DPO, online RL, RLHF/RLAIF, reward modeling, preference data, synthetic data generation, or model-based data improvement.
- Prior 0→1 experience building post-training systems, RL pipelines, agent training systems, evaluation platforms, or model improvement loops.
- Experience with adjacent areas such as distributed pretraining, data infrastructure, inference serving, simulation, human/AI feedback collection, or evaluation infrastructure.
- Publications or substantial open-source contributions in RL, post-training, alignment, evaluation, ML systems, or model behavior.

## Compensation
- $250,000 – $350,000 base salary, plus meaningful equity.

## Benefits
- HSA plan with ~$2,000 in annual company contributions.
- 15 days of PTO plus public holidays, and office closure for a full week at year-end.
- Lunch, drinks, and snacks provided every workday.
- Commuter benefits.
- 401(k) in progress.

**Apply:** https://hotfix.jobs/jobs/member-of-technical-staff-rl-research-at-nuance-labs-72e143f9-9ecc-422b-b67d-d6c97094fbaa
**Canonical:** https://hotfix.jobs/jobs/member-of-technical-staff-rl-research-at-nuance-labs-72e143f9-9ecc-422b-b67d-d6c97094fbaa