Agent Post-Training, Frontier Evals and Environments Research

Researcher building frontier RL environments, evaluations, and training signals to steer OpenAI's largest agent training runs and measure model capabilities.

295k – 445k | San Francisco, CA | AI Research | Onsite | 7+ YOE

About the role

Responsibilities

Create ambitious RL environments to push frontier models to their limits and measure model capabilities, skills, and behaviors
Develop new methodologies for automatically exploring model behavior
Dive deep into the science of measurement, including scalability, reliability, and variance of evaluation methodology
Help steer the largest training runs
Design scalable systems and processes to support continuous evaluation
Build self-improvement loops to automate model understanding

Requirements

Strong technical fundamentals in machine learning, software engineering, systems, statistics, or a related field
Hands-on experience with LLMs, RL, RLHF/RLAIF, post-training, evals, graders, synthetic data, model training, coding agents, tool-using agents, or production ML systems
Ability to move from a vague behavioral problem to a concrete experiment: define the hypothesis, build the pipeline, run the model, analyze the result, and decide next steps
Comfort working across research, product, infrastructure, data, evals, and safety boundaries

Nice-to-Haves

Excitement for open-ended problems where the path is unclear and the signal is noisy
Care about product impact and model behavior beyond benchmark movement
Opinions about what makes an agent useful, reliable, honest, tasteful, and easy to work with
Willingness to build load-bearing systems and processes even when the work is not glamorous

Skills

Machine Learning, Software Engineering, Statistics, LLMs, Reinforcement Learning, RLHF, RLAIF, Post-Training, Evaluations, Graders, Synthetic Data, Model Training, Coding Agents, Tool-Using Agents, Production ML Systems

Similar roles

AI Research jobs

Anthropic

Research Engineer, Discovery

Builds large-scale infrastructure for AI scientist training, evaluation, and deployment, resolving bottlenecks in distributed systems for scientific AGI. Requires 6+ years in infrastructure engineering with expertise in ML stacks, containers, and data pipelines.

350k – 850k | San Francisco, CA | AI Research | Hybrid | 6+ YOE | JAX | AWS

Luma AI

Applied Research Scientist / Engineer

Work as a full-stack applied researcher adapting multimodal video foundation models for production. Focus on controllability, personalization, and end-user quality using SFT, RL, and data-driven refinement.

200k – 450k | New York, NY +1 | AI Research | Hybrid | 7+ YOE | RL | SFT

Decagon

Senior Research Engineer, Voice + Speech

Lead development of models and algorithms for real-time voice agents, advancing speech understanding, naturalness, and production deployment in conversational AI. Requires 5+ years in AI/ML with experience deploying LLMs.

200k – 400k | New York, NY | AI Research | On-site | 5+ YOE | LLMs | Python

EliseAI

Senior Research Scientist

Leads end-to-end research initiatives in machine learning and large language models for conversational AI in housing and healthcare. Requires PhD plus 5+ years post-PhD experience, strong ML expertise, and Python proficiency.

200k – 320k | San Francisco, CA | AI Research | On-site | 5+ YOE | R | LLMs

EliseAI

Senior Research Scientist

Leads end-to-end research initiatives in machine learning and large language models for conversational AI in housing and healthcare. Requires PhD in relevant field plus 5+ years post-PhD experience, strong ML expertise, and Python proficiency.

200k – 320k | New York, NY | AI Research | On-site | 5+ YOE | R | LLMs