Agent Post-Training, Artifacts Research

Train frontier models to generate polished artifacts (docs, spreadsheets, slides) by owning post-training improvements across RL, data, evals, and alignment. Requires strong ML fundamentals and hands-on LLM/RL experience.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE


About the role

Responsibilities

Design and run experiments that improve agentic model behavior for complex software and plugins.
Own end-to-end improvements to the post-training stack, including RL, data pipelines, graders, reward signals, evals, diagnostics, and model-behavior analysis.
Build evals and environments that expose the next set of model failures, then turn those failures into training data, product fixes, or new research directions.
Partner with Codex and ChatGPT product teams to understand what users need and translate product signal into model improvements.
Work on early-training and alignment interventions, including data mixtures, objectives, synthetic data, and eval loops that shape downstream agent behavior.
Help decide which integrations, capabilities, and fixes are ready for inclusion in major model runs.
Improve the machinery for large-scale training and launch: experiment velocity, reliability, observability, reproducibility, cost, latency, and production readiness.
Take on cross-functional projects that touch model training, product infrastructure, and the production agent harness, such as multi-agent systems or training directly against production-like environments.
Debug hard failures in shipped or near-shipped models and turn messy qualitative behavior into concrete hypotheses, experiments, and fixes.

Requirements

Strong technical fundamentals in machine learning, software engineering, systems, statistics, or a related field, and the ability to learn quickly in areas you have not worked in before.
Hands-on experience with LLMs, RL, RLHF/RLAIF, post-training, evals, graders, synthetic data, model training, coding agents, tool-using agents, or production ML systems.
Excited by open-ended problems where the path is unclear, the signal is noisy, and the right answer requires both research taste and engineering execution.
Care about product impact and model behavior, not just benchmark movement. Have opinions about what makes an agent useful, reliable, honest, tasteful, and easy to work with.
Can move from a vague behavioral problem to a concrete experiment: define the hypothesis, build the pipeline, run the model, analyze the result, and decide what to do next.
Comfortable working across research, product, infrastructure, data, evals, and safety boundaries, and can communicate clearly with each group.
Like building load-bearing systems and processes when that is what the team needs, even if the work is not glamorous.
Want to train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users.

Nice-to-Haves

Prior background in consulting, finance, marketing, operations, or data science.

Skills

Machine Learning, Software Engineering, Statistics, LLMs, Reinforcement Learning, RLHF, RLAIF, Post-Training, Evals, Graders, Synthetic Data, Model Training, Coding Agents, Tool-Using Agents, Production ML Systems

Similar roles


OpenAI

Agent Post-Training, Computer Use Research

Train frontier models to operate computers, browsers, and desktops. Design experiments, build evals, own post-training pipelines (RL, data, graders), and ship improvements into OpenAI agents.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RLHF, LLMs

OpenAI

Agent Post-Training, Connectors Research

Train frontier agents to interface with professional software via code, APIs, and structured integrations. Design experiments, own post-training improvements (RL, evals, data), and ship capabilities into major model runs.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RLHF, LLMs

OpenAI

Context Researcher

Context Researcher on the Agent Post-Training team scaling compute on context for frontier agent models. Designs experiments, owns post-training improvements, builds evals, and ships capabilities into Codex and ChatGPT.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | LLMs, RLHF

OpenAI

Agent Post-Training, Personality

Help shape OpenAI agent personality by turning qualitative collaboration insights into evals, training data, reward signals, and model improvements that reach production.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RL, HCI

OpenAI

Research Engineer/Research Scientist

Research Engineer/Scientist improving model capabilities for personalized AI experiences. Focus on tool-use, instruction following, evaluations, and training improvements. Requires strong ML engineering and research experience.

295k – 555k | San Francisco, CA | ML Engineering | Hybrid | 7+ YOE | Python, Research