Agent Post-Training, Artifacts Research

Train frontier models to generate polished artifacts (docs, spreadsheets, slides) by owning post-training improvements across RL, data, evals, and alignment. Requires strong ML fundamentals and hands-on LLM/RL experience.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE

About the role

Responsibilities

  • Design and run experiments that improve agentic model behavior for complex software and plugins.
  • Own end-to-end improvements to the post-training stack, including RL, data pipelines, graders, reward signals, evals, diagnostics, and model-behavior analysis.
  • Build evals and environments that expose the next set of model failures, then turn those failures into training data, product fixes, or new research directions.
  • Partner with Codex and ChatGPT product teams to understand what users need and translate product signal into model improvements.
  • Work on early-training and alignment interventions, including data mixtures, objectives, synthetic data, and eval loops that shape downstream agent behavior.
  • Help decide which integrations, capabilities, and fixes are ready for inclusion in major model runs.
  • Improve the machinery for large-scale training and launch: experiment velocity, reliability, observability, reproducibility, cost, latency, and production readiness.
  • Take on cross-functional projects that touch model training, product infrastructure, and the production agent harness, such as multi-agent systems or training directly against production-like environments.
  • Debug hard failures in shipped or near-shipped models and turn messy qualitative behavior into concrete hypotheses, experiments, and fixes.

Requirements

  • Have strong technical fundamentals in machine learning, software engineering, systems, statistics, or a related field, and can learn quickly in areas you have not worked in before.
  • Hands-on experience with LLMs, RL, RLHF/RLAIF, post-training, evals, graders, synthetic data, model training, coding agents, tool-using agents, or production ML systems.
  • Excited by open-ended problems where the path is unclear, the signal is noisy, and the right answer requires both research taste and engineering execution.
  • Care about product impact and model behavior, not just benchmark movement. Have opinions about what makes an agent useful, reliable, honest, tasteful, and easy to work with.
  • Can move from a vague behavioral problem to a concrete experiment: define the hypothesis, build the pipeline, run the model, analyze the result, and decide what to do next.
  • Comfortable working across research, product, infrastructure, data, evals, and safety boundaries, and can communicate clearly with each group.
  • Like building load-bearing systems and processes when that is what the team needs, even if the work is not glamorous.
  • Want to train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users.

Nice-to-Haves

  • Prior background in consulting, finance, marketing, operations, or data science.

Skills

Machine Learning, Software Engineering, Statistics, LLMs, Reinforcement Learning, RLHF, RLAIF, Post-Training, Evals, Graders, Synthetic Data, Model Training, Coding Agents, Tool-Using Agents, Production ML Systems

Similar roles

Agent Post-Training, Computer Use Research

Train frontier models to operate computers, browsers, and desktops. Design experiments, build evals, own post-training pipelines (RL, data, graders), and ship improvements into OpenAI agents.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RLHF | LLMs

Agent Post-Training, Connectors Research

Train frontier agents to interface with professional software via code, APIs, and structured integrations. Design experiments, own post-training improvements (RL, evals, data), and ship capabilities into major model runs.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RLHF | LLMs

Context Researcher

Join the Agent Post-Training team to scale compute on context for frontier agent models. Design experiments, own post-training improvements, build evals, and ship capabilities into Codex and ChatGPT.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | LLMs | RLHF

Agent Post-Training, Personality

Help shape OpenAI agent personality by turning qualitative collaboration insights into evals, training data, reward signals, and model improvements that reach production.

295k – 445k | San Francisco, CA | ML Engineering | On-site | 7+ YOE | RL | HCI

Research Engineer/Research Scientist

Research Engineer/Scientist improving model capabilities for personalized AI experiences. Focus on tool-use, instruction following, evaluations, and training improvements. Requires strong ML engineering and research experience.

295k – 555k | San Francisco, CA | ML Engineering | Hybrid | 7+ YOE | Python | Research