Senior Research Engineer, Post-training & Evaluation

230k – 322k · United States · Remote · 6+ YOE · Jun 12

Summary

Own evaluation science and post-training methodology for Reddit's foundational LLMs. Define benchmarks, design model-as-a-judge systems, and set SFT recipes that turn base models into safe, Reddit-native endpoints.

About the role

Responsibilities

Define the "Reddit Benchmark" evaluation standard: Own the methodology for rigorously measuring model quality across safety, reasoning, representation/retrieval, and Reddit-specific knowledge.
Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals, including judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges.
Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models.
Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints.
Evaluate base and CPT checkpoints: Design checkpoint-selection methodology across CPT experiments and LR studies.
Drive synthetic data generation strategy: Define and curate high-quality instruction and evaluation sets to improve generalization where human data is scarce.
Partner with Safety Engineering: Translate high-level safety policy into concrete classification metrics, probe sets, and CI/CD unit tests.
Diagnose post-training instability: Dive into loss curves and eval logs to identify alignment tax and capability degradation.
Lead research direction: Set technical direction for evaluation and post-training across the team, and mentor engineers and scientists.

Requirements

6+ years of professional ML experience (or a PhD plus 4+ years) with a direct focus on LLM post-training and evaluation.
PhD or MS in CS, ML, NLP, IR, or a related quantitative field, or equivalent industry research experience.
Deep expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, statistical significance, and the failure modes of automated evaluation.
Strong experience building custom, domain-specific evaluation harnesses, for example on top of lm-eval-harness, Inspect AI, or LightEval.
Experience evaluating both generation and representation/classification: model-as-a-judge for generative quality; precision/recall, PR-AUC, and retrieval/MTEB-style metrics for representation and classification; and gold-label denoising and label-noise handling.
Deep understanding of Continuous Pre-training (CPT), Instruction Tuning (SFT), and how data quality shapes model behavior.
Fluency in Python; strong data-pipeline and eval-harness engineering (e.g., Hugging Face Transformers, vLLM, lm-eval-harness). Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3).

Nice to Have

Experience with MLflow or similar experiment-tracking frameworks.
Familiarity with modern fine-tuning frameworks (Axolotl, TorchTune) and PyTorch-native training stacks (TorchTitan).
Synthetic data generation techniques (e.g., Self-Instruct).
Experience with preference optimization (DPO, RLHF, RLAIF, GRPO).
Publications in NLP/ML/FAccT or related venues, or other evidence of research leadership.
Experience evaluating multimodal models (embeddings, hateful-memes-style classification).

Skills

Python, PyTorch, Hugging Face Transformers, vLLM, lm-eval-harness, FSDP2, DeepSpeed ZeRO-3, SFT, CPT, RLHF
