Senior Research Engineer, Post-training & Evaluation
Own evaluation science and post-training methodology for Reddit's foundational LLMs. Define benchmarks, design model-as-a-judge systems, and set SFT recipes that turn base models into safe, Reddit-native endpoints.
Responsibilities
- Define the "Reddit Benchmark" evaluation standard: Own the methodology for rigorously measuring model quality across safety, reasoning, representation/retrieval, and Reddit-specific knowledge.
- Own evaluation reliability and statistical rigor: Establish the science behind trustworthy evals — judge variance, multi-sample scoring, inter-rater/inter-sample agreement, sampling and temperature effects, and calibration of automated judges.
- Design model-as-a-judge methodology: Own judge selection, prompt design, calibration, and reliability for automated evaluation using frontier external models.
- Set post-training recipes and strategy: Design SFT recipes (data mixtures, curriculum, ablation strategy) that convert base models into helpful, well-aligned endpoints.
- Evaluate base and CPT checkpoints: Design checkpoint-selection methodology across CPT experiments and learning-rate (LR) studies.
- Drive synthetic data generation strategy: Define and curate high-quality instruction and evaluation sets to improve generalization where human data is scarce.
- Partner with Safety Engineering: Translate high-level safety policy into concrete classification metrics, probe sets, and CI/CD unit tests.
- Diagnose post-training instability: Dive into loss curves and eval logs to identify alignment tax and capability degradation.
- Lead research direction: Set technical direction for evaluation and post-training across the team, and mentor engineers and scientists.
Requirements
- 6+ years of professional ML experience (or a PhD plus 4+ years) with a direct focus on LLM post-training and evaluation.
- PhD or MS in CS, ML, NLP, IR, or a related quantitative field — or equivalent industry research experience.
- Deep expertise in evaluation reliability: judge/sample variance, multi-sample scoring, calibration, statistical significance, and the failure modes of automated evaluation.
- Strong experience building custom, domain-specific evaluation harnesses on frameworks such as lm-eval-harness, Inspect AI, or LightEval.
- Experience evaluating both generation and representation/classification: model-as-a-judge for generative quality, plus precision/recall, PR-AUC, retrieval/MTEB-style metrics, gold-label denoising, and label-noise handling for representation and classification.
- Deep understanding of Continuous Pre-training (CPT), Instruction Tuning (SFT), and how data quality shapes model behavior.
- Fluency in Python; strong data-pipeline and eval-harness engineering (e.g., Hugging Face Transformers, vLLM, lm-eval-harness). Working knowledge of PyTorch and distributed training (FSDP2, DeepSpeed ZeRO-3).
Nice to Have
- Experience with MLflow or similar experiment-tracking frameworks.
- Familiarity with modern fine-tuning frameworks (Axolotl, TorchTune) and PyTorch-native training stacks (TorchTitan).
- Synthetic data generation techniques (e.g., Self-Instruct).
- Experience with preference optimization (DPO, RLHF, RLAIF, GRPO).
- Publications in NLP/ML/FAccT or related venues, or other evidence of research leadership.
- Experience evaluating multimodal models (embeddings, hateful-memes-style classification).
Staff Machine Learning Engineer
Staff ML Engineer leading end-to-end identity verification ML systems including document authenticity, face matching, liveness detection, GNN-based identity graphs, and behavioral risk models. Requires 8+ years production ML experience and domain expertise in biometrics or fraud detection.
Staff ML Engineer
Founding Staff ML Engineer building production ML systems for governance, security, and agentic platform capabilities at Docker. Owns architecture, data pipelines, evaluation, and model lifecycle while mentoring the growing team.
Principal Engineer, AI Platform
Principal Engineer setting technical vision and building AI/ML infrastructure for Generative AI and Recommender Systems at Pinterest, scaling to hundreds of millions of inferences per second. Requires deep expertise in distributed systems and proven cross-org technical leadership.