# Research Engineer, Evals
**Company:** [Variance](https://hotfix.jobs/companies/variance)
**Location:** San Francisco, CA
**Salary:** $250K-$400K
**Skills:** LLMs, Machine Learning, Benchmarks, Datasets, Evaluation Pipelines, Agent Systems, Retrieval, Post-Training, Python, Experimentation, Fraud Detection, Risk Assessment
**Posted:** 2026-03-31
> Build benchmarks, datasets, and evaluation systems to measure and improve AI model quality for fraud, identity, and risk judgment tasks. Collaborate across research, engineering, and product to drive rigorous experimentation and iteration in high-stakes environments.
## Responsibilities
- Build proprietary benchmarks and datasets to evaluate models and model systems on fraud, identity, and risk workflows
- Design and run offline and online evals that measure model performance on real customer tasks
- Define quality metrics for judgment systems, including precision, calibration, consistency, abstention, and failure handling
- Study where models and agents break, and turn those failures into better evals, datasets, and training loops
- Build reusable evaluation tools and quality building blocks
- Partner closely with research, engineering, product, and design to improve system quality
- Help create a strong culture of scientific experimentation, clear measurement, and continuous iteration

## Requirements
- Care deeply about craftsmanship and have strong opinions about model quality, measurement, and experimental rigor
- Want to work on core model and agent behavior
- Are excited by defining what “good” looks like in messy, high-stakes environments
- Think in tight loops: hypothesis, benchmark design, evaluation, failure analysis, iteration
- Have strong engineering fundamentals and like building robust systems around ambiguous research problems
- Thrive in environments where success criteria are initially underspecified
- Are willing to review outputs, grade edge cases, curate datasets, and refine tasks

## Preferred
- Experience training, evaluating, or improving modern ML systems
- Strong programming skills and comfort working in research-heavy codebases
- Experience building benchmarks, datasets, evaluation pipelines, or quality systems
- Familiarity with LLMs, agent systems, retrieval, post-training, or adjacent areas
- Ability to design clean experiments and draw reliable conclusions from noisy results
- Strong engineering judgment and a bias toward building
- Interest in fraud, risk, trust and safety, compliance, or other regulated and adversarial domains

## Compensation & Benefits
- Competitive salary and meaningful equity
- Platinum-level medical, dental, and vision insurance
- Unlimited PTO, sick leave, and parental leave
- Up to $100 per month in reimbursement for personal health and wellness expenses
- 401(k) plan

**Apply:** https://hotfix.jobs/jobs/research-engineer-evals-at-variance-b26f895d-8d3e-45a8-8acd-55835c2721d3
**Canonical:** https://hotfix.jobs/jobs/research-engineer-evals-at-variance-b26f895d-8d3e-45a8-8acd-55835c2721d3