Applied AI Researcher, Benchmarking

150k – 250k | San Francisco, CA | New York, NY | Hybrid | Oct 16

Summary

Designs and builds AI benchmarks and evaluation frameworks that measure the reasoning, reliability, and real-world impact of intelligent systems. Requires experience running model evaluations, statistical rigor, hands-on experience building with AI models, and strong programming skills for prototyping.

About the role

Key Responsibilities

Design evaluation frameworks that capture reasoning depth, interaction quality, reliability, and operational impact.
Construct benchmarks that reflect real-world complexity to judge new architectures, techniques, and releases.
Explore new paradigms for evaluating intelligent systems: adversarial robustness testing, longitudinal performance tracking, and human-in-the-loop assessment.
Investigate how metrics shape model behavior and establish rigorous methodologies for quantifying emergent capability.

Who You Are (Requirements)

Experience designing and running evaluations: built or maintained benchmarks, test suites, or experimental frameworks.
Statistical and analytical rigor: design fair, reproducible experiments and extract signal from noisy results.
Experience building with models (compound AI systems, agentic collaboration, ensembling, ReAct, graph-of-thoughts, etc.).
Proven track record of research results (publications, public work).
Daily use of AI tools (ChatGPT, Cursor, Perplexity).
Strong programming and data analysis skills for prototypes and experiments.
A bias toward showing rather than telling.

Compensation & Benefits

Base salary: $150K – $250K (depending on experience, location, level).
Meaningful equity.
100% covered medical, dental, and vision for employees and dependents.
401(k), commuter benefits, in-office lunch.
Access to state-of-the-art models and AI tools.

Skills

AI benchmarks, evaluation frameworks, LLM evaluation, ReAct, graph-of-thoughts, ensembling, Python, data analysis, compound AI systems, agentic systems
