Research Engineer – Evals

Builds evaluation systems to measure Firecrawl's web data extraction quality across diverse websites and workflows. Designs metrics, pipelines, benchmarks, and LLM judges, and integrates them into CI/CD and model training loops. Requires 3+ years in ML engineering or data quality with production systems.

160k – 240k | San Francisco, CA | ML Engineering | Hybrid | 3+ YOE


About the role

What You’ll Do

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship.
Design benchmarks that reflect reality. Build benchmark datasets that cover the real distribution of what customers send, including edge cases.
Own LLM-as-judge pipelines. Design and validate automated judges that score extraction quality at scale, and build human review tooling.
Close the loop with models and RL. Turn quality measurements into reward signals and feedback loops.
Run fast experiments and communicate clearly.

What We're Looking For

Builds their own eval infrastructure: pipelines, datasets, rubrics, judges.
Knows what "good" means for unstructured web data.
Fluent in LLM evaluation methodology: LLM-as-judge, rubrics, human review.
Production-minded: evals reflect real production behavior.
Fast and clear.

Backgrounds that tend to do well: ML engineers who have built eval or data quality systems, worked on LLM fine-tuning or RLHF, or worked across data infrastructure and model development.

Bonus Points: Experience at a scraping, automation, or security startup; ex-founder.

Skills

LLM Evaluation, ML Engineering, Pipelines, CI/CD, RLHF, Benchmark Datasets, LLM-as-Judge, Data Quality, Human Review, Reward Signals
