What You’ll Do
- Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good — across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship.
- Design benchmarks that reflect reality. Build benchmark datasets that cover the real distribution of what customers send, including edge cases.
- Own LLM-as-judge pipelines. Design and validate automated judges that score extraction quality at scale, and build human review tooling.
- Close the loop with models and RL. Turn quality measurements into reward signals and feedback loops.
- Run fast experiments and communicate clearly.
What We're Looking For
- Builds their own eval infrastructure: pipelines, datasets, rubrics, judges.
- Knows what "good" means for unstructured web data.
- Fluent in LLM evaluation methodology: LLM-as-judge, rubrics, human review.
- Production-minded: evals reflect real production behavior.
- Fast and clear.
Backgrounds that tend to do well: ML engineers with experience in eval/data quality systems, LLM fine-tuning/RLHF, or data infra and model development.
Bonus Points: Experience at a scraping, automation, or security startup; ex-founder.