Builds fast, reliable auto-evaluation infrastructure for an AI research platform focused on supporting pharma decision-making. Owns backend systems, ML eval interfaces and dashboards, and statistical reliability; requires 3+ years of backend experience.
About the role
Responsibilities
Core auto-eval platform
- Build a comprehensive system that runs fast, is easy to use, and makes it quick to build new evals:
  - Speed: Build lightning-fast basic evals infrastructure that schedules tasks with practically no added latency, and address the fundamental sources of latency (building a version of Elicit, running it on a query, and evaluating the result using LMs).
  - Interfaces: ML engineers need evals to kick off automatically on relevant commits, with results they can see at a glance and drill into. Product managers need dashboards showing performance over time and what's going wrong in production.
  - Architecture: Ensure the code is well-architected so other team members and ML engineers can understand and build on it.
Ensuring evaluations are accurate and reliable
- Evaluate how well Elicit helps with decision-making in pharma, encoding real knowledge about pharma customer decisions (e.g., choosing appropriate gold standards).
- Provide appropriate statistical tests and confidence intervals.
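As an illustration of the kind of statistical reporting described above, here is a minimal sketch of a confidence interval for an eval pass rate. It assumes each eval case is scored as an independent pass/fail and uses the Wilson score interval, which is one common choice for binomial proportions; the function name and the example numbers are hypothetical and are not taken from Elicit's codebase.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    Hypothetical helper for illustration only. z = 1.96 corresponds to an
    approximately 95% two-sided interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")

    p_hat = passes / total
    denom = 1 + z**2 / total
    # The center is pulled toward 0.5, which keeps the interval sensible
    # for small samples and for pass rates near 0 or 1.
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Made-up example: 80 of 100 eval cases passed.
    low, high = wilson_interval(80, 100)
    print(f"pass rate 0.80, approx. 95% CI: [{low:.3f}, {high:.3f}]")
```

Reporting an interval alongside the point estimate makes it easier to tell whether a difference between two runs is larger than the noise expected from the number of eval cases.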
Time allocation
- 60% on core eval platform.
- 15% working with the evals team to build and improve specific evals.
- 10% mentoring an evals engineering intern.
- The remaining 15% on learning how users interact with the product and understanding their needs.
Requirements
- At least 3 years of experience as a professional software engineer, with demonstrated experience building complex backend systems (e.g., backend for a complex website, data pipelines).
- Aptitude and interest in evaluating how Elicit helps with pharma decision-making.
Nice-to-haves
- Knowledge of statistics (e.g., calculating power and confidence intervals for evals).
- Experience with advanced Python (asyncio/trio and parallel processing strategies).
- Front-end experience and strong UX sensibility (building dashboards). TypeScript experience is a plus.
- Experience building developer tools.
- Previous experience as a data engineer or working on AI infrastructure.
- Knowledge of pharma/biomed.
- Experience evaluating ML systems.
- Experience building language-model-based systems.
Compensation
- Career (L3): $140-170k + equity.
- Senior (L4): $165-200k + equity.
Skills
Python, Asyncio, TypeScript, Statistics, Data Pipelines, AI Infrastructure, ML Evaluation, Language Models, Developer Tools, Dashboards