Senior Member of Technical Staff, AI Quality
The Senior Member of Technical Staff, AI Quality will build and operate evaluation frameworks for production LLM systems, focusing on creating robust regression suites and monitoring tools to ensure the quality and reliability of AI agents.
The Role
Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.
Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.
What You'll Do
- Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice
- Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.
- Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.
- Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.
- Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.
- Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.
You Might Be a Fit If…
- You've built or operated eval frameworks for production LLM systems
- You can describe a specific regression caught by an eval suite you built, and how it would have leaked otherwise
- You've designed an LLM-as-judge rubric that survived human calibration
- You can debug a hallucination by reading transcripts, not aggregate dashboards
- You write code with AI daily and have strong opinions on which agent behaviors matter
- You're 3–6 years into your career
Requirements
- 3–6 years of software engineering experience
- Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets
- Familiarity with at least one major eval framework
- Strong written communication - eval rubric docs, failure-mode taxonomies
- Based in San Francisco or willing to relocate
Nice to Have
- Open-source contribution to eval frameworks
- Red-team / adversarial-testing experience for LLM systems
- Voice AI eval experience (latency, interruption handling, transcription accuracy)
- ML eval / observability background
Compensation
- OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)
- Equity: competitive equity, so you share in the company you are helping build
Benefits
- Health, dental, and vision insurance
- Commuter benefits
- Team meals and snacks