Senior Member of Technical Staff, AI Quality

176k – 253k · San Francisco, CA · Onsite · 3+ YOE · Jun 1

Summary

The Senior Member of Technical Staff, AI Quality will build and operate evaluation frameworks for production LLM systems, focusing on creating robust regression suites and monitoring tools to ensure the quality and reliability of AI agents.

The Role

Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.

Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.

What You'll Do

Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice
Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.
Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.
Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.
Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.
Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.

You Might Be a Fit If…

You've built or operated eval frameworks for production LLM systems
You can describe a specific regression that an eval suite you built caught - and how it would have leaked into production otherwise
You've designed an LLM-as-judge rubric that survived human calibration
You can debug a hallucination by reading transcripts, not aggregate dashboards
You write code with AI daily and have strong opinions on which agent behaviors matter
You're 3–6 years into your career

Requirements

3–6 years of software engineering experience
Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets
Familiarity with at least one major eval framework
Strong written communication - eval rubric docs, failure-mode taxonomies
Based in San Francisco or willing to relocate

Nice to Have

Open-source contribution to eval frameworks
Red-team / adversarial-testing experience for LLM systems
Voice AI eval experience (latency, interruption handling, transcription accuracy)
ML eval / observability background

Compensation

OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)
Equity: competitive equity, so you share in the company you are helping build

Benefits

Health, dental, and vision insurance
Commuter benefits
Team meals and snacks

Skills

LLM systems, eval frameworks, golden datasets, LLM-as-judge, CI, drift detection, voice AI eval, ML eval, observability
