Senior Member of Technical Staff, AI Quality

$176k – $253k · San Francisco, CA · Onsite · 3+ YOE
Summary

The Senior Member of Technical Staff, AI Quality will build and operate evaluation frameworks for production LLM systems, focusing on creating robust regression suites and monitoring tools to ensure the quality and reliability of AI agents.

About the role

The Role

Harper operates like a factory with a series of modules spanning the full lifecycle from intake through renewals. Across them we run a stack of internal AI systems covering operator guidance, the operational backbone that matches risks to underwriters, autonomous communications, and voice AI for customer interactions.

Every one of those agents needs to be evaluated, regression-tested, and monitored in production. You'll work alongside the engineer setting the AI-quality direction and own a specific agent surface end-to-end.

What You'll Do

  • Build capability + regression eval suites for assigned agents - intake, submissions, placements, renewals, CRM, or voice
  • Curate golden datasets - Real failure modes from real customer transcripts, real underwriter back-and-forth, real call recordings. 20–50 quality cases per agent, not thousands of synthetic ones.
  • Design graders - Deterministic first (string match, state check, tool-call assertions). LLM-as-judge where deterministic fails. Human calibration on samples.
  • Ship pre-merge eval gates - Every PR touching an agent / prompt / tool runs the relevant suite in CI. Below threshold → blocked.
  • Wire production trajectory monitoring - Online evaluators score live trajectories. Drift detection within hours.
  • Convert ops findings into tests - Critique's flagged failures become regression cases. Every repeat issue becomes a permanent test.
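
To make the grader and gate bullets above concrete, here is a minimal sketch of a deterministic-first grader with an optional LLM-as-judge fallback and a pass-rate threshold. It is illustrative only, not Harper's implementation: `Case`, `Trajectory`, `grade`, `pass_rate`, `gate`, and the 0.9 default threshold are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Case:
    """One golden-dataset case (hypothetical schema)."""
    case_id: str
    expected_tool: str   # tool the agent is expected to call
    must_contain: str    # substring expected in the final reply


@dataclass
class Trajectory:
    """What the agent did on a case (hypothetical schema)."""
    tool_calls: List[str]
    final_reply: str


# A judge is any callable returning pass/fail; in practice it might wrap an LLM.
Judge = Callable[[Case, Trajectory], bool]


def grade(case: Case, traj: Trajectory, judge: Optional[Judge] = None) -> bool:
    # Tool-call assertion: a missing expected tool call is a hard fail.
    if case.expected_tool not in traj.tool_calls:
        return False
    # String match: a case-insensitive substring check settles the easy cases.
    if case.must_contain.lower() in traj.final_reply.lower():
        return True
    # Deterministic check did not pass; defer to the judge if one is supplied.
    return judge(case, traj) if judge is not None else False


def pass_rate(cases: List[Case],
              run_agent: Callable[[Case], Trajectory],
              judge: Optional[Judge] = None) -> float:
    # Fraction of cases that pass; an empty suite counts as 0.0.
    if not cases:
        return 0.0
    results = [grade(c, run_agent(c), judge) for c in cases]
    return sum(results) / len(results)


def gate(rate: float, threshold: float = 0.9) -> bool:
    # True means the change may merge; False means it is blocked.
    return rate >= threshold
```

In a CI job, a script along these lines could exit non-zero when `gate(...)` returns `False`, which is one common way to block a merge.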

You Might Be a Fit If…

  • You've built or operated eval frameworks for production LLM systems
  • You can describe a specific regression that an eval suite you built caught, and how it would have reached production otherwise
  • You've designed an LLM-as-judge rubric that survived human calibration
  • You can debug a hallucination by reading transcripts, not aggregate dashboards
  • You write code with AI daily and have strong opinions on which agent behaviors matter
  • You're 3–6 years into your career

Requirements

  • 3–6 years of software engineering experience
  • Production LLM / agent eval experience - capability + regression suite design, LLM-as-judge graders, golden datasets
  • Familiarity with at least one major eval framework
  • Strong written communication - eval rubric docs, failure-mode taxonomies
  • Based in San Francisco or willing to relocate

Nice to Have

  • Open-source contribution to eval frameworks
  • Red-team / adversarial-testing experience for LLM systems
  • Voice AI eval experience (latency, interruption handling, transcription accuracy)
  • ML eval / observability background

Compensation

  • OTE: $176,000–$253,000 cash compensation (base salary + target performance bonus)
  • Equity: competitive equity, so you share in the company you are helping build

Benefits

  • Health, dental, and vision insurance
  • Commuter benefits
  • Team meals and snacks
Skills
LLM systems, eval frameworks, golden datasets, LLM-as-judge, CI, drift detection, voice AI eval, ML eval, observability