Staff Software Engineer - AI Agent Evaluations

218k – 271k | Mountain View, CA | ML Engineering | Onsite | 8+ YOE

Summary

Staff-level engineer defining and leading AI agent evaluation frameworks, building eval infrastructure, and establishing production observability for LLM-powered systems. Requires 8+ years experience and deep expertise in testing non-deterministic AI agents.

About the role

What You'll Do

  • Define AI Quality Standards: Own the framework for how ID.me evaluates, validates, and monitors AI agents — from prompt-based features to fully autonomous multi-step workflows.
  • Build Eval Infrastructure: Design and maintain evaluation pipelines for LLM outputs, agent behavior, tool use, and multi-turn interactions across development, staging, and production environments.
  • Production Observability for Agents: Instrument agentic systems to track latency, correctness, hallucination rate, tool misuse, and policy adherence, and to surface the behavioral drift, regressions, and failure modes that traditional metrics miss.
  • Agentic Test Strategy: Lead the design of test suites that handle non-determinism — red-teaming agents, golden dataset construction, LLM-as-judge pipelines, and property-based testing for AI outputs.
  • Champion Developer Experience: Build the internal tooling, feedback loops, and testing workflows that make it fast and safe for engineers to develop and ship AI features with confidence. Reduce friction in the agent development inner loop — local testing, fast eval runs, and clear signal on regressions.
  • Drive AI-First Engineering Culture: Raise the quality bar across the engineering org by establishing patterns, tooling, and education for how teams write, test, and deploy AI features responsibly.
  • Cross-Team Collaboration: Partner with Security, Platform, Product, and AI/ML teams to embed quality gates into agent development workflows.
  • Mentorship: Guide senior and mid-level engineers through evaluation design, observability strategy, and testing approaches specific to AI systems.

Basic Qualifications

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience
  • 8+ years building and operating production software systems
  • Demonstrated experience evaluating or testing LLM-powered features or autonomous agents in production
  • Proficiency with AI-assisted development tools (Claude Code, Cursor, or equivalent) — you build with AI every day
  • Strong backend engineering fundamentals in Python, Java, Go, or equivalent
  • Experience designing test infrastructure, CI/CD quality gates, or evaluation pipelines at scale
  • Experience improving developer experience — building internal tooling, reducing toil, or accelerating engineering workflows
  • Proven ability to lead cross-team technical initiatives and influence engineering standards
  • Strong written and verbal communication across engineering, product, and leadership
  • Experience building eval frameworks for LLM agents (e.g., correctness graders, LLM-as-judge, human-in-the-loop evals, benchmark dataset curation)
  • Familiarity with agentic frameworks (Claude API / Anthropic SDK, BrainTrust, LangChain, LangGraph, CrewAI, or similar)
  • Production monitoring experience for AI systems: behavioral drift detection, output sampling, shadow scoring
  • Red-teaming or adversarial testing experience for AI models or agents

Preferred Qualifications

  • Background in identity verification, fraud detection, or regulated industries
  • Familiarity with Anthropic's model evaluation methodology or similar published eval research
  • Experience with observability tooling (Datadog, OpenTelemetry) applied to AI workloads
  • Track record of building developer tooling or platforms that other teams adopt widely

Compensation & Benefits

  • Comprehensive medical, dental, and vision coverage; health savings account; flexible spending accounts (medical, limited purpose, dependent care, and commuter benefit accounts)
  • Basic and voluntary life and AD&D insurance
  • 401(k) with company match
  • Parental leave
  • Unlimited paid time off, subject to the terms and conditions of the PTO policy, plus 8 company-wide holidays
  • Short and long-term disability insurance
  • Accident and critical illness insurance
  • Referral bonus policy, employee assistance program, pet insurance, travel assistance program
  • Wellbeing and childcare discounts, benefit advocates, and a learning and development benefit

Skills

Python, Java, Go, RAG, LangChain, LangGraph, CrewAI, Claude API, Anthropic SDK, BrainTrust, Datadog, OpenTelemetry, LLM evaluation, CI/CD, Red-teaming