AI Engineer, Quality (Evals)

170k – 220k | San Francisco, CA | ML Engineering | Remote | 3+ YOE | Apr 20

Summary

Owns evaluation infrastructure for AI agents in audit workflows, building unified platforms, automated pipelines, observability, and feedback loops to ensure enterprise-scale reliability. Requires experience with LLMs, TypeScript/Python, and production AI systems.

About the role

What You'll Own

Measurable AI Agents

Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
Own the evaluation infrastructure stack including integration with LangSmith and LangGraph
Translate customer problems into concrete agent behaviors and workflows
Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences

Rapid Model Evaluation

Build automated pipelines that evaluate new models against all critical workflows within hours of release
Design evaluation harnesses for our most complex agentic systems and workflows
Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
Design guardrails and monitoring systems that catch quality regressions before they reach customers

AI-Native Engineering Execution

Use AI as core leverage in how you design, build, test, and iterate
Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
Build evaluations, feedback mechanisms, and guardrails so agents improve over time
Work with SMEs and ML Engineers to create evaluation datasets by curating production traces
Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale

Ownership of Quality and Large Product Areas

Define and document evaluation standards, best practices, and processes for the engineering organization
Advocate for evaluation-driven development and make it easy for the team to write and run evals
Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
Take full ownership of large product areas rather than executing on narrow tasks

Who You Are

You are an engineer who believes that evaluations are foundational to building reliable AI systems.

Experience

Multiple years of experience shipping production software in complex, real-world systems
Experience with TypeScript, React, Python, and Postgres
Experience building and deploying LLM-powered features that serve production traffic
Experience implementing evaluation frameworks for model outputs and agent behaviors
Experience designing observability or tracing infrastructure for AI/ML systems
Experience working with vector databases, embedding models, and RAG architectures
Experience with evaluation platforms (LangSmith, Langfuse, or similar)

Benefits

Competitive compensation packages with meaningful ownership
Flexible PTO
401k
Wellness benefits, including a bundle of free therapy sessions
Technology and work-from-home reimbursement
Flexible work schedules

Skills

Python, TypeScript, React, Postgres, LangSmith, LangGraph, LLMs, RAG, Vector Databases, Observability, ML Engineering, Prompt Engineering, Agentic Systems
