AI Engineer, Quality (Evals)
Owns evaluation infrastructure for AI agents in audit workflows, building unified platforms, automated pipelines, observability, and feedback loops to ensure enterprise-scale reliability. Requires experience with LLMs, TypeScript/Python, and production AI systems.
What You'll Own
Measurable AI Agents
- Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
- Build observability systems that surface agent behavior, execution traces, and failure modes in production, along with feedback loops that turn production failures into first-class evaluation cases
- Own the evaluation infrastructure stack including integration with LangSmith and LangGraph
- Translate customer problems into concrete agent behaviors and workflows
- Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences
Rapid Model Evaluation
- Build automated pipelines that evaluate new models against all critical workflows within hours of release
- Design evaluation harnesses for our most complex agentic systems and workflows
- Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
- Design guardrails and monitoring systems that catch quality regressions before they reach customers
AI-Native Engineering Execution
- Use AI as core leverage in how you design, build, test, and iterate
- Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
- Build evaluations, feedback mechanisms, and guardrails so agents improve over time
- Work with SMEs and ML Engineers to create evaluation datasets by curating production traces
- Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale
Ownership of Quality and Large Product Areas
- Define and document evaluation standards, best practices, and processes for the engineering organization
- Advocate for evaluation-driven development and make it easy for the team to write and run evals
- Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
- Take full ownership of large product areas rather than executing on narrow tasks
Who You Are
You are an engineer who believes that evaluations are foundational to building reliable AI systems.
Experience
- Multiple years of experience shipping production software in complex, real-world systems
- Experience with TypeScript, React, Python, and Postgres
- Built and deployed LLM-powered features serving production traffic
- Implemented evaluation frameworks for model outputs and agent behaviors
- Designed observability or tracing infrastructure for AI/ML systems
- Worked with vector databases, embedding models, and RAG architectures
- Experience with evaluation platforms (LangSmith, Langfuse, or similar)
Benefits
- Competitive compensation packages with meaningful ownership
- Flexible PTO
- 401(k)
- Wellness benefits, including a bundle of free therapy sessions
- Technology & Work from Home reimbursement
- Flexible work schedules