AI Engineer, Evaluation

150k – 250k · San Francisco, CA · New York, NY · Hybrid · 2+ YOE · Jun 18

Summary

Design and implement evaluation frameworks and pipelines for AI systems using Evaluation-Driven Development. Build Python-based test suites, LLM graders, and measurement systems that guide prompt iteration and production deployment decisions.

About the role

Key Responsibilities

Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases
Develop and maintain offline and online evaluation pipelines that integrate directly into system iteration loops
Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments
Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts

What We Require

2+ years of software engineering experience
Strong Python Engineering Skills: Write clean, maintainable Python and comfortably build evaluation and experimentation pipelines that run in production environments
Experience with Evaluation-Driven or Experiment-Driven Development: Have used structured evaluation or experimentation frameworks to drive system iteration
Ability to Translate Human Judgment into Code: Work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders
Systems-Oriented Mindset: Understand how evaluation interacts with prompts, agents, data, and deployment
AI-Native Working Style: Use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration
Travel: Travel 10–50% of the time, depending on the project

What We Offer

Base salary range: $150K – $250K
Meaningful equity
100% covered medical, dental, and vision for employees and dependents
401(k) with additional perks
Access to state-of-the-art models and modern AI tools
Offices in San Francisco and New York with a hybrid collaboration model (3+ days per week in-office, Tuesday–Thursday)

Skills

Python, Evaluation Frameworks, Experimentation, LLM-based Graders, Prompt Engineering, AI Systems, Test Case Development, Regression Testing, Production Pipelines, Model Evaluation
