AI Engineer, Evaluation

150k – 250k · San Francisco, CA · New York, NY · Hybrid · 2+ YOE
Summary

Design and implement evaluation frameworks and pipelines for AI systems using Evaluation-Driven Development. Build Python-based test suites, LLM graders, and measurement systems that guide prompt iteration and production deployment decisions.

About the role

Key Responsibilities

  • Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
  • Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
  • Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases
  • Develop and maintain evaluation pipelines—offline and online—that integrate directly into system iteration loops
  • Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments
  • Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts

What We Require

  • 2+ years of software engineering experience
  • Strong Python Engineering Skills: You write clean, maintainable Python and are comfortable building evaluation and experimentation pipelines that run in production environments
  • Experience with Evaluation-Driven or Experiment-Driven Development: You have used structured evaluation or experimentation frameworks to drive system iteration
  • Ability to Translate Human Judgment into Code: Work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders
  • Systems-Oriented Mindset: Understand how evaluation interacts with prompts, agents, data, and deployment
  • AI-Native Working Style: Use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration
  • Travel: 10-50% of the time, depending on the project

What We Offer

  • Base salary range: $150K – $250K
  • Meaningful equity
  • 100% covered medical, dental, and vision for employees and dependents
  • 401(k) with additional perks
  • Access to state-of-the-art models and modern AI tools
  • Offices in San Francisco and New York with a hybrid collaboration model (3+ days per week in-office, Tuesday–Thursday)
Skills
Python, Evaluation Frameworks, Experimentation, LLM-based Graders, Prompt Engineering, AI Systems, Test Case Development, Regression Testing, Production Pipelines, Model Evaluation