AI Engineer, Evaluation
Design and implement evaluation frameworks and pipelines for AI systems using Evaluation-Driven Development. Build Python-based test suites, LLM graders, and measurement systems that guide prompt iteration and production deployment decisions.
Key Responsibilities
- Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
- Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
- Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases
- Develop and maintain offline and online evaluation pipelines that integrate directly into system iteration loops
- Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments
- Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts
What We Require
- 2+ years of software engineering experience
- Strong Python Engineering Skills: You write clean, maintainable Python and are comfortable building evaluation and experimentation pipelines that run in production environments
- Experience with Evaluation-Driven or Experiment-Driven Development: Experience using structured evaluation or experimentation frameworks to drive system iteration
- Ability to Translate Human Judgment into Code: Work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders
- Systems-Oriented Mindset: Understand how evaluation interacts with prompts, agents, data, and deployment
- AI-Native Working Style: Use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration
- Travel: 10-50% of the time, depending on the project
What We Offer
- Base salary range: $150K – $250K
- Meaningful equity
- 100% covered medical, dental, and vision for employees and dependents
- 401(k) with additional perks
- Access to state-of-the-art models and modern AI tools
- Offices in San Francisco and New York with a hybrid collaboration model (3+ days per week in-office, Tuesday–Thursday)
Senior Machine Learning Operations Engineer
Build and operate Mercury's real-time ML inference platform for fraud risk decisioning. Own model deployment, observability, and lifecycle tooling with strong backend Python fundamentals.
Senior AI Engineer
Senior Engineer building multi-agent AI systems, LLM integrations, and backend automation services that power Marketing Operations. Owns technical direction for agentic infrastructure connecting models to business systems.
Software Engineer, ML Infrastructure
Build and scale ML infrastructure platform for autonomous vehicle development, focusing on automated resource provisioning, high-performance workload scheduling, and petabyte-scale data processing pipelines.
Software Engineer, ML Infrastructure, Optimization
Build and optimize ML infrastructure for autonomous vehicles, focusing on model optimization, compilers, and deployment across the autonomy stack. Requires 2+ years in ML optimization and strong Python/C++/CUDA skills.