Senior Software Engineer, AI Evals

240k – 280k · San Francisco, CA · Hybrid · 5+ YOE · Jan 28

Summary

Build evaluation infrastructure for AI systems at Sentry, designing datasets, benchmarks, and test harnesses to measure the accuracy and reliability of debugging agents. Requires 5+ years of experience, proficiency in Python and TypeScript, and an AI/ML background.

About the role

In this role you will

Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring

You’ll love this job if you

Care deeply about correctness, rigor, and measurement in AI systems
Enjoy turning fuzzy product goals and model behavior into concrete tests and metrics
Like building foundational infrastructure that unlocks faster iteration and higher confidence for the entire AI team
Thrive in cross-functional environments and enjoy influencing model design through better evaluation

Qualifications

5+ years of professional experience and a Bachelor’s degree in computer science, machine learning, or a related field
Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
Comfort writing production-quality code (Python and TypeScript)
Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
Bonus: experience evaluating LLMs, agentic systems, or AI-assisted developer tools

Base salary range: $240,000 to $280,000 USD.

Skills

Python, TypeScript, Machine Learning, AI Evaluation, LLMs, Datasets, Metrics Pipelines, Evaluation Frameworks, Agentic Systems, Data Infrastructure
