Senior Software Engineer, AI Evals

240k – 280k · San Francisco, CA · Hybrid · 5+ YOE
Summary

Build evaluation infrastructure for AI systems at Sentry, designing datasets, benchmarks, and test harnesses to measure the accuracy and reliability of debugging agents. Requires 5+ years of experience, Python/TypeScript proficiency, and an AI/ML background.

About the role

In this role you will

  • Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
  • Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
  • Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
  • Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
  • Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring

You’ll love this job if you

  • Care deeply about correctness, rigor, and measurement in AI systems
  • Enjoy turning fuzzy product goals and model behavior into concrete tests and metrics
  • Like building foundational infrastructure that unlocks faster iteration and higher confidence for the entire AI team
  • Thrive in cross-functional environments and enjoy influencing model design through better evaluation

Qualifications

  • 5+ years of professional experience and a Bachelor’s degree in computer science, machine learning, or a related field
  • Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
  • Comfort writing production-quality code (Python and TypeScript)
  • Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
  • Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
  • Bonus: experience evaluating LLMs, agentic systems, or AI-assisted developer tools

Base salary range: $240,000 to $280,000 USD.

Skills
Python, TypeScript, Machine Learning, AI Evaluation, LLMs, Datasets, Metrics Pipelines, Evaluation Frameworks, Agentic Systems, Data Infrastructure