Senior Software Engineer, AI Evals
Build evaluation infrastructure for AI systems at Sentry, designing datasets, benchmarks, and test harnesses to measure the accuracy and reliability of debugging agents. Requires 5+ years of experience, Python/TypeScript proficiency, and an AI/ML background.
In this role you will
- Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
- Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
- Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
- Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
- Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring
You’ll love this job if you
- Care deeply about correctness, rigor, and measurement in AI systems
- Enjoy turning fuzzy product goals and model behavior into concrete tests and metrics
- Like building foundational infrastructure that unlocks faster iteration and higher confidence for the entire AI team
- Thrive in cross-functional environments and enjoy influencing model design through better evaluation
Qualifications
- 5+ years of professional experience and a Bachelor’s degree in computer science, machine learning, or a related field
- Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
- Comfort writing production-quality code (Python and TypeScript)
- Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
- Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
- Bonus: experience evaluating LLMs, agentic systems, or AI-assisted developer tools
Base salary range: $240,000 to $280,000 USD.
Machine Learning Engineer - Embedded Insights
Drive ML initiatives from concept to production on the Embedded Insights team. Identify opportunities, build and deploy models using Plaid's financial datasets, and partner with product teams to deliver scalable customer-facing intelligence products.
Machine Learning Engineer
Advance Plaid’s foundation models by developing novel architectures, pretraining objectives, and fine-tuning strategies. Work across the full ML stack from data engineering to production serving and monitoring.
Senior Machine Learning Engineer
Build and deploy cutting-edge Agentic AI and LLM systems to transform Airbnb's customer service experience, including Chat and Voice AI assistants. Requires 6+ years of experience with production ML/AI systems at scale.
Staff Software Engineer, Agents
Build and own end-to-end AI agents for enterprise customers, integrating the latest text/voice models and iterating based on real-world usage. Requires 8+ years of software engineering experience with Python and TypeScript.
Staff Machine Learning Engineer, Notifications Relevance
Technical leader for Reddit's Notifications Relevance ML systems, driving large-scale recommendation systems spanning retrieval, ranking, budget optimization, and LLM-powered experiences.