Machine Learning Engineer, LLM Evals & Observability

Build evaluation pipelines, LLM judges, and observability tools to measure and improve AI assistant quality. Requires 2+ years of software engineering experience in Go and Python, experience with LLM evaluation, and the analytical rigor needed for backend ML infrastructure.

200k – 300k | Mountain View, CA | ML Engineering | Hybrid | 2+ YOE

About the role

Responsibilities

Design and curate evaluation datasets – sampling strategies, query diversity, and golden sets that give reliable, representative coverage of real assistant behavior.
Build and maintain large-scale evaluation pipelines that measure assistant quality across thousands of real user queries.
Build LLM-powered judges that score metrics like correctness, completeness, and response quality, and align them against human judgment.
Evaluate new models and product changes before they ship – providing the quality signal that gates launches and prevents regressions.
Build observability infrastructure for AI agents: trace enrichment, data pipelines, and dashboards that make assistant behavior inspectable.
Close the loop between quality measurement and improvement: use eval results, customer feedback, and techniques like automated prompt iteration to drive concrete gains in assistant behavior.
Collaborate with engineers across the company to make evals a first-class part of how we ship.

Requirements

2+ years of software engineering experience with strong coding skills.
Strong backend fundamentals in Go and Python; comfortable with distributed data pipelines.
Experience working with LLM evaluation, reinforcement learning from human feedback, natural language processing, or other large systems involving machine learning.
Analytically rigorous – you think carefully about what offline metrics actually predict about real user experience.
You thrive in a customer-focused, tight-knit, cross-functional environment – you are a team player willing to take on whatever is most impactful for the company.
You care about quality – not just in the systems you build, but in the product you're helping measure and improve.

Compensation & Benefits

Base salary range: $200,000 - $300,000 annually.
Eligible for variable compensation, equity, and benefits.
Comprehensive benefits: Medical, Vision, Dental, generous time-off, 401k, home office stipend, education and wellness stipends, company events, daily lunches.

Skills

Python, Go, LLM Evaluation, Natural Language Processing, Distributed Data Pipelines, Evaluation Pipelines, LLM-Powered Judges, Observability Infrastructure, Machine Learning
