Research Engineer Intern, Evaluations

Designs evaluation frameworks and benchmarks to test AI agents' autonomy, reasoning, and reliability in data pipelines and warehouses. Requires experience in LLM benchmarking, reinforcement learning, Python, PyTorch/JAX, and data engineering tools.

San Francisco, CA · ML Engineering · Hybrid

About the role

What You’ll Do

  • Develop evaluation environments to test AI agents' ability to reason, plan, and act autonomously within mission-critical data pipelines.
  • Design benchmarks to assess model capabilities in failure detection, pipeline optimization, and agentic decision-making in data workflows.
  • Implement automated assessment frameworks for language model-based agents operating over data lakes and warehouses.
  • Work with synthetic and real-world datasets to create robust testing environments for AI-driven data automation.
  • Collaborate with research engineers to refine reward shaping strategies, guiding models toward more efficient and agentic behaviors in data-intensive tasks.

What We’re Looking For

  • Experience in language model research, with a focus on benchmarking LLMs in mission-critical domains.
  • Strong background in AI evaluation methodologies, reinforcement learning, and RLHF techniques.
  • Familiarity with benchmarking language models for structured and unstructured data tasks.
  • Proficiency in Python and experience with ML frameworks like PyTorch or JAX.
  • Hands-on experience with data lakes, warehouses, and data engineering tools (Snowflake, BigQuery, dbt, Spark, Kafka).
  • High agency—proactive, resourceful, and comfortable working in a fast-paced research environment with minimal supervision.
  • Attention to detail—ability to design rigorous, reproducible experiments and evaluations.

Bonus Points

  • Contributions to open-source AI benchmarks (e.g., SWE-bench, BIRD, Spider).
  • Contributions to open-source agentic frameworks.
  • Experience developing custom RL environments for AI evaluation.
  • Strong understanding of ETL, ELT, and data transformation pipelines.

Skills

Python, PyTorch, JAX, Reinforcement Learning, RLHF, Snowflake, BigQuery, dbt, Spark, Kafka
