# Research Engineer Intern, Evaluations
**Company:** [TensorStax](https://hotfix.jobs/companies/tensorstax)
**Location:** San Francisco, CA
**Skills:** Python, PyTorch, JAX, Reinforcement Learning, RLHF, Snowflake, BigQuery, dbt, Spark, Kafka
**Posted:** 2025-03-18
> Designs evaluation frameworks and benchmarks to test AI agents' autonomy, reasoning, and reliability in data pipelines and warehouses. Requires experience in LLM benchmarking, reinforcement learning, Python, PyTorch/JAX, and data engineering tools.
## Job Description
## What You’ll Do
- Develop evaluation environments to test AI agents' ability to reason, plan, and act autonomously within mission-critical data pipelines.
- Design benchmarks to assess model capabilities in failure detection, pipeline optimization, and agentic decision-making in data workflows.
- Implement automated assessment frameworks for language model-based agents operating over data lakes and warehouses.
- Work with synthetic and real-world datasets to create robust testing environments for AI-driven data automation.
- Collaborate with research engineers to refine reward shaping strategies, guiding models toward more efficient and agentic behaviors in data-intensive tasks.

## What We’re Looking For
- Experience in language model research, with a focus on benchmarking LLMs in mission-critical domains.
- Strong background in AI evaluation methodologies, reinforcement learning, and RLHF techniques.
- Familiarity with benchmarking language models for structured and unstructured data tasks.
- Proficiency in Python and experience with ML frameworks like PyTorch or JAX.
- Hands-on experience with data lakes, warehouses, and data engineering tools (Snowflake, BigQuery, dbt, Spark, Kafka).
- High agency: proactive, resourceful, and comfortable working in a fast-paced research environment with minimal supervision.
- Attention to detail: ability to design rigorous, reproducible experiments and evaluations.

## Bonus Points
- Contributions to open-source AI benchmarks (e.g., SWE-bench, BIRD, Spider).
- Contributions to open-source agentic frameworks.
- Experience developing custom RL environments for AI evaluation.
- Strong understanding of ETL, ELT, and data transformation pipelines.

**Apply:** https://hotfix.jobs/jobs/research-engineer-intern-evaluations-at-tensorstax-1bb5568b-1a83-4624-a440-697d02cd6a4c
**Canonical:** https://hotfix.jobs/jobs/research-engineer-intern-evaluations-at-tensorstax-1bb5568b-1a83-4624-a440-697d02cd6a4c