Research, Pre-Training Data

350k – 475k | San Francisco, CA | ML Engineering | Onsite | May 4

Summary

Designs and implements methods for sourcing, curating, and analyzing large-scale pre-training datasets for AI models, blending research with production-grade data engineering. Requires Python proficiency, familiarity with deep learning frameworks, and strong ML fundamentals.

About the role

What You’ll Do

Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
Publish and present research that moves the entire community forward. Share code, datasets, and insights that accelerate progress across industry and academia.

Skills and Qualifications

Minimum qualifications:

Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX). Comfortable with debugging distributed training and writing code that scales.
Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
Clear communication, including the ability to explain complex technical concepts in writing.

Preferred qualifications:

A strong grasp of probability, statistics, and ML fundamentals. You can look at experimental data and distinguish between real effects, noise, and bugs.
Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
Contributions to open datasets, research publications, or data tooling.
PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding, or equivalent industry research experience.

Logistics

Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.

Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.

Skills

Python, PyTorch, TensorFlow, JAX, Machine Learning, Data Engineering, Distributed Training, Statistics, Probability, Multimodal Data
