Research, Pre-Training Data
Designs and implements methods for sourcing, curating, and analyzing large-scale pre-training datasets for AI models, blending research with production-grade data engineering. Requires Python proficiency, familiarity with a deep learning framework, and strong ML fundamentals.
What You’ll Do
- Design and implement techniques for curating, sourcing, and filtering large-scale text, code, and multimodal data.
- Develop data quality metrics and analysis to measure coverage, diversity, and representativeness across sources.
- Collaborate with research and infrastructure teams to scale data processing systems efficiently and reproducibly.
- Investigate and mitigate data risks, including privacy, safety, and licensing concerns, to ensure responsible and ethical data use.
- Continuously evaluate dataset improvements by analyzing their downstream effects on model learning and behavior.
- Publish and present research that moves the entire community forward. Share code, datasets, and insights that accelerate progress across industry and academia.
Skills and Qualifications
Minimum qualifications:
- Proficiency in Python and familiarity with at least one deep learning framework (e.g., PyTorch, TensorFlow, or JAX). Comfortable with debugging distributed training and writing code that scales.
- Bachelor’s degree or equivalent experience in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding.
- Clear communication, including the ability to explain complex technical concepts in writing.
Preferred qualifications:
- A strong grasp of probability, statistics, and ML fundamentals. You can look at experimental data and distinguish between real effects, noise, and bugs.
- Experience with curation, preprocessing, and analysis of large-scale text, code, or multimodal datasets.
- Prior experience in data engineering, dataset construction, or large-scale web data processing for machine learning models.
- Experience evaluating or improving training data quality and knowledge of data ethics, safety, and licensing frameworks relevant to AI dataset creation.
- Contributions to open datasets, research publications, or data tooling.
- PhD in Computer Science, Machine Learning, Physics, Mathematics, or a related discipline with strong theoretical and empirical grounding, or equivalent industry research experience.
Logistics
Compensation: Depending on background, skills, and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.