What You'll Do
- Build high-throughput bulk ingestion workflows to integrate datasets from multiple external providers.
- Design and implement scalable entity-resolution solutions, including record linking, deduplication, clustering, and conflict arbitration.
- Create and refine matching logic, decision rules, and similarity functions to align datasets with high accuracy and strong coverage.
- Define and track data quality indicators, such as overlap metrics, match precision/recall, duplicate rates, and completeness.
- Prepare training-ready datasets in formats such as TFRecords, and structure data to meet ML research requirements.
- Develop processing components using Dataflow (Beam) and manage large analytical workloads in BigQuery.
- Leverage frameworks like Ray to accelerate large-scale experiments, feature extraction, and research-oriented data preparation.
- Collaborate with ML researchers to anticipate downstream requirements and evolve linkage strategies as new sources and use cases emerge.
What We're Looking For
- Experience working with large, heterogeneous datasets from multiple providers or domains.
- Strong background in entity resolution, deduplication, data unification, or related large-scale data integration techniques.
- Proficiency in Python, with an emphasis on efficient, scalable data processing.
- Experience with BigQuery, Google Cloud Dataflow/Apache Beam, or similar batch-processing systems.
- Familiarity with data validation, normalization, reconciliation, and building consistent views across diverse data sources.
- Ability to craft well-structured matching and decision strategies that balance accuracy, completeness, and computational efficiency.
- Comfort iterating quickly on pragmatic solutions, balancing correctness against time to delivery.
- Clear communication skills and the ability to collaborate closely with ML and research teams.
Nice to Have
- Experience architecting systems on Google Cloud Platform at scale.
- Experience with distributed compute frameworks such as Ray, Spark, or Flink.
- Understanding of JAX-based ML pipelines, multihost training setups, or large-scale data preparation for accelerator-backed workflows.
- Familiarity with TFRecords or other high-volume training data formats.
- Exposure to ranking, clustering, or statistical similarity modeling.
- Experience with Go, Next.js, and/or React Native, and interest in contributing to full-stack development.
Compensation
Base salary range: $180,000 - $220,000, plus equity and benefits.