# Data Scientist II - Big Data R&D, Identity Graph & KYC
**Company:** [Socure](https://hotfix.jobs/companies/socure)
**Location:** Remote
**Salary:** $140K-$170K
**Experience:** 2+ years
**Skills:** Python, SQL, Spark, PySpark, scikit-learn, XGBoost, AWS, EMR, S3, Databricks, Neo4j, GraphFrames, TensorFlow, PyTorch, Airflow
**Posted:** 2026-04-23
> Develop graph-based algorithms, entity-resolution systems, and data pipelines on massive PII datasets to power KYC and compliance products. Requires a Master's degree with 2+ years of experience or a Ph.D. with 1+ years, proficiency in Python and SQL, and experience with Spark and ML libraries.
## Job Description
## What You'll Do
- Contribute to the design and implementation of machine learning, data mining, statistical, and graph-based algorithms to analyze very large datasets for identity verification and anomaly detection.
- Analyze large datasets to help develop and refine entity-resolution and identity-matching algorithms that drive Socure’s KYC and compliance solutions.
- Build and maintain components of data-processing pipelines (ETL, feature generation, normalization) using tools such as Spark/PySpark and AWS (e.g., EMR, S3).
- Support senior data scientists with feature engineering, data exploration, error analysis, and A/B test setup for new models and signals.
- Help evaluate new third‑party and internal data sources: profile data quality, design offline experiments, and summarize impact on coverage and model performance.
- Implement and maintain SQL and Python/R code for data extraction, transformation, and validation; contribute to code reviews and basic testing.
- Provide analytical support to compliance and regulatory product teams, including ad hoc investigations, simple dashboards, and data deep dives.
- Communicate findings in a clear, structured way to peers and cross‑functional partners (Product, Engineering, Client Analysis), focusing on key insights and trade‑offs.
- Work effectively in a fast‑paced, cross‑functional environment; demonstrate ownership of well-scoped tasks and follow through to completion.

## What You Bring
- Master’s degree with 2+ years of experience, or Ph.D. with 1+ years of experience, in a data science or analytics role; or equivalent practical experience.
- Proficiency in at least one general-purpose programming language used in data science (**Python** or **Scala**).
- Solid experience writing and optimizing **SQL** for large datasets; comfort working in data lake / warehouse environments.
- Hands‑on experience with **Spark** or **PySpark** and common ML libraries (e.g., **scikit-learn**, **XGBoost**); **TensorFlow**/**PyTorch** experience is a plus.
- Familiarity with **UNIX** environments and the **AWS** ecosystem (e.g., **EMR**, **S3**); **Databricks** experience is a plus.
- Working knowledge of supervised/unsupervised ML and basic statistics (similarity measures, clustering, evaluation metrics).
- Exposure to graph techniques or graph databases (**Neo4j**, **AWS Neptune**, **GraphFrames**) is a strong plus.
- Bonus: experience with **Elasticsearch** or **DynamoDB**; workflow tools such as **Airflow** for automating data pipelines.
**Apply:** https://hotfix.jobs/jobs/data-scientist-ii-big-data-r-d-identity-graph-kyc-at-socure-385c8bb3-71ae-4906-a060-ee5220b8a889
**Canonical:** https://hotfix.jobs/jobs/data-scientist-ii-big-data-r-d-identity-graph-kyc-at-socure-385c8bb3-71ae-4906-a060-ee5220b8a889