Key Responsibilities
Data Pipeline Development
- Design, build, test, and maintain data pipelines to ingest, transform, harmonize, and integrate diverse biomedical and research data sources, including clinical, genomic, experimental, imaging, biospecimen, operational, and other scientific datasets
- Develop reusable transformation logic and curated datasets that support analytics, reporting, dashboards, applications, APIs, and downstream research workflows
Data Integration and Lifecycle Support
- Support the full research data lifecycle by enabling reliable data movement from source systems and storage environments into structured, analysis-ready formats
- Assist with data ingestion, curation, metadata capture, data refreshes, source-to-target mapping, schema management, and long-term maintainability of data products and workflows
Collaboration
- Work closely with data scientists, bioinformaticians, researchers, application developers, project managers, and government stakeholders to gather requirements and deliver practical data solutions
- Translate scientific and operational data needs into technical specifications, data models, transformation logic, and reusable datasets
Quality and Governance
- Implement data validation checks, reconciliation routines, testing practices, and monitoring processes to ensure data accuracy, completeness, consistency, and integrity
- Follow data governance and security best practices, including documentation of transformations, lineage, assumptions, access requirements, and compliance considerations
Dashboarding and Integration
- Create or support interactive dashboards, reporting layers, APIs, and application-ready datasets
- Support integration between data pipelines, databases, cloud platforms, analytics environments, and approved application platforms
Operational Support and Modernization
- Troubleshoot data pipeline failures, source system inconsistencies, data quality issues, schema changes, access issues, and performance bottlenecks
- Contribute to modernization efforts by improving automation, documentation, scalability, reproducibility, and platform readiness
Required Qualifications
- Bachelor's degree in Computer Science, Data Science, Bioinformatics, Biomedical Informatics, Information Systems, Engineering, or a related field, or equivalent practical experience
- Proven experience as a Data Engineer, Analytics Engineer, Data Integration Developer, or Bioinformatics Engineer, or in a similar data-intensive role
- Strong proficiency in Python and SQL for data manipulation, transformation, scripting, automation, and analysis
- Hands-on experience building ETL/ELT processes and data pipelines to support large, complex, multi-source datasets
- Familiarity with scalable data processing approaches, including Spark/PySpark or similar frameworks
- Solid understanding of data modeling, relational databases, data warehouses, data lakes, and metadata management
- Ability to work with complex, multi-modal datasets, including structured, semi-structured, and unstructured data
- Knowledge of software engineering and data engineering best practices, including version control using Git, code review, automated testing, documentation, and change management
- Experience ensuring data quality and applying lineage tracking, provenance tracking, audit trails, or documentation practices
- Excellent problem-solving skills and the ability to communicate effectively with both technical and non-technical stakeholders
- Strong interest in biomedical science, clinical research, healthcare data, and scientific discovery
- Demonstrated awareness of sensitive data handling, privacy, access control, data governance, and regulatory or compliance expectations
Preferred Qualifications
- Hands-on experience building data solutions on modern data platforms or platform-as-a-service environments, such as Snowflake, Databricks, Palantir, cloud data warehouses, or data lakes
- Experience supporting integrations across databases, cloud storage, APIs, analytics platforms, dashboards, and application environments
- Experience preparing curated datasets for dashboards, APIs, web applications, reporting tools, notebooks, or scientific computing environments
- Familiarity with research-facing tools and platforms such as Posit Connect, R/Shiny, Streamlit, Jupyter, Galaxy, Code