Data Engineer

Design, build, and maintain data pipelines for biomedical and clinical research datasets. Work with scientists and researchers to deliver accessible, well-governed data products using Python, SQL, and ETL/ELT processes.

Rockville, MD · Data Engineering · Onsite

About the role

Key Responsibilities

Data Pipeline Development

  • Design, build, test, and maintain data pipelines to ingest, transform, harmonize, and integrate diverse biomedical and research data sources, including clinical, genomic, experimental, imaging, biospecimen, operational, and other scientific datasets
  • Develop reusable transformation logic and curated datasets that support analytics, reporting, dashboards, applications, APIs, and downstream research workflows

Data Integration and Lifecycle Support

  • Support the full research data lifecycle by enabling reliable data movement from source systems and storage environments into structured, analysis-ready formats
  • Assist with data ingestion, curation, metadata capture, data refreshes, source-to-target mapping, schema management, and long-term maintainability of data products and workflows

Collaboration

  • Work closely with data scientists, bioinformaticians, researchers, application developers, project managers, and government stakeholders to gather requirements and deliver practical data solutions
  • Translate scientific and operational data needs into technical specifications, data models, transformation logic, and reusable datasets

Quality & Governance

  • Implement data validation checks, reconciliation routines, testing practices, and monitoring processes to ensure data accuracy, completeness, consistency, and integrity
  • Follow data governance and security best practices, including documentation of transformations, lineage, assumptions, access requirements, and compliance considerations

Dashboarding & Integration

  • Create or support interactive dashboards, reporting layers, APIs, and application-ready datasets
  • Support integration between data pipelines, databases, cloud platforms, analytics environments, and approved application platforms

Operational Support and Modernization

  • Troubleshoot data pipeline failures, source system inconsistencies, data quality issues, schema changes, access issues, and performance bottlenecks
  • Contribute to modernization efforts by improving automation, documentation, scalability, reproducibility, and platform readiness

Required Qualifications

  • Bachelor's degree in Computer Science, Data Science, Bioinformatics, Biomedical Informatics, Information Systems, Engineering, or a related field, or equivalent practical experience
  • Proven experience as a Data Engineer, Analytics Engineer, Data Integration Developer, Bioinformatics Engineer, or similar data-intensive role
  • Strong proficiency in Python and SQL for data manipulation, transformation, scripting, automation, and analysis
  • Hands-on experience building ETL/ELT processes and data pipelines to support large, complex, multi-source datasets
  • Familiarity with scalable data processing approaches, including Spark/PySpark or similar frameworks
  • Solid understanding of data modeling, relational databases, data warehouses, data lakes, metadata, and database concepts
  • Ability to work with complex, multi-modal datasets, including structured, semi-structured, and unstructured data
  • Knowledge of software engineering and data engineering best practices, including version control using Git, code review, automated testing, documentation, and change management
  • Experience ensuring data quality and using lineage, provenance tracking, audit trails, or documentation practices
  • Excellent problem-solving skills and the ability to communicate effectively with both technical and non-technical stakeholders
  • Strong interest in biomedical science, clinical research, healthcare data, and scientific discovery
  • Demonstrated awareness of sensitive data handling, privacy, access control, data governance, and regulatory or compliance expectations

Preferred Qualifications

  • Hands-on experience building data solutions in modern data platforms or platform-as-a-service environments such as Snowflake, Databricks, Palantir, cloud data warehouses, data lakes, or similar platforms
  • Experience supporting integrations across databases, cloud storage, APIs, analytics platforms, dashboards, and application environments
  • Experience preparing curated datasets for dashboards, APIs, web applications, reporting tools, notebooks, or scientific computing environments
  • Familiarity with research-facing tools and platforms such as Posit Connect, R/Shiny, Streamlit, Jupyter, Galaxy, Code

Skills

Python, SQL, ETL, ELT, Spark, PySpark, Git, Snowflake, Databricks, Data Modeling
