Overview
We are looking for a Data Engineer to build and operate cloud-based data platforms for Life Sciences and Pharma analytics, spanning clinical trial, laboratory, and commercial data, with hands-on pipeline development and data governance in regulated environments.
Key Responsibilities
• Build and optimize data ingestion, transformation, and integration pipelines across multiple sources — clinical trials, EHR/EMR, laboratory systems, and commercial platforms.
• Implement data lakes and data warehouses using modern cloud technologies (Azure, AWS, or GCP).
• Develop and manage ETL/ELT workflows using tools such as Databricks, Azure Data Factory, or AWS Glue (a minimal PySpark sketch follows this list).
• Ensure data quality, lineage, and governance aligned with compliance frameworks (HIPAA, GxP, GDPR).
• Collaborate with data scientists and analytics teams to create reusable data models and feature stores.
• Optimize data access and performance for analytical workloads and visualization tools.
• Automate deployments and monitoring using CI/CD pipelines and tooling (Git, Jenkins, Azure DevOps).
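
To make the ETL/ELT responsibility concrete, here is a minimal PySpark sketch of an ingest-and-curate step, assuming a hypothetical CSV landing zone of laboratory results. The paths and column names (specimen_id, result_value, collected_at, lab_code) are illustrative assumptions, not a prescribed design.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lab-ingest").getOrCreate()

    # Hypothetical raw landing zone; a header-bearing CSV is assumed.
    raw = (spark.read
           .option("header", True)
           .csv("s3://example-bucket/raw/lab_results/"))

    # Standardize types, drop duplicate specimens, discard null results.
    clean = (raw
             .withColumn("result_value", F.col("result_value").cast("double"))
             .withColumn("collected_at", F.to_timestamp("collected_at"))
             .dropDuplicates(["specimen_id"])
             .filter(F.col("result_value").isNotNull()))

    # Write a curated, partitioned copy for analytical workloads.
    (clean.write
          .mode("overwrite")
          .partitionBy("lab_code")
          .parquet("s3://example-bucket/curated/lab_results/"))

The same shape carries over to Databricks, Azure Data Factory, or AWS Glue jobs; only the orchestration around the transform changes.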
Required Skills & Experience
• Minimum 3 years of experience in data engineering or related roles.
• Strong programming expertise in Python, PySpark, or Scala.
• Proven experience with SQL and big-data frameworks (Spark, Hadoop, Kafka).
• Hands-on experience with cloud-based data platforms such as Azure Data Factory, Databricks, AWS Glue, Snowflake, or GCP Dataflow.
• Solid understanding of dimensional data modeling techniques (star and snowflake schemas).
• Exposure to Life Sciences/Pharma datasets such as clinical trials, bioinformatics, or patient data models (CDISC, HL7, FHIR); a FHIR flattening sketch follows this list.
• Knowledge of data security and compliance in regulated environments.
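
As a small illustration of the FHIR point above, the sketch below flattens a FHIR R4 Patient resource into a single row suitable for a patient dimension. The output column names are an assumption for illustration; real mappings would follow the project's data model.

    import json

    def flatten_patient(resource: dict) -> dict:
        # Take the first recorded name, if any.
        name = (resource.get("name") or [{}])[0]
        return {
            "patient_id": resource.get("id"),
            "family_name": name.get("family"),
            "given_name": " ".join(name.get("given", [])),
            "gender": resource.get("gender"),
            "birth_date": resource.get("birthDate"),
        }

    sample = json.loads("""
    {"resourceType": "Patient", "id": "example",
     "name": [{"family": "Chalmers", "given": ["Peter", "James"]}],
     "gender": "male", "birthDate": "1974-12-25"}
    """)
    print(flatten_patient(sample))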
Good to Have
• Experience with real-world evidence (RWE) or pharma commercial analytics datasets.
• Familiarity with machine learning data preparation pipelines (a scikit-learn sketch follows this list).
• Knowledge of data visualization tools (Power BI, Tableau).
• Cloud or data engineering certifications (Azure, AWS, GCP, Snowflake).
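
For the machine-learning data-preparation point above, here is a minimal scikit-learn sketch of a reusable preprocessing pipeline: median imputation and scaling for numeric columns, mode imputation and one-hot encoding for categorical ones. The column names (age, lab_value, site_id, arm) are hypothetical stand-ins for study data.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = ["age", "lab_value"]      # continuous features
    categorical = ["site_id", "arm"]    # categorical features

    prep = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])

    df = pd.DataFrame({"age": [54, np.nan, 61], "lab_value": [1.2, 0.8, np.nan],
                       "site_id": ["A", "B", "A"], "arm": ["treat", np.nan, "control"]})
    X = prep.fit_transform(df)    # feature matrix ready for model training
    print(X.shape)

Because the pipeline is a single fitted object, the same transformations can be applied consistently at training and inference time.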