
Overview
Role description
Key Responsibilities:
Design, develop, and automate scalable data processing workflows using Apache Airflow, PySpark, and Dataproc on Google Cloud Platform (GCP); a minimal workflow sketch follows this list.
Build and maintain robust ETL pipelines to handle structured and unstructured data from multiple sources and formats.
Manage and provision GCP resources including Dataproc clusters, serverless batches, Vertex AI instances, GCS buckets, and custom images.
Provide platform and pipeline support for analytics and product teams, resolving issues related to Spark, BigQuery, Airflow DAGs, and serverless workflows.
Collaborate with data scientists, data analysts, and other stakeholders to understand data requirements and deliver reliable solutions.
Deliver prompt and effective technical support to internal users for data-related queries and challenges.
Optimize and fine-tune data systems for performance, cost-efficiency, and reliability.
Conduct root cause analysis for recurring pipeline/platform issues and work with cross-functional teams to implement long-term solutions.
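For illustration only, here is a minimal sketch of the kind of workflow this role owns: an Airflow DAG (Airflow 2.4+ assumed) that submits a PySpark job to an existing Dataproc cluster. The project ID, region, cluster name, bucket, and script path are placeholders, not values from this posting.

# Hypothetical illustration: project, region, cluster, and GCS paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PROJECT_ID = "my-gcp-project"   # assumed placeholder
REGION = "us-central1"          # assumed placeholder
CLUSTER_NAME = "etl-cluster"    # assumed placeholder

# PySpark job spec pointing at a script staged in a GCS bucket.
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
}

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark job to the existing Dataproc cluster.
    run_transform = DataprocSubmitJobOperator(
        task_id="run_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )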
Must-Have Skills:
Strong programming expertise in Python and SQL
Deep hands-on experience with Apache Airflow (including Astronomer)
Strong experience with PySpark, SparkSQL, and Dataproc (a brief PySpark example follows this list)
Proven knowledge of and implementation experience with GCP data services:
BigQuery, Vertex AI, Pub/Sub, Cloud Functions, GCS
Strong troubleshooting skills related to data pipelines, Spark job failures, and cloud data environments
Familiarity with data modeling, ETL best practices, and distributed systems
Ability to support and optimize large-scale batch and streaming data processes
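As a hedged example of the PySpark, SparkSQL, and BigQuery skills listed above, the sketch below reads raw JSON from a GCS bucket, aggregates it with SparkSQL, and writes the result to BigQuery via the spark-bigquery connector (assumed to be available on the cluster). All bucket, dataset, and table names are illustrative.

# Minimal sketch, assuming the spark-bigquery connector is installed on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Read raw JSON landed in a GCS bucket (semi-structured source data).
orders = spark.read.json("gs://my-bucket/raw/orders/")
orders.createOrReplaceTempView("orders")

# Aggregate with SparkSQL.
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Write results to BigQuery, staging intermediate files in a temporary GCS bucket.
(daily_totals.write
    .format("bigquery")
    .option("table", "analytics.daily_order_totals")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())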
Good-to-Have Skills:
Experience with SQL dialects like HiveQL, PL/SQL, and SparkSQL
Exposure to serverless data processing and ML model deployment workflows (using Vertex AI); a short deployment sketch follows this list
Familiarity with Terraform or Infrastructure-as-Code (IaC) for provisioning GCP resources
Knowledge of data governance, monitoring, and cost control best practices on GCP
Previous experience in healthcare, retail, or BFSI (banking, financial services, and insurance) domains involving large-scale data platforms
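To illustrate the Vertex AI deployment exposure mentioned above, here is a minimal sketch using the google-cloud-aiplatform client: it registers a model artifact stored in GCS and deploys it to a managed endpoint for online prediction. The project, region, artifact URI, and serving container image are assumed placeholders.

# Illustrative only: project, region, artifact URI, and container image are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-gcp-project", location="us-central1")

# Upload a trained model artifact from GCS and register it in Vertex AI.
model = aiplatform.Model.upload(
    display_name="demand-forecast",
    artifact_uri="gs://my-bucket/models/demand-forecast/",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest"
    ),
)

# Deploy the registered model to a managed endpoint.
endpoint = model.deploy(machine_type="n1-standard-4")
print(endpoint.resource_name)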
Educational Qualification:
Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
Certifications in GCP Data Engineer, GCP Professional Cloud Architect, or Apache Spark are a plus
Skills
GCP, PySpark, Airflow