Overview
Position Summary:
As a Senior Data Engineer, you will play a pivotal role in designing, developing, and optimizing data pipelines and workflows that support large-scale data processing using Apache Spark (PySpark) and advanced SQL techniques. You will work closely with data analysts, data scientists, and platform engineers to build reliable and scalable data infrastructure. Your focus will be on transforming raw data into actionable insights while ensuring data quality, security, and performance.
Key Responsibilities:
Data Pipeline Development & Optimization:
- Build scalable and efficient ETL/ELT pipelines using PySpark, handling batch and real-time data workloads (a minimal sketch of such a pipeline follows this list).
- Write advanced SQL queries to cleanse, aggregate, and transform large datasets for analytical and operational use cases.
- Optimize performance of data processing jobs using Spark configurations, partitioning, and caching strategies.
- Maintain and improve existing data pipelines, refactoring code and improving performance as needed.
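To make the scope of this work concrete, here is a minimal sketch of the kind of batch pipeline involved, showing cleansing, caching, and partitioned output. The bucket paths, table layout, and column names are illustrative assumptions, not part of any actual stack.

```python
# Illustrative only: a minimal PySpark batch job with deduplication,
# caching, and partitioned Parquet output. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_daily_etl").getOrCreate()

# Read raw events (hypothetical S3 path), drop malformed rows, and deduplicate.
raw = spark.read.parquet("s3://example-bucket/raw/orders/")
clean = (
    raw.dropna(subset=["order_id", "order_ts"])
       .dropDuplicates(["order_id"])
)

# Cache the cleansed frame because it feeds both the count below and the aggregate.
clean.cache()
print(f"clean rows: {clean.count()}")

daily_totals = (
    clean.groupBy(F.to_date("order_ts").alias("order_date"))
         .agg(F.sum("amount").alias("total_amount"),
              F.countDistinct("customer_id").alias("unique_customers"))
)

# Repartition by date before writing so each date maps to one output partition.
(daily_totals
    .repartition("order_date")
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-bucket/curated/orders_daily/"))
```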
Data Architecture & Integration:
- Work with structured, semi-structured (JSON, Parquet, Avro), and unstructured data.
- Design and implement data models and schemas in cloud data warehouses (e.g., Snowflake, BigQuery, Redshift) and data lakes.
- Integrate data from various sources including relational databases (MySQL, PostgreSQL), APIs, streaming platforms (Kafka, Kinesis), and external data providers.
Collaboration & Strategy:
- Collaborate with data scientists and BI analysts to understand data needs and design solutions that meet performance and scalability requirements.
- Partner with DevOps and platform teams to deploy data pipelines and workflows using orchestration tools such as Apache Airflow or Prefect (an illustrative Airflow sketch follows this list).
- Participate in design reviews, code reviews, and engineering discussions to contribute to the team's best practices and technical direction.
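As an example of the orchestration work described above, here is a minimal Airflow DAG that schedules a nightly Spark job. The DAG id, script path, and schedule are hypothetical placeholders, not a prescribed setup.

```python
# Illustrative only: a minimal Airflow 2.x DAG that runs a nightly spark-submit.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        # spark-submit against the pipeline script; cluster configs omitted for brevity.
        bash_command="spark-submit /opt/pipelines/orders_daily_etl.py",
    )
```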
Monitoring, Testing & Governance:
- Monitor data pipeline performance, failures, and system health using logging and alerting tools.
- Implement data quality checks, testing strategies (unit, integration), and lineage tracking (a simple quality-check sketch follows this list).
- Support data governance, compliance, and documentation efforts by enforcing standards and creating metadata catalogs.
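For illustration, a simple post-load quality check might look like the sketch below; the table path, column names, and failure thresholds are assumptions made for the example.

```python
# Illustrative only: row-level quality checks run after a load completes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_quality_check").getOrCreate()

df = spark.read.parquet("s3://example-bucket/curated/orders_daily/")

# Fail loudly if the load produced no rows or key columns contain nulls.
row_count = df.count()
null_dates = df.filter(F.col("order_date").isNull()).count()

if row_count == 0:
    raise ValueError("Quality check failed: curated table is empty")
if null_dates > 0:
    raise ValueError(f"Quality check failed: {null_dates} rows with null order_date")
```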
Required Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, Information Systems, or a related technical field.
- 6+ years of professional experience in a Data Engineering role.
- Strong programming skills in Python with expertise in PySpark for distributed data processing.
- Advanced SQL skills: ability to write complex joins, CTEs, window functions, and performance-optimized queries (see the illustrative query after this list).
- Hands-on experience with Apache Spark in a cloud environment (Databricks, AWS EMR, GCP Dataproc, Azure Synapse).
- Familiarity with data lake architectures and cloud storage (S3, GCS, Azure Blob).
- Experience with version control systems (e.g., Git), CI/CD pipelines, and working in Agile/Scrum environments.
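To illustrate the level of SQL expected, here is the kind of query involved, using a CTE and a window function and run through spark.sql(). The orders table and its columns are hypothetical, and the query assumes that table is already registered in the catalog or as a temp view.

```python
# Illustrative only: CTE plus window function, executed via spark.sql().
# Assumes an "orders" table or temp view is already registered.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_example").getOrCreate()

query = """
WITH daily AS (                      -- CTE: aggregate orders per customer per day
    SELECT customer_id,
           CAST(order_ts AS DATE) AS order_date,
           SUM(amount)            AS daily_amount
    FROM orders
    GROUP BY customer_id, CAST(order_ts AS DATE)
)
SELECT customer_id,
       order_date,
       daily_amount,
       -- window function: 7-day running total per customer
       SUM(daily_amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_7d_amount
FROM daily
"""

result = spark.sql(query)
result.show()
```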
Preferred Qualifications:
- Experience with streaming technologies (Kafka, Spark Structured Streaming, Flink).
- Background in data modeling, data warehouse design, or data lakehouse architecture.
- Knowledge of cloud infrastructure and services (AWS, GCP, Azure).
- Familiarity with containerization tools (Docker, Kubernetes).
- Understanding of data governance, security, and privacy frameworks (GDPR, HIPAA, etc.).
Soft Skills:
- Excellent problem-solving and critical-thinking abilities.
- Strong communication skills and the ability to explain technical concepts to non-technical stakeholders.
- Ability to work independently and as part of a cross-functional team.
- Eagerness to learn and adapt to new technologies and methodologies.
Job Types: Full-time, Permanent
Pay: ₹560,330.55 - ₹1,941,152.15 per year
Schedule:
- Monday to Friday
Work Location: In person