Overview
Position: Data Engineer - AWS & PySpark
Location: Nagpur/Pune
Type of Employment: Full-Time
Purpose of the Position: You will be a critical member of the InfoCepts Cloud Data Architect Team. We are seeking an experienced Data Engineer with strong expertise in Databricks, PySpark, AWS, and Python to design and deliver scalable data pipelines, high-performance ETL frameworks, and reliable data solutions. The ideal candidate has a solid understanding of distributed data processing, cloud architecture, and modern data engineering best practices.
Key Result Areas and Activities:
Data Engineering & ETL Development
- Design, build, and optimize ETL/ELT pipelines using PySpark/Scala and Databricks in large-scale distributed data environments (a minimal PySpark sketch follows this list).
- Develop reusable data ingestion frameworks, transformation modules, and feature engineering pipelines.
- Ensure high-quality data processing with robust data validation, error handling, and observability.
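As a rough illustration of this kind of pipeline, here is a minimal PySpark sketch covering ingestion, transformation, validation, and a Delta write. The bucket paths, column names, and validation rule are hypothetical placeholders, not references to any actual InfoCepts codebase.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-etl").getOrCreate()

def ingest(path: str) -> DataFrame:
    # Read raw CSV input; schema handling is kept simple for the sketch.
    return spark.read.option("header", True).csv(path)

def transform(df: DataFrame) -> DataFrame:
    # Basic cleansing: drop rows missing keys and normalize types.
    return (
        df.dropna(subset=["order_id", "amount"])
          .withColumn("amount", F.col("amount").cast("double"))
          .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    )

def validate(df: DataFrame) -> DataFrame:
    # Fail fast when a rule is violated (a simple observability hook).
    bad = df.filter(F.col("amount") < 0).count()
    if bad:
        raise ValueError(f"{bad} rows failed the non-negative amount check")
    return df

# Hypothetical S3 locations for illustration only.
raw = ingest("s3://example-bucket/raw/orders/")
(validate(transform(raw))
    .write.format("delta")
    .mode("append")
    .save("s3://example-bucket/curated/orders/"))
```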
Databricks Platform Engineering
- Work extensively with the Databricks Lakehouse platform: clusters, notebooks, Delta Lake, MLflow, jobs, and workflows.
- Implement best practices for Delta Lake, including schema evolution, time travel, vacuuming, Z-Ordering, partitioning, and optimization (see the maintenance sketch after this list).
- Collaborate on job orchestration using Databricks Workflows, the Jobs API, or Airflow.
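As a concrete example of those practices, a short sketch of routine Delta Lake maintenance on Databricks follows. The table path, Z-Order column, version number, and retention window are assumptions made for illustration.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

table_path = "s3://example-bucket/curated/orders"  # hypothetical location

# Compact small files and co-locate rows on a frequently filtered column.
spark.sql(f"OPTIMIZE delta.`{table_path}` ZORDER BY (customer_id)")

# Remove files no longer referenced by table versions older than 7 days.
DeltaTable.forPath(spark, table_path).vacuum(retentionHours=168)

# Time travel: read the table as it looked at an earlier version.
previous = spark.read.format("delta").option("versionAsOf", 10).load(table_path)

# Schema evolution: appending a DataFrame with a new column merges it
# into the target schema when mergeSchema is enabled.
extra = previous.withColumn("channel", F.lit("backfill"))
(extra.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(table_path))
```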
AWS Cloud Engineering
- Build and maintain data pipelines leveraging AWS services such as S3, Glue, Lambda, IAM, Step Functions, Athena, Redshift or Snowflake, and CloudWatch.
- Implement secure data architectures following IAM, networking, encryption, and cost-optimized design principles.
- Integrate Databricks with AWS data sources and event-driven systems (an ingestion sketch follows this list).
- Working knowledge of open table formats (OTFs) such as Delta Lake and Apache Iceberg.
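One way such an integration commonly looks in practice is Databricks Auto Loader picking up new JSON files as they land in S3 (event-driven when file notification mode is configured on the bucket). The bucket names and checkpoint location below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-event-ingest").getOrCreate()

# Hypothetical bucket and path names for illustration.
source = "s3://example-events-bucket/landing/"
target = "s3://example-lake-bucket/bronze/events/"
checkpoint = "s3://example-lake-bucket/_checkpoints/events/"

# Auto Loader ("cloudFiles") incrementally discovers new files in S3.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)
    .load(source)
)

# Write to a bronze Delta table; availableNow processes the backlog and
# stops, so the same job can run on a schedule or be triggered by events.
(stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)
    .start(target))
```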
Programming & Data Processing
- Write high-quality, production-grade Python code (modular, optimized, reusable).
- Develop PySpark jobs for batch and near real-time data transformations.
- Optimize Spark performance (partitions, broadcast variables, caching, cluster tuning); a tuning sketch follows this list.
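A brief sketch of common tuning levers, using hypothetical fact and dimension tables: broadcasting the small side of a join to avoid a shuffle, caching a DataFrame reused by several aggregations, and repartitioning to control output file counts.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Hypothetical tables: a large fact and a small dimension.
facts = spark.read.format("delta").load("s3://example-bucket/curated/orders")
dims = spark.read.format("delta").load("s3://example-bucket/curated/customers")

# Broadcast the small dimension so the join avoids shuffling the fact table.
joined = facts.join(F.broadcast(dims), "customer_id")

# Cache once, since two aggregations below reuse the same joined data.
joined.cache()

daily = joined.groupBy("order_date").agg(F.sum("amount").alias("revenue"))
by_region = joined.groupBy("region").agg(F.count("*").alias("orders"))

# Repartition before writing to control the number of output files.
daily.repartition(8).write.format("delta").mode("overwrite").save(
    "s3://example-bucket/gold/daily_revenue"
)
by_region.repartition(4).write.format("delta").mode("overwrite").save(
    "s3://example-bucket/gold/orders_by_region"
)
```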
Data Architecture, Governance & Quality
- Contribute to the design of data models, storage layers, and data lifecycle management.
- Implement best practices for data governance, metadata management, and lineage tracking.
- Ensure data reliability, performance, and accuracy across multiple environments (a data-quality sketch follows this list).
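As one hedged example of what ensuring accuracy can look like in code, a minimal rule-based quality check in PySpark follows; the rules, columns, and table location are hypothetical, and a production version would feed alerting and lineage tooling rather than simply raising.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.format("delta").load("s3://example-bucket/curated/orders")

# Rule set: each entry maps a rule name to a per-row boolean condition.
rules = {
    "order_id_not_null": F.col("order_id").isNotNull(),
    "amount_non_negative": F.col("amount") >= 0,
}

# Count violations for every rule in a single pass over the data.
failures = df.agg(*[
    F.sum(F.when(~cond, 1).otherwise(0)).alias(name)
    for name, cond in rules.items()
]).first().asDict()

for name, count in failures.items():
    if count:
        raise ValueError(f"Data quality rule '{name}' failed for {count} rows")
```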
Cross-Functional Collaboration
- Partner with analysts, data scientists, product teams, and business stakeholders to understand requirements.
- Document workflows, maintain Git-based version control, and participate in architecture reviews.
- Support production pipelines, troubleshoot issues, and continuously enhance system performance.