Overview
We are looking for a hands-on Data Engineer / AI Data Pipeline Engineer to join our growing engineering team. You'll work on cutting-edge AI-powered data enrichment, taxonomy validation, and scalable reporting frameworks across large-scale retail and enterprise datasets. The role sits at the intersection of data engineering and applied LLMs, and requires strong skills in Python, SQL, AWS cloud services, modern ETL architecture, and LLM-powered automation workflows.
Experience: 3–4 years
Location: Rajarhat-Newtown (Kolkata)
Employment Type: Full-time, Onsite
Timing: Must be able to work US Eastern time zone hours. Depending on project needs, this may be relaxed to a split of half-day IST and half-day US Eastern hours.
Documents: Must have an Aadhaar card, verifiable education certificates, letters from past employers (if applicable), and criminal background clearance.

Key Skills Required:

AI-Powered Taxonomy Audit & Enrichment:
- Design and develop scalable, AI-driven taxonomy audit pipelines for retail store and brand data validation.
- Build automated workflows leveraging LLMs (GPT-4o / OpenAI APIs) for classification, enrichment, and ontology standardization, using Instructor and Pydantic for reliable structured outputs.
- Integrate web research and scraping systems (Serper API, ScrapingBee, html2text) to validate structured and unstructured data.
- Develop human-in-the-loop review workflows using Label Studio for confirm/edit/reject audit processes.
- Improve taxonomy coverage and entity-resolution accuracy through AI-assisted clustering and enrichment of unmapped transaction data.

Data Engineering & Pipeline Development:
- Build and maintain modular, reusable ETL/data pipeline frameworks.
- Refactor legacy reporting systems into modern, maintainable architectures with reusable SQL modules and query builders.
- Develop validation frameworks, logging systems, automated migration workflows, and configurable comparison contexts.
- Orchestrate workflows with Apache Airflow (DAGs, PythonOperator, XCom) and cloud-native AWS services.
- Ensure backward compatibility and production stability during migration initiatives.

Reporting & Cloud Infrastructure:
- Develop and optimize advanced SQL queries and reporting pipelines on Amazon Redshift / Redshift Serverless and PostgreSQL (RDS).
- Manage data workflows using AWS services including S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
- Monitor production pipelines, troubleshoot issues, and improve performance and reliability.
- Collaborate with cross-functional teams across Data Engineering, AI/ML, QA, and Product.

Required Skills & Experience:
- 3–4 years of experience in Python-based data engineering or backend engineering.
- Strong proficiency in Python, including pandas, requests, psycopg2, and boto3, with solid modular application development.
- Hands-on experience with Apache Airflow (DAGs, PythonOperator, XCom).
- Strong advanced SQL skills and a solid grasp of data warehousing concepts.
- Experience with Amazon Redshift and PostgreSQL.
- Sound understanding of ETL/data pipeline architecture and workflow orchestration.
- Hands-on experience with AWS services: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, and Secrets Manager.
- Experience integrating LLM APIs (GPT-4o / OpenAI) into production workflows.
- Familiarity with web scraping, search APIs, and data enrichment systems.
- Experience with Git/GitHub, Jira, and Confluence.
- Strong debugging, problem-solving, and analytical skills.

Good to Have:
- Experience with Instructor, Pydantic, or AI workflow orchestration frameworks.
- Exposure to Label Studio or other human-review annotation systems.
- Experience with AI-assisted entity resolution and taxonomy/ontology systems.
- Familiarity with scalable, modular ETL framework design.
- Background in retail transaction data or taxonomy/master-data management.

Tech Stack:
- Languages & Libraries: Python, advanced SQL, pandas, boto3, psycopg2, requests
- Orchestration: Apache Airflow
- AI / LLM: GPT-4o / OpenAI APIs, Instructor, Pydantic
- Data & Warehousing: Amazon Redshift / Redshift Serverless, PostgreSQL (RDS)
- AWS: S3, Lambda, Glue, CloudWatch, SSM Parameter Store, Secrets Manager
- Scraping & Search: Serper API, ScrapingBee, html2text
- Human Review: Label Studio
- Collaboration: Git/GitHub, Jira, Confluence

Preferred Candidate Profile:
- Self-driven, with end-to-end ownership of data workflows.
- Comfortable in fast-paced AI/data engineering environments.
- Strong communication and collaboration skills.
- Passionate about building scalable, AI-assisted automation systems.