Overview
Project description
We are looking for a skilled Document AI / NLP Engineer to develop intelligent systems that extract meaningful data from documents such as PDFs, scanned images, and forms. In this role, you will build document processing pipelines using OCR and NLP technologies, fine-tune ML models for tasks like entity extraction and classification, and integrate those solutions into scalable cloud-based applications.
You will collaborate with cross-functional teams to deliver high-performance, production-ready pipelines and stay up to date with advancements in the document understanding and machine learning space.
Responsibilities
Design, build, and optimize document parsing pipelines using tools like Amazon Textract, Azure Form Recognizer, or Google Document AI.
Perform data preprocessing, labeling, and annotation for training machine learning and NLP models.
Fine-tune or train models for tasks such as Named Entity Recognition (NER), text classification, and layout understanding using PyTorch, TensorFlow, or Hugging Face Transformers.
Integrate document intelligence capabilities into larger workflows and applications using REST APIs, microservices, and cloud components (e.g., AWS Lambda, S3, SageMaker).
Evaluate model and OCR accuracy, applying post-processing techniques or heuristics to improve precision and recall.
Collaborate with data engineers, DevOps, and product teams to ensure solutions are robust, scalable, and meet business KPIs.
Monitor, debug, and continuously enhance deployed document AI solutions.
Maintain up-to-date knowledge of industry trends in OCR, Document AI, NLP, and machine learning.
Skills
Must have
4-5 years of hands-on experience in machine learning, document AI, or NLP-focused roles.
Strong expertise in OCR and document understanding tools, especially Amazon Textract, Azure Form Recognizer, Google Document AI, or open-source options such as Tesseract, PaddleOCR, or LayoutLM.
Solid programming skills in Python and familiarity with ML/NLP libraries: scikit-learn, spaCy, transformers, PyTorch, TensorFlow, etc.
Experience working with structured and unstructured data formats, including PDF, images, JSON, and XML.
Hands-on experience with REST APIs, microservices, and integrating ML models into production pipelines.
Working knowledge of cloud platforms, especially AWS (S3, Lambda, SageMaker) or equivalent services on other providers.
Understanding of NLP techniques such as NER, text classification, and language modeling.
Strong debugging, problem-solving, and analytical skills.
Clear verbal and written communication skills for technical and cross-functional collaboration.
Nice to have
N/A
Other
Languages
English: B2 Upper Intermediate
Seniority
Senior
Chennai, India
Req. VR-116250
AI/ML
BCM Industry
28/07/2025