
Overview
About us:
For over 20 years, Smart Data Solutions has been partnering with leading payer organizations to provide automation and technology solutions enabling data standardization and workflow automation. The company brings a comprehensive set of turn-key services to handle all claims and claims-related information regardless of format (paper, fax, electronic), digitizing and normalizing for seamless use by payer clients. Solutions include intelligent data capture, conversion and digitization, mailroom management, comprehensive clearinghouse services and proprietary workflow offerings. SDS is headquartered just outside of St. Paul, MN, and leverages dedicated onshore and offshore resources as part of its service delivery model. The company counts over 420 healthcare organizations as clients, including multiple Blue Cross Blue Shield state plans, large regional health plans and leading independent TPAs, handling over 500 million transactions of varying types annually with a 98%+ customer retention rate. SDS has also invested meaningfully in automation and machine learning capabilities across its tech-enabled processes to drive scalability and greater internal operating efficiency while also improving client results.
SDS recently partnered with a leading growth-oriented investment firm, Parthenon Capital, to further accelerate expansion and product innovation.
Location: 6th Floor, Block 4A, Millenia Business Park, Phase II MGR Salai, Kandanchavadi, Perungudi Chennai 600096, India.
Smart Data Solutions is an equal opportunity employer.
All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, age, marital status, pregnancy, genetic information, or other legally protected status.
To perform this job successfully, an individual must be able to perform each essential duty satisfactorily. The requirements listed in this posting are representative of the knowledge, skill, and/or ability required. Reasonable accommodation may be made to enable individuals with disabilities to perform essential job functions.
Due to access to Protected Healthcare Information, employees in this role must be free of felony convictions on a background check report.
Duties and Responsibilities include but are not limited to:
- Design and build ML pipelines for OCR extraction, document image processing, and text classification tasks.
- Fine-tune or prompt large language models (LLMs) (e.g., Qwen, GPT, LLaMA, Mistral) for domain-specific use cases.
- Develop systems to extract structured data from scanned or unstructured documents (PDFs, images, TIFs).
- Integrate OCR engines (Tesseract, EasyOCR, AWS Textract, etc.) and improve their accuracy via pre-/post-processing.
- Handle natural language processing (NLP) tasks such as named entity recognition (NER), summarization, classification, and semantic similarity.
- Collaborate with product managers, data engineers, and backend teams to productionize ML models.
- Evaluate models using metrics such as precision, recall, and F1-score, along with confusion matrices, and improve model robustness and generalizability.
- Maintain proper versioning, reproducibility, and monitoring of ML models in production.
Skills and Qualifications
- 4–5 years of experience in machine learning, NLP, or AI roles.
- Proficiency with Python and ML libraries such as PyTorch, TensorFlow, scikit-learn, and Hugging Face Transformers.
- Experience with LLMs (open-source or proprietary), including fine-tuning or prompt engineering.
- Solid experience in OCR tools (Tesseract, PaddleOCR, etc.) and document parsing.
- Strong background in text classification, tokenization, and vectorization techniques (TF-IDF, embeddings, etc.).
- Knowledge of handling unstructured data (text, scanned images, forms).
- Familiarity with MLOps tools: MLflow, Docker, Git, and model serving frameworks.
- Ability to write clean, modular, and production-ready code.
- Experience working with medical, legal, or financial document processing.
- Exposure to vector databases (e.g., FAISS, Pinecone, Weaviate) and semantic search.
- Understanding of document layout analysis (e.g., LayoutLM, Donut, DocTR).
- Familiarity with cloud platforms (AWS, GCP, Azure) and deploying models at scale.