Overview
Job Responsibilities:
· Design, develop, and execute automated evaluation suites and test cases specifically targeting AI/LLM components, focusing on aspects like response quality, factual accuracy, safety, and task completion.
· Implement and manage batch testing processes using curated datasets to assess model performance, identify regressions, and benchmark different model versions or prompts.
· Develop, maintain, and enhance test and evaluation frameworks using libraries such as Promptflow, DeepEval, Ragas, and similar LLM evaluation tools.
· Define and implement comprehensive test strategies to evaluate LLM outputs for accuracy, relevance, coherence, safety (toxicity, bias), hallucination, and consistency, using automated metrics and, where appropriate, qualitative review.
· Collaborate closely with developers, data scientists, and prompt engineers to understand model behavior and to identify edge cases, potential biases, and failure modes in AI models and agents.
· Test and validate components of Retrieval-Augmented Generation (RAG) pipelines, including retriever performance, chunking strategies, and generator quality.
· Evaluate the end-to-end functionality and performance of AI-driven workflows within telecom applications against defined benchmarks.
· Continuously research and improve testing methodologies and metrics for AI/LLM applications, incorporating industry best practices in automated evaluation and validation.
· Document evaluation results and findings, providing actionable feedback to development teams to enhance AI model robustness, reliability, and overall quality.
Job Type: Full-time
Pay: ₹9,486.26 - ₹49,562.64 per month
Benefits:
- Work from home
Schedule:
- Monday to Friday
Application Question(s):
- Do you have knowledge of AI/ML/LLM development?
Experience:
- Python: 2 years (Required)
- Selenium Automation: 2 years (Required)
- Promptflow, DeepEval, or Ragas: 1 year (Preferred)
- Machine learning/LLM: 2 years (Required)
- REST API: 2 years (Preferred)
- Test Strategy: 2 years (Preferred)
Work Location: Remote