Mysore, Karnataka, India
Information Technology
Full-Time
RCM Business Solutions
Job Overview
We are seeking a highly skilled and motivated LLM Evaluation Framework Developer to design, build, and maintain robust frameworks for evaluating large language models (LLMs). You will work closely with ML researchers, engineers, and product teams to define metrics, automate evaluations, integrate datasets, and ensure model behaviour aligns with safety, quality, and performance expectations.
Key Responsibilities
- Design and implement evaluation frameworks for benchmarking LLMs across dimensions such as accuracy, robustness, reasoning, safety, and hallucination.
- Integrate and customize tools such as Giskard, RAGAS, DeepEval, Opik/Comet, TruLens, or similar.
- Define and implement custom metrics for specific use cases such as RAG, agent performance, and guardrails compliance (a minimal sketch follows this list).
- Curate or generate high-quality evaluation datasets for various domains (e.g., medical, finance, legal, general QA, code generation).
- Collaborate with LLM application developers to instrument tracing and logging that capture model behaviour in real-world flows.
- Implement dashboarding and reporting to visualize performance trends, regressions, and comparisons across model versions.
- Evaluate model responses using structured prompts, chain-of-thought techniques, adversarial tests, and A/B comparisons.
- Support red-teaming and stress-testing efforts to identify vulnerabilities or ethical risks in model outputs.
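For illustration, a custom metric of the kind this role would own might look like the framework-agnostic sketch below. All names in it (`EvalCase`, `token_overlap_faithfulness`, `run_suite`) are hypothetical and are not part of Giskard, RAGAS, DeepEval, or any other tool named above.

```python
# Framework-agnostic sketch of a custom RAG faithfulness metric.
# EvalCase, token_overlap_faithfulness, and run_suite are hypothetical
# names for illustration only.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    answer: str          # model output under evaluation
    contexts: list[str]  # passages retrieved for a RAG flow

def token_overlap_faithfulness(case: EvalCase) -> float:
    """Fraction of answer tokens grounded in the retrieved context.

    A crude lexical proxy: real frameworks typically use LLM-as-judge
    scoring, but the metric interface looks much the same.
    """
    answer_tokens = set(case.answer.lower().split())
    context_tokens = set(" ".join(case.contexts).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

def run_suite(cases: list[EvalCase], threshold: float = 0.7) -> dict:
    """Score every case and report the pass rate against a threshold."""
    if not cases:
        return {"pass_rate": 0.0, "scores": []}
    scores = [token_overlap_faithfulness(c) for c in cases]
    return {"pass_rate": sum(s >= threshold for s in scores) / len(cases),
            "scores": scores}

if __name__ == "__main__":
    demo = [EvalCase(
        question="What is the capital of France?",
        answer="Paris is the capital of France.",
        contexts=["Paris is the capital and largest city of France."],
    )]
    print(run_suite(demo))
```

Running the file prints the suite's pass rate for the demo case; in production the lexical check would be swapped for a judge model or one of the tools listed above.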
Required Skills & Qualifications
Core Technical Skills:
- Proficiency in Python with experience in NLP and ML/LLM libraries (e.g., Hugging Face, LangChain, OpenAI SDK, Cohere).
- Experience building evaluation pipelines or benchmarks for ML/LLM systems.
- Familiarity with RAG evaluation, agentic evaluation, safety/guardrail testing, and LLM performance metrics.
- Strong grasp of prompt engineering, retrieval techniques, and generative model behaviour.
- Hands-on experience with evaluation tooling such as Giskard, RAGAS, DeepEval, TruLens, LangSmith, Opik/Comet, Weights & Biases, or similar.
- Working knowledge of vector stores (e.g., FAISS, Weaviate, Pinecone) and embedding-based evaluation.
- Familiarity with CI/CD pipelines and with unit and integration testing for LLM apps (see the sketch after this list).
- Understanding of data versioning, model versioning, and test reproducibility.
- Prior experience developing or maintaining LLM-based applications (chatbots, copilots, RAG systems).
- Background in ML research, applied NLP, or machine learning infrastructure.
- Exposure to LLM guardrails design (e.g., jailbreaking prevention, content filtering).
- Experience with open-source contribution in the LLM evaluation or tooling space.
- Strong communication and documentation abilities.
- Comfort working in ambiguous, fast-paced, and research-heavy environments.
- Passion for ensuring LLM reliability, safety, and responsible deployment.
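As a hedged illustration of the CI/CD point above, the pytest sketch below gates builds on a small golden set. `generate_answer` is a hypothetical stand-in for the application under test, and `GOLDEN_CASES` is toy inline data; neither comes from a specific library.

```python
# Hypothetical CI regression gate for an LLM app, written with pytest.
import pytest

GOLDEN_CASES = [
    {"question": "What is 2 + 2?", "must_contain": ["4"]},
    {"question": "Name the capital of France.", "must_contain": ["paris"]},
]

def generate_answer(question: str) -> str:
    """Stand-in for the application under test (chatbot, copilot, RAG pipeline)."""
    raise NotImplementedError("wire this to your LLM application")

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_answer_contains_required_facts(case):
    # A failing assertion here fails the CI pipeline, blocking a new
    # model or prompt version from shipping with a regression.
    answer = generate_answer(case["question"]).lower()
    for fact in case["must_contain"]:
        assert fact in answer
```

In practice the golden set would live in a versioned dataset file rather than inline, so results stay reproducible across model versions.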