Evaluation Metrics
Measure the quality of your LLM responses with built-in and custom evaluation metrics.
# Why Evaluation?
LLM outputs are non-deterministic and can vary in quality. Evaluation metrics help you measure and track the quality of your RAG system, detect regressions, and compare different configurations. Orka provides both built-in metrics and the ability to create custom ones.
```ts
import { createOrka, OpenAIAdapter, MemoryVectorDB } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorDB(),
});

// Define your evaluation dataset
const dataset = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'A TypeScript framework for building LLM applications.',
    knowledge: 'docs', // Knowledge base to use for RAG
  },
  {
    input: 'How do I install Orka?',
    expectedOutput: 'Run npm install orkajs',
    knowledge: 'docs',
  },
];

// Run evaluation with multiple metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
});

console.log(summary.metrics);
// {
//   relevance: { average: 0.95, min: 0.9, max: 1.0 },
//   correctness: { average: 0.88, min: 0.85, max: 0.92 },
//   faithfulness: { average: 0.92, min: 0.88, max: 0.96 },
//   hallucination: { average: 0.05, min: 0.0, max: 0.1 },
// }

console.log(summary.results); // Detailed per-case results
console.log(summary.passed);  // true if all thresholds met
```

# Built-in Metrics
Orka provides five built-in metrics that cover the most important aspects of RAG quality:
**relevance** (higher is better): Measures how relevant the generated answer is to the original question. Uses LLM-as-judge to score from 0 (completely irrelevant) to 1 (perfectly relevant).
Target: > 0.8 for production systems
**correctness** (higher is better): Compares the generated answer to the expected output using semantic similarity, accounting for paraphrasing and different wordings that convey the same meaning.
Target: > 0.7 (allows for paraphrasing)
**faithfulness** (higher is better): Measures whether the answer is grounded in the retrieved context. A faithful answer contains only information that can be traced back to the source documents.
Target: > 0.85 for RAG systems
**hallucination** (lower is better): Detects information in the answer that is not present in the provided context. This is the inverse of faithfulness and specifically flags fabricated content.
Target: < 0.15 for production systems
**cost** (informational): Tracks token consumption per evaluation case. Useful for monitoring costs and optimizing prompt lengths.
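Once an evaluation has run, the targets above can be checked programmatically against the `summary.metrics` shape returned by `orka.evaluate`. A minimal sketch (the `MetricStats` type and `meetsTargets` helper are illustrative, not part of the Orka API):

```ts
// Illustrative helper (not part of the Orka API): check metric averages
// against the targets listed above and report any misses.
type MetricStats = { average: number; min: number; max: number };

const targets: Array<[string, (s: MetricStats) => boolean]> = [
  ['relevance', (s) => s.average > 0.8],
  ['correctness', (s) => s.average > 0.7],
  ['faithfulness', (s) => s.average > 0.85],
  ['hallucination', (s) => s.average < 0.15], // lower is better
];

function meetsTargets(metrics: Record<string, MetricStats>): string[] {
  // Returns the names of metrics that miss their target.
  return targets
    .filter(([name, check]) => name in metrics && !check(metrics[name]))
    .map(([name]) => name);
}
```

An empty result means every reported metric met its target; otherwise the returned names point you at the weak spots.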
# Custom Metrics
Create custom metrics for domain-specific quality checks. A custom metric is an async function that receives the evaluation context and returns a score.
```ts
import type { MetricFn } from 'orkajs';

// Custom metric: Check professionalism of tone
const toneCheck: MetricFn = async ({ input, output, context, llm }) => {
  const result = await llm.generate(
    `Rate the professionalism of this response on a scale of 0.0 to 1.0.
Question: ${input}
Response: ${output}
Reply with ONLY a number.`,
    { temperature: 0, maxTokens: 10 }
  );
  const score = parseFloat(result.content.trim());
  return {
    name: 'professionalism',
    score: isNaN(score) ? 0 : Math.min(1, Math.max(0, score)),
  };
};

// Custom metric: Check response length
const lengthCheck: MetricFn = async ({ output }) => {
  const wordCount = output.split(/\s+/).length;
  // Penalize very short or very long responses
  const idealLength = 50;
  const score = Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
  return { name: 'length_appropriateness', score };
};

// Use alongside built-in metrics
const summary = await orka.evaluate({
  dataset: [...],
  metrics: ['relevance', 'faithfulness', toneCheck, lengthCheck],
});
```

# MetricFn Context
| Property | Description |
| --- | --- |
| `input: string` | The original question/input from the dataset |
| `output: string` | The generated answer from the LLM |
| `expectedOutput?: string` | The expected answer from the dataset (if provided) |
| `context?: ChunkResult[]` | Retrieved chunks used to generate the answer |
| `llm: LLMAdapter` | LLM adapter for making additional calls (e.g., LLM-as-judge) |
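Custom metrics need not call the LLM at all. As a deterministic example that uses the retrieved `context`, here is a rough keyword-overlap sketch (the `grounding_overlap` name is hypothetical, and it assumes each chunk exposes a `content: string` field, which may differ from the actual `ChunkResult` shape):

```ts
// Hypothetical metric: fraction of answer words that also appear in the
// retrieved chunks. Assumes chunks expose `content: string` (adjust to
// your actual ChunkResult shape).
const groundingOverlap = async ({
  output,
  context,
}: {
  output: string;
  context?: { content: string }[];
}) => {
  if (!context || context.length === 0) {
    return { name: 'grounding_overlap', score: 0 };
  }
  const source = context.map((c) => c.content.toLowerCase()).join(' ');
  // Ignore very short words; a real implementation would tokenize properly.
  const words = output.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  if (words.length === 0) return { name: 'grounding_overlap', score: 1 };
  const grounded = words.filter((w) => source.includes(w)).length;
  return { name: 'grounding_overlap', score: grounded / words.length };
};
```

Because it makes no LLM calls, a check like this runs fast and reproducibly; it complements LLM-judged faithfulness rather than replacing it.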
# Evaluation Result
```ts
interface EvaluationSummary {
  metrics: {
    [metricName: string]: {
      average: number; // Average score across all cases
      min: number;     // Minimum score
      max: number;     // Maximum score
      stdDev: number;  // Standard deviation
    };
  };
  results: EvaluationResult[]; // Per-case detailed results
  passed: boolean;             // True if all thresholds met
  totalCases: number;          // Number of test cases
  totalLatencyMs: number;      // Total evaluation time
  totalTokens: number;         // Total tokens consumed
}
```

💡 Best Practices
- Use all four quality metrics for RAG: relevance, correctness, faithfulness, and hallucination
- A good RAG system should have faithfulness > 0.85 and hallucination < 0.15
- Create custom metrics for domain-specific quality checks (tone, length, format)
- Run evaluations in CI/CD to catch regressions before deployment
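For the CI/CD point above, a small sketch of a gate that turns an evaluation summary into a process exit code (the `evaluationExitCode` helper is illustrative; it relies only on the `passed` and `metrics` fields returned by `orka.evaluate`):

```ts
// Illustrative CI gate: return 0 when all thresholds are met, 1 otherwise,
// logging per-metric averages so a failing run is easy to diagnose.
function evaluationExitCode(summary: {
  passed: boolean;
  metrics: Record<string, { average: number }>;
}): number {
  if (summary.passed) return 0;
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.error(`${name}: average=${stats.average.toFixed(2)}`);
  }
  return 1;
}

// In a CI script: process.exit(evaluationExitCode(summary));
```

Wiring this into your pipeline makes a quality regression fail the build the same way a failing unit test would.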
# Tree-shaking Imports
```ts
// ✅ Import evaluation types
import type { MetricFn, EvaluationSummary } from 'orkajs';

// ✅ Import built-in metrics individually
import { relevance, correctness, faithfulness, hallucination } from 'orkajs/evaluation';
```