Evaluation & Testing
Measure and improve the quality of your LLM outputs with built-in metrics.
Orka's evaluation system helps you systematically test your RAG pipelines and LLM outputs against ground truth data.
ORKA — EVALUATION PIPELINE

  Dataset ──→ Evaluation ──→ Results

  EvalCase[] (input + expectedOutput + knowledge)
      → Retrieve → Generate → actualOutput
      → Metrics Evaluation (sample scores: relevance 0.92, correctness 0.88,
        faithfulness 0.95, hallucination 0.05)
      → EvalSummary (totalCases: number, metrics: MetricStats, avgLatency: number)
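The shapes flowing through the pipeline above can be sketched as TypeScript types. These are inferred from the fields shown in the diagram, not copied from orkajs itself — the actual exported definitions may differ (for instance, the code examples below read the latency field as `summary.averageLatencyMs`):

```typescript
// Hypothetical type sketches inferred from the diagram above;
// the actual orkajs definitions may differ.
interface EvalCase {
  input: string;          // question sent to the pipeline
  expectedOutput: string; // ground-truth answer
  knowledge: string;      // name of the knowledge base to retrieve from
}

interface MetricStats {
  average: number;
  min: number;
  max: number;
}

interface EvalSummary {
  totalCases: number;
  metrics: Record<string, MetricStats>; // one entry per metric, e.g. 'relevance'
  avgLatency: number;                   // milliseconds
}

// A minimal case matching the dataset used throughout this page
const example: EvalCase = {
  input: 'What is Orka AI?',
  expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
  knowledge: 'docs',
};
```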
1. Define Evaluation Dataset
eval-dataset.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});

// Create a testing knowledge base
await orka.knowledge.create({
  name: 'docs',
  source: [
    'Orka AI is a TypeScript framework for LLM systems.',
    'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
    'Vector databases supported: Pinecone, Qdrant, Chroma, in-memory.',
    'Chunking is automatic with 1000 characters by default.',
  ],
});

// Define the evaluation dataset
const dataset: EvalCase[] = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
    knowledge: 'docs',
  },
  {
    input: 'What LLM providers are supported?',
    expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
    knowledge: 'docs',
  },
  {
    input: 'What vector databases can be used?',
    expectedOutput: 'Pinecone, Qdrant, Chroma and an in-memory adapter.',
    knowledge: 'docs',
  },
  {
    input: 'How does chunking work?',
    expectedOutput: 'Chunking is automatic with 1000 characters.',
    knowledge: 'docs',
  },
];

2. Run Evaluation
run-eval.ts
// Run the evaluation
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
  onResult: (result, index) => {
    const scores = result.metrics
      .map(m => `${m.name}=${m.score.toFixed(2)}`)
      .join(', ');
    console.log(`[${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
  },
});

console.log('\n📊 Summary:');
console.log(`  Total: ${summary.totalCases} cases`);
console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
console.log(`  Total tokens: ${summary.totalTokens}`);

console.log('\n  Metrics:');
for (const [name, stats] of Object.entries(summary.metrics)) {
  console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
}

3. Available Metrics
metrics.ts
// Built-in metrics
const metrics = [
  'relevance',     // Is the response relevant to the question?
  'correctness',   // Does the response match the expected output?
  'faithfulness',  // Is the response faithful to the provided context?
  'hallucination', // Does the response contain invented information?
  'coherence',     // Is the response coherent and well-structured?
  'conciseness',   // Is the response concise without unnecessary information?
];

// Use specific metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness'], // Only these 2 metrics
});

// Custom metrics
const customSummary = await orka.evaluate({
  dataset,
  metrics: ['relevance'],
  customMetrics: [
    {
      name: 'contains_keywords',
      evaluate: async (input, output, expected) => {
        const keywords = ['TypeScript', 'LLM', 'framework'];
        const found = keywords.filter(k => output.includes(k));
        return found.length / keywords.length;
      },
    },
  ],
});

4. Complete Example
evaluation-complete.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

async function main() {
  const orka = createOrka({
    llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
    vectorDB: new MemoryVectorAdapter(),
  });

  // Create a knowledge base
  await orka.knowledge.create({
    name: 'docs',
    source: [
      'Orka AI is a TypeScript framework for LLM systems.',
      'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
      'Vector databases: Pinecone, Qdrant, Chroma, in-memory.',
      'Chunking is automatic with 1000 characters by default.',
      'Orka AI includes an integrated evaluation system.',
    ],
  });

  // Evaluation dataset
  const dataset: EvalCase[] = [
    {
      input: 'What is Orka AI?',
      expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
      knowledge: 'docs',
    },
    {
      input: 'What LLM providers are supported?',
      expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
      knowledge: 'docs',
    },
    {
      input: 'What vector databases can be used?',
      expectedOutput: 'Pinecone, Qdrant, Chroma and in-memory.',
      knowledge: 'docs',
    },
    {
      input: 'How does chunking work?',
      expectedOutput: 'Chunking is automatic with 1000 characters.',
      knowledge: 'docs',
    },
  ];

  console.log('🧪 Evaluation in progress...\n');

  const summary = await orka.evaluate({
    dataset,
    metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
    onResult: (result, index) => {
      const scores = result.metrics
        .map(m => `${m.name}=${m.score.toFixed(2)}`)
        .join(', ');
      console.log(`  [${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
    },
  });

  console.log('\n📊 Summary:');
  console.log(`  Total: ${summary.totalCases} cases`);
  console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
  console.log(`  Total tokens: ${summary.totalTokens}`);

  console.log('\n  Metrics:');
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
  }

  await orka.knowledge.delete('docs');
}

main().catch(console.error);

Understanding Metrics
relevance (0-1)
Measures how relevant the answer is to the question asked.
correctness (0-1)
Compares the answer to the expected output for factual accuracy.
faithfulness (0-1)
Checks if the answer is grounded in the provided context.
hallucination (0-1)
Detects fabricated information not present in the context. Lower is better.
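The built-in metrics above are computed by orkajs itself. As a rough, standalone illustration of how a 0–1 score can be produced — in the spirit of the `contains_keywords` custom metric shown earlier — here is a simple token-overlap scorer. This is an illustrative sketch only, not how orkajs implements `correctness` or any other built-in metric:

```typescript
// Illustrative only: fraction of the expected answer's tokens that appear
// in the model output, giving a score in [0, 1]. Not the orkajs algorithm.
function tokenOverlapScore(output: string, expected: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter(Boolean);
  const expectedTokens = new Set(tokenize(expected));
  if (expectedTokens.size === 0) return 0;
  const outputTokens = new Set(tokenize(output));
  let hits = 0;
  for (const t of expectedTokens) {
    if (outputTokens.has(t)) hits++;
  }
  return hits / expectedTokens.size; // coverage of the expected tokens
}

const score = tokenOverlapScore(
  'Orka is a framework for LLM apps.',
  'Orka AI is a TypeScript framework for LLM systems.',
);
// 6 of the 9 unique expected tokens are covered → score ≈ 0.67
```

A metric like this is cheap and deterministic, which makes it useful for regression tests, but it cannot judge meaning — paraphrased-but-correct answers score low, which is why LLM-judged metrics such as `faithfulness` exist alongside it.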