Evaluation & Testing
Measure and improve the quality of your LLM outputs with built-in metrics.
Orka's evaluation system helps you systematically test your RAG pipelines and LLM outputs against reference data.
[Diagram: Orka evaluation pipeline — EvalCase[] (input + expectedOutput + knowledge) → Retrieve → Generate → actualOutput → metric scoring (relevance 0.92, correctness 0.88, faithfulness 0.95, hallucination 0.05) → EvalSummary (totalCases: number, metrics: MetricStats, avgLatency: number). Stages: Dataset → Evaluation → Results.]
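The shapes flowing through the diagram can be sketched as plain TypeScript types, together with the min/avg/max aggregation that produces per-metric stats. The type and field names below are inferred from the diagram and the snippets that follow; they are an illustration, not the library's actual exports:

```typescript
// Hypothetical shapes inferred from the diagram; not orkajs's real exports.
interface EvalCase {
  input: string;
  expectedOutput: string;
  knowledge: string; // name of the knowledge base to retrieve from
}

interface MetricScore {
  name: string;  // e.g. 'relevance'
  score: number; // 0..1
}

interface MetricStats {
  average: number;
  min: number;
  max: number;
}

// Aggregate per-case metric scores into per-metric stats.
function aggregate(results: MetricScore[][]): Record<string, MetricStats> {
  const byName = new Map<string, number[]>();
  for (const caseScores of results) {
    for (const { name, score } of caseScores) {
      const list = byName.get(name) ?? [];
      list.push(score);
      byName.set(name, list);
    }
  }
  const stats: Record<string, MetricStats> = {};
  for (const [name, scores] of byName) {
    stats[name] = {
      average: scores.reduce((a, b) => a + b, 0) / scores.length,
      min: Math.min(...scores),
      max: Math.max(...scores),
    };
  }
  return stats;
}
```

This mirrors the avg/min/max figures the summary prints later in this page.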
1. Define the Evaluation Dataset
eval-dataset.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});

// Create a testing knowledge base
await orka.knowledge.create({
  name: 'docs',
  source: [
    'Orka AI is a TypeScript framework for LLM systems.',
    'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
    'Vector databases supported: Pinecone, Qdrant, Chroma, in-memory.',
    'Chunking is automatic with 1000 characters by default.',
  ],
});

// Define the evaluation dataset
const dataset: EvalCase[] = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
    knowledge: 'docs',
  },
  {
    input: 'What LLM providers are supported?',
    expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
    knowledge: 'docs',
  },
  {
    input: 'What vector databases can be used?',
    expectedOutput: 'Pinecone, Qdrant, Chroma and an in-memory adapter.',
    knowledge: 'docs',
  },
  {
    input: 'How does chunking work?',
    expectedOutput: 'Chunking is automatic with 1000 characters.',
    knowledge: 'docs',
  },
];

2. Run the Evaluation
run-eval.ts
// Run the evaluation
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
  onResult: (result, index) => {
    const scores = result.metrics.map(m => `${m.name}=${m.score.toFixed(2)}`).join(', ');
    console.log(`[${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
  },
});

console.log('\n📊 Summary:');
console.log(`  Total: ${summary.totalCases} cases`);
console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
console.log(`  Total tokens: ${summary.totalTokens}`);

console.log('\n  Metrics:');
for (const [name, stats] of Object.entries(summary.metrics)) {
  console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
}

3. Available Metrics
metrics.ts
// Built-in metrics
const metrics = [
  'relevance',     // Is the response relevant to the question?
  'correctness',   // Does the response match the expected output?
  'faithfulness',  // Is the response faithful to the provided context?
  'hallucination', // Does the response contain invented information?
  'coherence',     // Is the response coherent and well-structured?
  'conciseness',   // Is the response concise without unnecessary information?
];

// Use specific metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness'], // Only these 2 metrics
});

// Custom metrics
const customSummary = await orka.evaluate({
  dataset,
  metrics: ['relevance'],
  customMetrics: [
    {
      name: 'contains_keywords',
      evaluate: async (input, output, expected) => {
        const keywords = ['TypeScript', 'LLM', 'framework'];
        const found = keywords.filter(k => output.includes(k));
        return found.length / keywords.length;
      },
    },
  ],
});

4. Complete Example
evaluation-complete.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

async function main() {
  const orka = createOrka({
    llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
    vectorDB: new MemoryVectorAdapter(),
  });

  // Create a knowledge base
  await orka.knowledge.create({
    name: 'docs',
    source: [
      'Orka AI is a TypeScript framework for LLM systems.',
      'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
      'Vector bases: Pinecone, Qdrant, Chroma, in-memory.',
      'Chunking is automatic with 1000 characters by default.',
      'Orka AI includes an integrated evaluation system.',
    ],
  });

  // Evaluation dataset
  const dataset: EvalCase[] = [
    {
      input: 'What is Orka AI?',
      expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
      knowledge: 'docs',
    },
    {
      input: 'What LLM providers are supported?',
      expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
      knowledge: 'docs',
    },
    {
      input: 'What vector databases can be used?',
      expectedOutput: 'Pinecone, Qdrant, Chroma and in-memory.',
      knowledge: 'docs',
    },
    {
      input: 'How does chunking work?',
      expectedOutput: 'Chunking is automatic with 1000 characters.',
      knowledge: 'docs',
    },
  ];

  console.log('🧪 Evaluation in progress...\n');

  const summary = await orka.evaluate({
    dataset,
    metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
    onResult: (result, index) => {
      const scores = result.metrics.map(m => `${m.name}=${m.score.toFixed(2)}`).join(', ');
      console.log(`  [${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
    },
  });

  console.log('\n📊 Summary:');
  console.log(`  Total: ${summary.totalCases} cases`);
  console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
  console.log(`  Total tokens: ${summary.totalTokens}`);

  console.log('\n  Metrics:');
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
  }

  // Clean up the test knowledge base
  await orka.knowledge.delete('docs');
}

main().catch(console.error);

Understanding the Metrics
relevance (0-1)
Measures how relevant the response is to the question asked.
correctness (0-1)
Compares the response to the expected output for factual accuracy.
faithfulness (0-1)
Checks whether the response is grounded in the provided context.
hallucination (0-1)
Detects invented information absent from the context. Lower is better.
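Because hallucination is inverted (lower is better) while the other metrics reward higher scores, a pass/fail gate over the averaged scores has to treat it differently. A minimal sketch of such a gate, e.g. for CI; the thresholds and the function name are illustrative, not part of the library:

```typescript
// Hypothetical quality gate over averaged metric scores; thresholds are illustrative.
const HIGHER_IS_BETTER_MIN = 0.8; // relevance, correctness, faithfulness, ...
const HALLUCINATION_MAX = 0.1;    // hallucination: lower is better

function passesGate(averages: Record<string, number>): boolean {
  for (const [name, avg] of Object.entries(averages)) {
    if (name === 'hallucination') {
      if (avg > HALLUCINATION_MAX) return false; // too much invented content
    } else if (avg < HIGHER_IS_BETTER_MIN) {
      return false; // quality metric below the floor
    }
  }
  return true;
}
```

Fed with the averages from `summary.metrics`, a gate like this can fail a build when a prompt or retrieval change regresses output quality.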