OrkaJS

Evaluation & Testing

Measure and improve the quality of your LLM outputs with built-in metrics.

Orka's evaluation system helps you systematically test your RAG pipelines and LLM outputs against ground truth data.

Orka evaluation pipeline (diagram): a dataset of EvalCase entries (input + expectedOutput + knowledge) flows through Retrieve and Generate to produce an actualOutput, which is scored by each configured metric (e.g. relevance 0.92, correctness 0.88, faithfulness 0.95, hallucination 0.05). The run returns an EvalSummary with the total case count, per-metric MetricStats, and average latency.
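The shapes the pipeline passes around can be sketched as follows. This is a hypothetical reconstruction inferred from the examples on this page; the real `EvalCase` and `EvalSummary` types are exported by `orkajs` and may differ in detail.

```typescript
// Sketch of the evaluation types, inferred from the examples on this page.
// The actual definitions ship with orkajs.
interface EvalCase {
  input: string;           // the question to ask
  expectedOutput: string;  // ground-truth answer to compare against
  knowledge?: string;      // name of the knowledge base to retrieve from
}

interface MetricStats {
  average: number;
  min: number;
  max: number;
}

interface EvalSummary {
  totalCases: number;
  metrics: Record<string, MetricStats>; // one entry per metric name
  averageLatencyMs: number;
  totalTokens: number;
}

// Example summary for a four-case run
const example: EvalSummary = {
  totalCases: 4,
  metrics: { relevance: { average: 0.92, min: 0.85, max: 0.97 } },
  averageLatencyMs: 420,
  totalTokens: 1830,
};
```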

1. Define Evaluation Dataset

eval-dataset.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';
 
const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});
 
// Create a testing knowledge base
await orka.knowledge.create({
  name: 'docs',
  source: [
    'Orka AI is a TypeScript framework for LLM systems.',
    'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
    'Vector databases supported: Pinecone, Qdrant, Chroma, in-memory.',
    'Chunking is automatic with 1000 characters by default.',
  ],
});
 
// Define the evaluation dataset
const dataset: EvalCase[] = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
    knowledge: 'docs',
  },
  {
    input: 'What LLM providers are supported?',
    expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
    knowledge: 'docs',
  },
  {
    input: 'What vector databases can be used?',
    expectedOutput: 'Pinecone, Qdrant, Chroma and an in-memory adapter.',
    knowledge: 'docs',
  },
  {
    input: 'How does chunking work?',
    expectedOutput: 'Chunking is automatic with 1000 characters.',
    knowledge: 'docs',
  },
];

2. Run Evaluation

run-eval.ts
// Run the evaluation
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
  onResult: (result, index) => {
    const scores = result.metrics.map(m => `${m.name}=${m.score.toFixed(2)}`).join(', ');
    console.log(`[${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
  },
});
 
console.log('\n📊 Summary:');
console.log(`  Total: ${summary.totalCases} cases`);
console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
console.log(`  Total tokens: ${summary.totalTokens}`);
 
console.log('\n  Metrics:');
for (const [name, stats] of Object.entries(summary.metrics)) {
  console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
}

3. Available Metrics

metrics.ts
// Built-in metrics
const metrics = [
  'relevance',     // Is the response relevant to the question?
  'correctness',   // Does the response match the expected output?
  'faithfulness',  // Is the response faithful to the provided context?
  'hallucination', // Does the response contain invented information?
  'coherence',     // Is the response coherent and well-structured?
  'conciseness',   // Is the response concise without unnecessary information?
];
 
// Use specific metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness'], // Only these 2 metrics
});
 
// Custom metrics
const customSummary = await orka.evaluate({
  dataset,
  metrics: ['relevance'],
  customMetrics: [
    {
      name: 'contains_keywords',
      evaluate: async (input, output, expected) => {
        const keywords = ['TypeScript', 'LLM', 'framework'];
        const found = keywords.filter(k => output.includes(k));
        return found.length / keywords.length;
      },
    },
  ],
});

4. Complete Example

evaluation-complete.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';
 
async function main() {
  const orka = createOrka({
    llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
    vectorDB: new MemoryVectorAdapter(),
  });
 
  // Create a knowledge base
  await orka.knowledge.create({
    name: 'docs',
    source: [
      'Orka AI is a TypeScript framework for LLM systems.',
      'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
      'Vector databases: Pinecone, Qdrant, Chroma, in-memory.',
      'Chunking is automatic with 1000 characters by default.',
      'Orka AI includes an integrated evaluation system.',
    ],
  });
 
  // Evaluation dataset
  const dataset: EvalCase[] = [
    {
      input: 'What is Orka AI?',
      expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
      knowledge: 'docs',
    },
    {
      input: 'What LLM providers are supported?',
      expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
      knowledge: 'docs',
    },
    {
      input: 'What vector databases can be used?',
      expectedOutput: 'Pinecone, Qdrant, Chroma and in-memory.',
      knowledge: 'docs',
    },
    {
      input: 'How does chunking work?',
      expectedOutput: 'Chunking is automatic with 1000 characters.',
      knowledge: 'docs',
    },
  ];
 
  console.log('🧪 Evaluation in progress...\n');
 
  const summary = await orka.evaluate({
    dataset,
    metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
    onResult: (result, index) => {
      const scores = result.metrics.map(m => `${m.name}=${m.score.toFixed(2)}`).join(', ');
      console.log(`  [${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
    },
  });
 
  console.log('\n📊 Summary:');
  console.log(`  Total: ${summary.totalCases} cases`);
  console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
  console.log(`  Total tokens: ${summary.totalTokens}`);
  console.log('\n  Metrics:');
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
  }
 
  // Clean up the test knowledge base
  await orka.knowledge.delete('docs');
}
 
main().catch(console.error);

Understanding Metrics

relevance (0-1)

Measures how relevant the answer is to the question asked.

correctness (0-1)

Compares the answer to the expected output for factual accuracy.

faithfulness (0-1)

Checks if the answer is grounded in the provided context.

hallucination (0-1)

Detects fabricated information not present in the context. Lower is better.
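A common follow-up is to gate a CI run on these scores. The helper below is a hypothetical sketch, not part of orkajs: `failingMetrics` takes per-metric averages in the shape of the summary's metrics field and flags any score that crosses its threshold, treating hallucination as lower-is-better and all other metrics as higher-is-better.

```typescript
interface MetricStats {
  average: number;
  min: number;
  max: number;
}

// Metrics where a LOWER score is better (everything else: higher is better).
const LOWER_IS_BETTER = new Set(['hallucination']);

// Returns the names of metrics whose average violates its threshold.
function failingMetrics(
  metrics: Record<string, MetricStats>,
  thresholds: Record<string, number>,
): string[] {
  return Object.entries(thresholds)
    .filter(([name, threshold]) => {
      const stats = metrics[name];
      if (!stats) return false; // metric not evaluated in this run
      return LOWER_IS_BETTER.has(name)
        ? stats.average > threshold  // e.g. hallucination must stay below 0.1
        : stats.average < threshold; // e.g. relevance must stay above 0.8
    })
    .map(([name]) => name);
}

// Using scores like those from the pipeline above:
const failures = failingMetrics(
  {
    relevance: { average: 0.92, min: 0.85, max: 0.97 },
    hallucination: { average: 0.05, min: 0.0, max: 0.12 },
  },
  { relevance: 0.8, hallucination: 0.1 },
);
// failures is [] → the run passes
```

In CI, a non-empty result can simply throw or set a non-zero exit code, failing the build when quality regresses.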