Evaluation & Testing
Measure and improve the quality of your LLM outputs with built-in metrics.
Orka's evaluation system helps you systematically test your RAG pipelines and LLM outputs against ground truth data.
ORKA — EVALUATION PIPELINE

  Dataset ──→ Evaluation ──→ Results

  EvalCase[] (input + expectedOutput + knowledge)
      → Retrieve → Generate → actualOutput
      → Metrics Evaluation (sample scores: relevance 0.92, correctness 0.88,
        faithfulness 0.95, hallucination 0.05)
      → EvalSummary (totalCases: number, metrics: MetricStats, avgLatency: number)
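The shapes flowing through the pipeline above can be sketched as TypeScript types. These are inferred from the fields shown in the diagram, not copied from orkajs itself — the actual exported definitions may differ (for instance, the code examples below read the latency field as `summary.averageLatencyMs`):

```typescript
// Hypothetical type sketches inferred from the diagram above;
// the actual orkajs definitions may differ.
interface EvalCase {
  input: string;          // question sent to the pipeline
  expectedOutput: string; // ground-truth answer
  knowledge: string;      // name of the knowledge base to retrieve from
}

interface MetricStats {
  average: number;
  min: number;
  max: number;
}

interface EvalSummary {
  totalCases: number;
  metrics: Record<string, MetricStats>; // one entry per metric, e.g. 'relevance'
  avgLatency: number;                   // milliseconds
}

// A minimal case matching the dataset used throughout this page
const example: EvalCase = {
  input: 'What is Orka AI?',
  expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
  knowledge: 'docs',
};
```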
1. Define Evaluation Dataset
eval-dataset.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});

// Create a testing knowledge base
await orka.knowledge.create({
  name: 'docs',
  source: [
    'Orka AI is a TypeScript framework for LLM systems.',
    'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
    'Vector databases supported: Pinecone, Qdrant, Chroma, in-memory.',
    'Chunking is automatic with 1000 characters by default.',
  ],
});

// Define the evaluation dataset
const dataset: EvalCase[] = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
    knowledge: 'docs',
  },
  {
    input: 'What LLM providers are supported?',
    expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
    knowledge: 'docs',
  },
  {
    input: 'What vector databases can be used?',
    expectedOutput: 'Pinecone, Qdrant, Chroma and an in-memory adapter.',
    knowledge: 'docs',
  },
  {
    input: 'How does chunking work?',
    expectedOutput: 'Chunking is automatic with 1000 characters.',
    knowledge: 'docs',
  },
];

2. Run Evaluation
run-eval.ts
// Run the evaluation
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
  onResult: (result, index) => {
    const scores = result.metrics
      .map(m => `${m.name}=${m.score.toFixed(2)}`)
      .join(', ');
    console.log(`[${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
  },
});

console.log('\n📊 Summary:');
console.log(`  Total: ${summary.totalCases} cases`);
console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
console.log(`  Total tokens: ${summary.totalTokens}`);

console.log('\n  Metrics:');
for (const [name, stats] of Object.entries(summary.metrics)) {
  console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
}

3. Available Metrics
metrics.ts
// Built-in metrics
const metrics = [
  'relevance',     // Is the response relevant to the question?
  'correctness',   // Does the response match the expected output?
  'faithfulness',  // Is the response faithful to the provided context?
  'hallucination', // Does the response contain invented information?
  'coherence',     // Is the response coherent and well-structured?
  'conciseness',   // Is the response concise without unnecessary information?
];

// Use specific metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness'], // Only these 2 metrics
});

// Custom metrics
const customSummary = await orka.evaluate({
  dataset,
  metrics: ['relevance'],
  customMetrics: [
    {
      name: 'contains_keywords',
      evaluate: async (input, output, expected) => {
        const keywords = ['TypeScript', 'LLM', 'framework'];
        const found = keywords.filter(k => output.includes(k));
        return found.length / keywords.length;
      },
    },
  ],
});

4. Complete Example
evaluation-complete.ts
import { createOrka, OpenAIAdapter, MemoryVectorAdapter, type EvalCase } from 'orkajs';

async function main() {
  const orka = createOrka({
    llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
    vectorDB: new MemoryVectorAdapter(),
  });

  // Create a knowledge base
  await orka.knowledge.create({
    name: 'docs',
    source: [
      'Orka AI is a TypeScript framework for LLM systems.',
      'Orka AI supports OpenAI, Anthropic, Mistral and Ollama.',
      'Vector databases: Pinecone, Qdrant, Chroma, in-memory.',
      'Chunking is automatic with 1000 characters by default.',
      'Orka AI includes an integrated evaluation system.',
    ],
  });

  // Evaluation dataset
  const dataset: EvalCase[] = [
    {
      input: 'What is Orka AI?',
      expectedOutput: 'Orka AI is a TypeScript framework for LLM systems.',
      knowledge: 'docs',
    },
    {
      input: 'What LLM providers are supported?',
      expectedOutput: 'OpenAI, Anthropic, Mistral and Ollama.',
      knowledge: 'docs',
    },
    {
      input: 'What vector databases can be used?',
      expectedOutput: 'Pinecone, Qdrant, Chroma and in-memory.',
      knowledge: 'docs',
    },
    {
      input: 'How does chunking work?',
      expectedOutput: 'Chunking is automatic with 1000 characters.',
      knowledge: 'docs',
    },
  ];

  console.log('🧪 Evaluation in progress...\n');

  const summary = await orka.evaluate({
    dataset,
    metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
    onResult: (result, index) => {
      const scores = result.metrics
        .map(m => `${m.name}=${m.score.toFixed(2)}`)
        .join(', ');
      console.log(`  [${index + 1}/${dataset.length}] "${result.input.slice(0, 40)}..." → ${scores}`);
    },
  });

  console.log('\n📊 Summary:');
  console.log(`  Total: ${summary.totalCases} cases`);
  console.log(`  Average latency: ${Math.round(summary.averageLatencyMs)}ms`);
  console.log(`  Total tokens: ${summary.totalTokens}`);

  console.log('\n  Metrics:');
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.log(`    ${name}: avg=${stats.average.toFixed(2)}, min=${stats.min.toFixed(2)}, max=${stats.max.toFixed(2)}`);
  }

  await orka.knowledge.delete('docs');
}

main().catch(console.error);

Understanding Metrics
relevance (0-1)
Measures how relevant the answer is to the question asked.
correctness (0-1)
Compares the answer to the expected output for factual accuracy.
faithfulness (0-1)
Checks if the answer is grounded in the provided context.
hallucination (0-1)
Detects fabricated information not present in the context. Lower is better.
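The built-in metrics above are computed by orkajs itself. As a rough, standalone illustration of how a 0–1 score can be produced — in the spirit of the `contains_keywords` custom metric shown earlier — here is a simple token-overlap scorer. This is an illustrative sketch only, not how orkajs implements `correctness` or any other built-in metric:

```typescript
// Illustrative only: fraction of the expected answer's tokens that appear
// in the model output, giving a score in [0, 1]. Not the orkajs algorithm.
function tokenOverlapScore(output: string, expected: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter(Boolean);
  const expectedTokens = new Set(tokenize(expected));
  if (expectedTokens.size === 0) return 0;
  const outputTokens = new Set(tokenize(output));
  let hits = 0;
  for (const t of expectedTokens) {
    if (outputTokens.has(t)) hits++;
  }
  return hits / expectedTokens.size; // coverage of the expected tokens
}

const score = tokenOverlapScore(
  'Orka is a framework for LLM apps.',
  'Orka AI is a TypeScript framework for LLM systems.',
);
// 6 of the 9 unique expected tokens are covered → score ≈ 0.67
```

A metric like this is cheap and deterministic, which makes it useful for regression tests, but it cannot judge meaning — paraphrased-but-correct answers score low, which is why LLM-judged metrics such as `faithfulness` exist alongside it.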