OrkaJS

# Evaluation Metrics

Measure the quality of your LLM responses with built-in and custom evaluation metrics.

# Why Evaluation?

LLM outputs are non-deterministic and can vary in quality. Evaluation metrics help you measure and track the quality of your RAG system, detect regressions, and compare different configurations. Orka provides both built-in metrics and the ability to create custom ones.

dataset-evaluation.ts
```ts
import { createOrka, OpenAIAdapter, MemoryVectorDB } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorDB(),
});

// Define your evaluation dataset
const dataset = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'A TypeScript framework for building LLM applications.',
    knowledge: 'docs', // Knowledge base to use for RAG
  },
  {
    input: 'How do I install Orka?',
    expectedOutput: 'Run npm install orkajs',
    knowledge: 'docs',
  },
];

// Run evaluation with multiple metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
});

console.log(summary.metrics);
// {
//   relevance:     { average: 0.95, min: 0.9,  max: 1.0  },
//   correctness:   { average: 0.88, min: 0.85, max: 0.92 },
//   faithfulness:  { average: 0.92, min: 0.88, max: 0.96 },
//   hallucination: { average: 0.05, min: 0.0,  max: 0.1  },
// }

console.log(summary.results); // Detailed per-case results
console.log(summary.passed);  // true if all thresholds met
```

# Built-in Metrics

Orka provides five built-in metrics that cover the most important aspects of RAG quality:

relevance (higher is better)

Measures how relevant the generated answer is to the original question. Uses LLM-as-judge to score from 0 (completely irrelevant) to 1 (perfectly relevant).

Target: > 0.8 for production systems

correctness (higher is better)

Compares the generated answer to the expected output using semantic similarity. Accounts for paraphrasing and different wordings that convey the same meaning.

Target: > 0.7 (allows for paraphrasing)

faithfulness (higher is better)

Measures whether the answer is grounded in the retrieved context. A faithful answer only contains information that can be traced back to the source documents.

Target: > 0.85 for RAG systems

hallucination (lower is better)

Detects information in the answer that is NOT present in the provided context. This is the inverse of faithfulness and specifically flags fabricated content.

Target: < 0.15 for production systems

cost (informational)

Tracks token consumption per evaluation case. Useful for monitoring costs and optimizing prompt lengths.
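The cost metric reports tokens, not dollars; converting is simple arithmetic. A minimal sketch, where the per-1K-token prices are illustrative placeholders, not real provider rates:

```ts
// Convert token usage into an estimated dollar cost.
// Prompt and completion tokens are usually priced differently,
// so they are tallied separately.
function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricePer1K = { prompt: 0.0005, completion: 0.0015 } // placeholder rates
): number {
  return (
    (promptTokens / 1000) * pricePer1K.prompt +
    (completionTokens / 1000) * pricePer1K.completion
  );
}
```

Tracking this per evaluation case makes it easy to spot which prompts dominate spend.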

# Custom Metrics

Create custom metrics for domain-specific quality checks. A custom metric is an async function that receives the evaluation context and returns a score.

custom-metrics.ts
```ts
import type { MetricFn } from 'orkajs';

// Custom metric: Check professionalism of tone
const toneCheck: MetricFn = async ({ input, output, llm }) => {
  const result = await llm.generate(
    `Rate the professionalism of this response on a scale of 0.0 to 1.0.
Question: ${input}
Response: ${output}
Reply with ONLY a number.`,
    { temperature: 0, maxTokens: 10 }
  );

  const score = parseFloat(result.content.trim());
  return {
    name: 'professionalism',
    score: isNaN(score) ? 0 : Math.min(1, Math.max(0, score)),
  };
};

// Custom metric: Check response length
const lengthCheck: MetricFn = async ({ output }) => {
  // trim + filter so empty or padded outputs don't inflate the count
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  // Penalize very short or very long responses
  const idealLength = 50;
  const score = Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
  return { name: 'length_appropriateness', score };
};

// Use alongside built-in metrics
const summary = await orka.evaluate({
  dataset: [...],
  metrics: ['relevance', 'faithfulness', toneCheck, lengthCheck],
});
```
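The length score is a pure function, so its shape is easy to check in isolation. A minimal sketch of the same linear penalty (the 50-word ideal is the same illustrative constant used in lengthCheck):

```ts
// Score falls off linearly as word count deviates from an ideal length,
// clamped at 0: exactly idealLength words scores 1.0, and anything
// 2x the ideal (or longer) scores 0.
function lengthScore(output: string, idealLength = 50): number {
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  return Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
}
```

With the default ideal of 50, a 25-word answer scores 0.5 and a 100-word answer scores 0, which is why the ideal should be tuned per use case.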

# MetricFn Context

- `input: string`: The original question/input from the dataset
- `output: string`: The generated answer from the LLM
- `expectedOutput?: string`: The expected answer from the dataset (if provided)
- `context?: ChunkResult[]`: Retrieved chunks used to generate the answer
- `llm: LLMAdapter`: LLM adapter for making additional calls (e.g., LLM-as-judge)
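The `context` field enables metrics that inspect the retrieved chunks, not just the answer. As a sketch, here is a crude, LLM-free groundedness proxy; the `text` field on the chunk is an assumption for illustration, since the `ChunkResult` shape isn't documented here:

```ts
// Assumed minimal chunk shape for this sketch.
interface Chunk {
  text: string;
}

// Fraction of answer words that also appear somewhere in the retrieved
// chunks: a cheap, deterministic proxy for groundedness. It is far less
// precise than the LLM-judged faithfulness metric, but costs no tokens.
function contextOverlap(output: string, context: Chunk[]): number {
  const sourceWords = new Set(
    context.flatMap((c) => c.text.toLowerCase().split(/\W+/)).filter(Boolean)
  );
  const answerWords = output.toLowerCase().split(/\W+/).filter(Boolean);
  if (answerWords.length === 0) return 0;
  const grounded = answerWords.filter((w) => sourceWords.has(w)).length;
  return grounded / answerWords.length;
}
```

A metric like this could be wrapped in a `MetricFn` that reads `context` and returns `{ name: 'context_overlap', score }`.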

# Evaluation Result

```ts
interface EvaluationSummary {
  metrics: {
    [metricName: string]: {
      average: number; // Average score across all cases
      min: number;     // Minimum score
      max: number;     // Maximum score
      stdDev: number;  // Standard deviation
    };
  };
  results: EvaluationResult[]; // Per-case detailed results
  passed: boolean;             // True if all thresholds met
  totalCases: number;          // Number of test cases
  totalLatencyMs: number;      // Total evaluation time
  totalTokens: number;         // Total tokens consumed
}
```
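The per-metric statistics are plain aggregates over per-case scores. A sketch of how `average`, `min`, `max`, and `stdDev` relate, assuming population (not sample) standard deviation:

```ts
interface MetricStats {
  average: number;
  min: number;
  max: number;
  stdDev: number;
}

// Aggregate one metric's per-case scores into the summary shape above.
function aggregate(scores: number[]): MetricStats {
  const average = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - average) ** 2, 0) / scores.length;
  return {
    average,
    min: Math.min(...scores),
    max: Math.max(...scores),
    stdDev: Math.sqrt(variance),
  };
}
```

A high `average` with a high `stdDev` usually means a few badly failing cases are hiding behind good ones, which is why the per-case `results` array is worth inspecting.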

# 💡 Best Practices

- Use all four quality metrics for RAG: relevance, correctness, faithfulness, hallucination
- A good RAG system should have faithfulness > 0.85 and hallucination < 0.15
- Create custom metrics for domain-specific quality checks (tone, length, format)
- Run evaluations in CI/CD to catch regressions before deployment
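For the CI/CD case, a gate can be a few lines over `summary.metrics`. How thresholds are configured inside `orka.evaluate` isn't shown above, so this sketch applies them externally; the threshold values echo the targets listed earlier:

```ts
type Direction = 'above' | 'below';

// Gate aggregated metrics against per-metric thresholds. The direction
// flag handles metrics like hallucination, where lower is better.
function meetsThresholds(
  metrics: Record<string, { average: number }>,
  thresholds: Record<string, { value: number; direction: Direction }>
): boolean {
  return Object.entries(thresholds).every(([name, t]) => {
    const m = metrics[name];
    if (!m) return false; // a missing metric fails the gate
    return t.direction === 'above' ? m.average > t.value : m.average < t.value;
  });
}
```

In a CI script, exit non-zero when the gate fails so the pipeline blocks the deploy.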

# Tree-shaking Imports

```ts
// ✅ Import evaluation types
import type { MetricFn, EvaluationSummary } from 'orkajs';

// ✅ Import built-in metrics individually
import { relevance, correctness, faithfulness, hallucination } from 'orkajs/evaluation';
```