OrkaJS

# Evaluation Metrics

Measure the quality of your LLM responses with built-in and custom evaluation metrics.

# Why Evaluation?

LLM outputs are non-deterministic and can vary in quality. Evaluation metrics help you measure and track the quality of your RAG system, detect regressions, and compare different configurations. Orka provides both built-in metrics and the ability to create custom ones.

dataset-evaluation.ts
```ts
import { createOrka, OpenAIAdapter, MemoryVectorDB } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorDB(),
});

// Define your evaluation dataset
const dataset = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'A TypeScript framework for building LLM applications.',
    knowledge: 'docs', // Knowledge base to use for RAG
  },
  {
    input: 'How do I install Orka?',
    expectedOutput: 'Run npm install orkajs',
    knowledge: 'docs',
  },
];

// Run evaluation with multiple metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
});

console.log(summary.metrics);
// {
//   relevance:     { average: 0.95, min: 0.9,  max: 1.0  },
//   correctness:   { average: 0.88, min: 0.85, max: 0.92 },
//   faithfulness:  { average: 0.92, min: 0.88, max: 0.96 },
//   hallucination: { average: 0.05, min: 0.0,  max: 0.1  },
// }

console.log(summary.results); // Detailed per-case results
console.log(summary.passed);  // true if all thresholds met
```

# Built-in Metrics

Orka provides five built-in metrics that cover the most important aspects of RAG quality:

relevance (higher is better)

Measures how relevant the generated answer is to the original question. Uses LLM-as-judge to score from 0 (completely irrelevant) to 1 (perfectly relevant).

Target: > 0.8 for production systems

correctness (higher is better)

Compares the generated answer to the expected output using semantic similarity. Accounts for paraphrasing and different wordings that convey the same meaning.

Target: > 0.7 (allows for paraphrasing)

faithfulness (higher is better)

Measures whether the answer is grounded in the retrieved context. A faithful answer only contains information that can be traced back to the source documents.

Target: > 0.85 for RAG systems

hallucination (lower is better)

Detects information in the answer that is NOT present in the provided context. This is the inverse of faithfulness and specifically flags fabricated content.

Target: < 0.15 for production systems

cost (informational)

Tracks token consumption per evaluation case. Useful for monitoring costs and optimizing prompt lengths.
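The cost metric reports tokens, not dollars; converting is simple arithmetic. A minimal sketch, where the per-1K-token prices are illustrative placeholders, not real provider rates:

```ts
// Convert token usage into an estimated dollar cost.
// Prompt and completion tokens are usually priced differently,
// so they are tallied separately.
function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricePer1K = { prompt: 0.0005, completion: 0.0015 } // placeholder rates
): number {
  return (
    (promptTokens / 1000) * pricePer1K.prompt +
    (completionTokens / 1000) * pricePer1K.completion
  );
}
```

Tracking this per evaluation case makes it easy to spot which prompts dominate spend.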

# Custom Metrics

Create custom metrics for domain-specific quality checks. A custom metric is an async function that receives the evaluation context and returns a score.

custom-metrics.ts
```ts
import type { MetricFn } from 'orkajs';

// Custom metric: Check professionalism of tone
const toneCheck: MetricFn = async ({ input, output, llm }) => {
  const result = await llm.generate(
    `Rate the professionalism of this response on a scale of 0.0 to 1.0.
Question: ${input}
Response: ${output}
Reply with ONLY a number.`,
    { temperature: 0, maxTokens: 10 }
  );

  const score = parseFloat(result.content.trim());
  return {
    name: 'professionalism',
    score: isNaN(score) ? 0 : Math.min(1, Math.max(0, score)),
  };
};

// Custom metric: Check response length
const lengthCheck: MetricFn = async ({ output }) => {
  // trim + filter so empty or padded outputs don't inflate the count
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  // Penalize very short or very long responses
  const idealLength = 50;
  const score = Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
  return { name: 'length_appropriateness', score };
};

// Use alongside built-in metrics
const summary = await orka.evaluate({
  dataset: [...],
  metrics: ['relevance', 'faithfulness', toneCheck, lengthCheck],
});
```
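The length score is a pure function, so its shape is easy to check in isolation. A minimal sketch of the same linear penalty (the 50-word ideal is the same illustrative constant used in lengthCheck):

```ts
// Score falls off linearly as word count deviates from an ideal length,
// clamped at 0: exactly idealLength words scores 1.0, and anything
// 2x the ideal (or longer) scores 0.
function lengthScore(output: string, idealLength = 50): number {
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  return Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
}
```

With the default ideal of 50, a 25-word answer scores 0.5 and a 100-word answer scores 0, which is why the ideal should be tuned per use case.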

# MetricFn Context

- `input: string`: The original question/input from the dataset
- `output: string`: The generated answer from the LLM
- `expectedOutput?: string`: The expected answer from the dataset (if provided)
- `context?: ChunkResult[]`: Retrieved chunks used to generate the answer
- `llm: LLMAdapter`: LLM adapter for making additional calls (e.g., LLM-as-judge)
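The `context` field enables metrics that inspect the retrieved chunks, not just the answer. As a sketch, here is a crude, LLM-free groundedness proxy; the `text` field on the chunk is an assumption for illustration, since the `ChunkResult` shape isn't documented here:

```ts
// Assumed minimal chunk shape for this sketch.
interface Chunk {
  text: string;
}

// Fraction of answer words that also appear somewhere in the retrieved
// chunks: a cheap, deterministic proxy for groundedness. It is far less
// precise than the LLM-judged faithfulness metric, but costs no tokens.
function contextOverlap(output: string, context: Chunk[]): number {
  const sourceWords = new Set(
    context.flatMap((c) => c.text.toLowerCase().split(/\W+/)).filter(Boolean)
  );
  const answerWords = output.toLowerCase().split(/\W+/).filter(Boolean);
  if (answerWords.length === 0) return 0;
  const grounded = answerWords.filter((w) => sourceWords.has(w)).length;
  return grounded / answerWords.length;
}
```

A metric like this could be wrapped in a `MetricFn` that reads `context` and returns `{ name: 'context_overlap', score }`.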

# Evaluation Result

```ts
interface EvaluationSummary {
  metrics: {
    [metricName: string]: {
      average: number; // Average score across all cases
      min: number;     // Minimum score
      max: number;     // Maximum score
      stdDev: number;  // Standard deviation
    };
  };
  results: EvaluationResult[]; // Per-case detailed results
  passed: boolean;             // True if all thresholds met
  totalCases: number;          // Number of test cases
  totalLatencyMs: number;      // Total evaluation time
  totalTokens: number;         // Total tokens consumed
}
```
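The per-metric statistics are plain aggregates over per-case scores. A sketch of how `average`, `min`, `max`, and `stdDev` relate, assuming population (not sample) standard deviation:

```ts
interface MetricStats {
  average: number;
  min: number;
  max: number;
  stdDev: number;
}

// Aggregate one metric's per-case scores into the summary shape above.
function aggregate(scores: number[]): MetricStats {
  const average = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - average) ** 2, 0) / scores.length;
  return {
    average,
    min: Math.min(...scores),
    max: Math.max(...scores),
    stdDev: Math.sqrt(variance),
  };
}
```

A high `average` with a high `stdDev` usually means a few badly failing cases are hiding behind good ones, which is why the per-case `results` array is worth inspecting.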

# 💡 Best Practices

- Use all four quality metrics for RAG: relevance, correctness, faithfulness, hallucination
- A good RAG system should have faithfulness > 0.85 and hallucination < 0.15
- Create custom metrics for domain-specific quality checks (tone, length, format)
- Run evaluations in CI/CD to catch regressions before deployment
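For the CI/CD case, a gate can be a few lines over `summary.metrics`. How thresholds are configured inside `orka.evaluate` isn't shown above, so this sketch applies them externally; the threshold values echo the targets listed earlier:

```ts
type Direction = 'above' | 'below';

// Gate aggregated metrics against per-metric thresholds. The direction
// flag handles metrics like hallucination, where lower is better.
function meetsThresholds(
  metrics: Record<string, { average: number }>,
  thresholds: Record<string, { value: number; direction: Direction }>
): boolean {
  return Object.entries(thresholds).every(([name, t]) => {
    const m = metrics[name];
    if (!m) return false; // a missing metric fails the gate
    return t.direction === 'above' ? m.average > t.value : m.average < t.value;
  });
}
```

In a CI script, exit non-zero when the gate fails so the pipeline blocks the deploy.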

# Tree-shaking Imports

```ts
// ✅ Import evaluation types
import type { MetricFn, EvaluationSummary } from 'orkajs';

// ✅ Import built-in metrics individually
import { relevance, correctness, faithfulness, hallucination } from 'orkajs/evaluation';
```