# Caching
Avoid redundant LLM and embedding calls with intelligent caching. Reduce latency, save costs, and improve throughput with in-memory or Redis-backed caches.
# Why Caching?
LLM API calls are expensive and slow. When the same prompt is sent multiple times (e.g., repeated user questions, batch processing, or development iterations), caching avoids redundant API calls by returning previously computed results instantly.
- Near-instant cache hits (vs. a 500-3000 ms API call)
- Zero API cost per cached response
- Deterministic results for repeated prompts
# Architecture
Orka AI's caching system is composed of three layers that work together:
1. Cache Store (Storage Backend)
Where cached data is stored. Choose between MemoryCache (in-process, fast, no dependencies) or RedisCache (distributed, persistent, shared across instances).
2. CachedLLM (LLM Response Caching)
Wraps any LLMAdapter and caches generate() responses. Same prompt + options = instant cached response.
3. CachedEmbeddings (Embedding Caching)
Caches embedding vectors per text input. Avoids re-computing embeddings for already-seen text chunks.
# MemoryCache (In-Memory)
The simplest cache store. Data is stored in-process using a Map. No external dependencies required. Ideal for development, single-instance servers, and short-lived processes.
```ts
import { MemoryCache } from 'orkajs/cache/memory';

const cache = new MemoryCache({
  maxSize: 1000,          // Maximum number of entries (default: 1000)
  ttlMs: 1000 * 60 * 30,  // Time-to-live: 30 minutes (optional)
  namespace: 'my-app'     // Key prefix for isolation (optional)
});

// Basic operations
await cache.set('key', { data: 'value' });
const value = await cache.get('key');   // { data: 'value' }
const exists = await cache.has('key');  // true
await cache.delete('key');
await cache.clear();

// Get cache statistics
const stats = cache.getStats();
console.log(stats);
// { hits: 42, misses: 8, size: 150, hitRate: 0.84 }
```

**MemoryCache Features**
- ✅ Zero dependencies: works out of the box
- ✅ TTL support: entries expire automatically
- ✅ Max size with LRU-like eviction (oldest entries evicted first)
- ✅ Namespace isolation: multiple caches in one instance
- ✅ Hit/miss statistics for monitoring
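The oldest-first eviction described above can be sketched with a plain `Map`, which iterates in insertion order in JavaScript. This is an illustrative sketch of the idea, not MemoryCache's actual implementation:

```ts
// Minimal sketch of oldest-first eviction on top of a Map.
// A Map iterates in insertion order, so its first key is the oldest entry.
class TinyCache<T> {
  private store = new Map<string, T>();
  constructor(private maxSize: number) {}

  set(key: string, value: T): void {
    // Delete-then-set moves an existing key to the newest position.
    this.store.delete(key);
    this.store.set(key, value);
    if (this.store.size > this.maxSize) {
      // Evict the oldest entry (first inserted).
      const oldest = this.store.keys().next().value as string;
      this.store.delete(oldest);
    }
  }

  get(key: string): T | undefined {
    return this.store.get(key);
  }
}

const c = new TinyCache<number>(2);
c.set('a', 1);
c.set('b', 2);
c.set('c', 3);           // exceeds maxSize, evicts 'a'
console.log(c.get('a')); // undefined
console.log(c.get('c')); // 3
```

A real store would also stamp each entry with an expiry time and treat expired entries as misses.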
# RedisCache (Distributed)
For production environments with multiple server instances, Redis provides a shared, persistent cache. Data survives server restarts and is accessible from any instance.
**Installation Required**
RedisCache requires the redis package:
```sh
npm install redis
```

```ts
import { RedisCache } from 'orkajs/cache/redis';

const cache = new RedisCache({
  url: 'redis://localhost:6379',  // Redis connection URL
  keyPrefix: 'orka:',             // Key prefix (default: 'orka:')
  ttlMs: 1000 * 60 * 60,          // TTL: 1 hour (optional)
});

// Connect to Redis (auto-connects on first operation)
await cache.connect();

// Same API as MemoryCache
await cache.set('key', { data: 'value' });
const value = await cache.get('key');

// Disconnect when done
await cache.disconnect();
```

**MemoryCache**
- ✅ No dependencies
- ✅ Fastest (in-process)
- ❌ Lost on restart
- ❌ Not shared between instances
**RedisCache**
- ✅ Persistent across restarts
- ✅ Shared between instances
- ❌ Requires Redis server
- ❌ Network latency (~1ms)
# CachedLLM: LLM Response Caching
CachedLLM wraps any LLMAdapter and transparently caches responses. It implements the LLMAdapter interface, so you can use it as a drop-in replacement anywhere you use an LLM.
```ts
import { createOrka } from 'orkajs';
import { OpenAIAdapter } from 'orkajs/adapters/openai';
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedLLM } from 'orkajs/cache/llm';

// 1. Create your LLM adapter
const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-mini'
});

// 2. Create a cache store
const cache = new MemoryCache({ maxSize: 500, ttlMs: 1000 * 60 * 30 });

// 3. Wrap with CachedLLM
const cachedLLM = new CachedLLM(llm, cache, {
  ttlMs: 1000 * 60 * 60  // Override TTL: cache for 1 hour
});

// First call → hits the API (~800ms)
const result1 = await cachedLLM.generate('What is TypeScript?');
console.log(result1.content); // "TypeScript is a typed superset..."

// Second call → instant from cache (~0ms)
const result2 = await cachedLLM.generate('What is TypeScript?');
console.log(result2.content); // Same response, from cache

// Use in Orka config: transparent replacement
const orka = createOrka({
  llm: cachedLLM,  // ← Drop-in replacement
  vectorDB: /* ... */
});

// All orka.ask(), orka.generate(), etc. now use caching
const answer = await orka.ask({
  question: 'What is TypeScript?',
  knowledge: 'docs'
});
```

**Cache Key Generation**
The cache key is generated from the prompt AND the options (temperature, maxTokens, systemPrompt, etc.). This means:
- Same prompt + same options = cache hit ✅
- Same prompt + different temperature = cache miss ❌ (different key)
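One plausible way to derive such a key is to hash the prompt together with a canonicalized options object. This is an illustrative sketch using Node's built-in `crypto` module; the library's actual key format may differ:

```ts
import { createHash } from 'node:crypto';

// Sketch: derive a cache key from the prompt plus the options object.
// Sorting the option keys makes the key independent of property order.
function cacheKey(prompt: string, options: Record<string, unknown> = {}): string {
  const sorted = Object.keys(options)
    .sort()
    .map((k) => [k, options[k]]);
  const payload = JSON.stringify([prompt, sorted]);
  return createHash('sha256').update(payload).digest('hex');
}

const a = cacheKey('What is TypeScript?', { temperature: 0 });
const b = cacheKey('What is TypeScript?', { temperature: 0 });
const c = cacheKey('What is TypeScript?', { temperature: 0.7 });

console.log(a === b); // true  → same key, cache hit
console.log(a === c); // false → different key, cache miss
```

Hashing keeps keys short and uniform regardless of prompt length, which matters when keys are stored in Redis.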
# CachedEmbeddings: Embedding Caching
Embedding the same text multiple times is wasteful. CachedEmbeddings caches embedding vectors per text input and batches only uncached texts to the API.
```ts
import { OpenAIAdapter } from 'orkajs/adapters/openai';
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedEmbeddings } from 'orkajs/cache/embeddings';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! });
const cache = new MemoryCache({ maxSize: 10000 });

const cachedEmbed = new CachedEmbeddings(llm, cache, {
  ttlMs: 1000 * 60 * 60 * 24  // Cache embeddings for 24 hours
});

// First call → computes all 3 embeddings via API
const embeddings1 = await cachedEmbed.embed([
  'Hello world',
  'TypeScript is great',
  'Orka AI framework'
]);

// Second call → all 3 from cache (0 API calls)
const embeddings2 = await cachedEmbed.embed([
  'Hello world',
  'TypeScript is great',
  'Orka AI framework'
]);

// Mixed call → only "New text here" hits the API, others come from cache
const embeddings3 = await cachedEmbed.embed([
  'Hello world',       // ← from cache
  'New text here',     // ← API call
  'Orka AI framework'  // ← from cache
]);
```

**Smart Batching**
CachedEmbeddings checks the cache for each text individually, then batches only the uncached texts into a single API call. This means if you embed 100 texts and 80 are cached, only 20 are sent to the API in one batch.
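The check-then-batch flow can be sketched as follows. This is a simplified illustration, not the library's code: `embedBatch` stands in for the underlying embeddings API call, and a plain `Map` stands in for the cache store:

```ts
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Sketch of smart batching: look up each text in the cache, send only
// the misses to the API in one batch, then merge the new vectors back
// into their original positions.
async function embedWithCache(
  texts: string[],
  cache: Map<string, number[]>,
  embedBatch: EmbedFn
): Promise<number[][]> {
  const results: number[][] = new Array(texts.length);
  const missIndexes: number[] = [];

  texts.forEach((text, i) => {
    const hit = cache.get(text);
    if (hit) results[i] = hit;
    else missIndexes.push(i);
  });

  if (missIndexes.length > 0) {
    // One batched API call for all uncached texts
    const vectors = await embedBatch(missIndexes.map((i) => texts[i]));
    missIndexes.forEach((i, j) => {
      cache.set(texts[i], vectors[j]);
      results[i] = vectors[j];
    });
  }
  return results;
}

// Fake embedder that counts API calls
let apiCalls = 0;
const fakeEmbed: EmbedFn = async (texts) => {
  apiCalls++;
  return texts.map((t) => [t.length]); // dummy 1-dimensional "vector"
};

const cache = new Map<string, number[]>();
await embedWithCache(['a', 'bb', 'ccc'], cache, fakeEmbed); // 1 API call (3 misses)
await embedWithCache(['a', 'new', 'ccc'], cache, fakeEmbed); // 1 more, only for 'new'
console.log(apiCalls); // 2
```

The key property is that the number of API calls depends on the number of *batches with misses*, not the number of texts.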
# Production Setup with Redis
For production, use RedisCache to share the cache across multiple server instances and persist data across restarts.
```ts
import { createOrka, OpenAIAdapter } from 'orkajs/core';
import { RedisCache } from 'orkajs/cache/redis';
import { CachedLLM } from 'orkajs/cache/llm';
import { CachedEmbeddings } from 'orkajs/cache/embeddings';

// Shared Redis cache
const redisCache = new RedisCache({
  url: process.env.REDIS_URL!,  // e.g. 'redis://redis:6379'
  keyPrefix: 'orka:prod:',
  ttlMs: 1000 * 60 * 60 * 4     // 4 hours
});

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-mini'
});

// Cache both LLM responses and embeddings
const cachedLLM = new CachedLLM(llm, redisCache);
const cachedEmbed = new CachedEmbeddings(llm, redisCache);

const orka = createOrka({
  llm: cachedLLM,
  vectorDB: /* ... */
});

// All operations now use Redis-backed caching
const result = await orka.ask({
  question: 'How do I deploy my app?',
  knowledge: 'documentation'
});

// Monitor cache performance
const stats = redisCache.getStats();
console.log(`Cache hit rate: ${(stats.hitRate * 100).toFixed(1)}%`);

// Cleanup on shutdown
process.on('SIGTERM', async () => {
  await redisCache.disconnect();
});
```

# Custom Cache Store
Implement the CacheStore interface to create your own cache backend (e.g., DynamoDB, Memcached, SQLite).
```ts
import type { CacheStore } from 'orkajs/cache';

class DynamoDBCache implements CacheStore {
  readonly name = 'dynamodb-cache';

  async get<T>(key: string): Promise<T | undefined> {
    // Your DynamoDB get logic
  }

  async set<T>(key: string, value: T, ttlMs?: number): Promise<void> {
    // Your DynamoDB put logic
  }

  async delete(key: string): Promise<boolean> {
    // Your DynamoDB delete logic
  }

  async clear(): Promise<void> {
    // Your DynamoDB scan + delete logic
  }

  async has(key: string): Promise<boolean> {
    // Your DynamoDB exists check
  }
}

// Use with CachedLLM
const cache = new DynamoDBCache();
const cachedLLM = new CachedLLM(llm, cache);
```

# Best Practices
1. Set Appropriate TTLs
Short TTL (5-30 min) for dynamic content. Long TTL (hours/days) for stable knowledge bases. No TTL for immutable data like embeddings.
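As a sketch, these guidelines might translate into TTL choices like the following. The values are illustrative, using the `MemoryCache` options shown earlier:

```ts
import { MemoryCache } from 'orkajs/cache/memory';

// Dynamic content: short TTL so answers refresh quickly
const dynamicCache = new MemoryCache({ ttlMs: 1000 * 60 * 10 });      // 10 minutes

// Stable knowledge base: long TTL
const stableCache = new MemoryCache({ ttlMs: 1000 * 60 * 60 * 24 });  // 24 hours

// Immutable data (e.g. embeddings): no TTL, entries never expire,
// bounded only by maxSize eviction
const embeddingCache = new MemoryCache({ maxSize: 10000 });
```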
2. Monitor Hit Rates
Use getStats() to monitor cache effectiveness. A hit rate below 50% may indicate the cache is too small or TTL too short.
3. Don't Cache Non-Deterministic Calls
If you use a high temperature (>0.8) for creative generation, a cache hit returns the same output on every call, which defeats the purpose of sampling varied responses. Consider disabling the cache for creative tasks.
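One simple pattern is to route by temperature: send high-temperature calls to the raw adapter and everything else through the cached wrapper. A minimal sketch, assuming both clients expose the same `generate` interface (the `Generator` type and stand-in objects below are illustrative, not part of the library):

```ts
interface Generator {
  generate(prompt: string, options?: { temperature?: number }): Promise<string>;
}

// Sketch: route high-temperature (creative) calls past the cache.
function pickClient(raw: Generator, cached: Generator, temperature = 0): Generator {
  // Above 0.8 we expect varied outputs on each call, so skip the cache.
  return temperature > 0.8 ? raw : cached;
}

// Stand-ins for the raw adapter and its cached wrapper
const raw: Generator = { generate: async () => 'fresh' };
const cached: Generator = { generate: async () => 'cached' };

console.log(pickClient(raw, cached, 0.9) === raw);    // true → bypasses cache
console.log(pickClient(raw, cached, 0.2) === cached); // true → uses cache
```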
4. Use Namespaces
Use different namespaces or key prefixes for different environments (dev, staging, prod) to avoid cache pollution.
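The isolation this buys can be seen in a minimal sketch of key-prefix namespacing over a single shared store, independent of the library:

```ts
// Sketch of key-prefix isolation over one shared store: each
// environment reads and writes under its own prefix, so dev and
// prod entries can never collide even in the same backend.
function namespaced(store: Map<string, string>, ns: string) {
  return {
    set: (key: string, value: string) => store.set(`${ns}:${key}`, value),
    get: (key: string) => store.get(`${ns}:${key}`),
  };
}

const shared = new Map<string, string>();
const dev = namespaced(shared, 'orka:dev');
const prod = namespaced(shared, 'orka:prod');

dev.set('answer', 'dev-value');
prod.set('answer', 'prod-value');

console.log(dev.get('answer'));  // 'dev-value'
console.log(prod.get('answer')); // 'prod-value'
```

This is exactly what the `namespace` option on MemoryCache and the `keyPrefix` option on RedisCache provide.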
# Tree-shaking Imports
```ts
// ✅ Import only what you need
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedLLM } from 'orkajs/cache/llm';

// ✅ Or import from the index
import { MemoryCache, RedisCache, CachedLLM, CachedEmbeddings } from 'orkajs/cache';
```