# Caching
Avoid redundant LLM and embedding calls with intelligent caching. Reduce latency, save costs, and improve throughput with in-memory or Redis-backed caches.
# Why Caching?
LLM API calls are expensive and slow. When the same prompt is sent multiple times (e.g., repeated user questions, batch processing, or development iterations), caching avoids redundant API calls by returning previously computed results instantly.
- Near-instant cache hits (vs. a 500-3000 ms API call)
- Zero API cost per cached response
- Deterministic results for repeated prompts
# Architecture
Orka AI's caching system is composed of three layers that work together:
1. Cache Store (Storage Backend)
Where cached data is stored. Choose between MemoryCache (in-process, fast, no dependencies) or RedisCache (distributed, persistent, shared across instances).
2. CachedLLM (LLM Response Caching)
Wraps any LLMAdapter and caches generate() responses. Same prompt + options = instant cached response.
3. CachedEmbeddings (Embedding Caching)
Caches embedding vectors per text input. Avoids re-computing embeddings for already-seen text chunks.
# MemoryCache (In-Memory)
The simplest cache store. Data is stored in-process using a Map. No external dependencies required. Ideal for development, single-instance servers, and short-lived processes.
```ts
import { MemoryCache } from 'orkajs/cache/memory';

const cache = new MemoryCache({
  maxSize: 1000,          // Maximum number of entries (default: 1000)
  ttlMs: 1000 * 60 * 30,  // Time-to-live: 30 minutes (optional)
  namespace: 'my-app'     // Key prefix for isolation (optional)
});

// Basic operations
await cache.set('key', { data: 'value' });
const value = await cache.get('key');   // { data: 'value' }
const exists = await cache.has('key');  // true
await cache.delete('key');
await cache.clear();

// Get cache statistics
const stats = cache.getStats();
console.log(stats);
// { hits: 42, misses: 8, size: 150, hitRate: 0.84 }
```

**MemoryCache Features**
- ✅ Zero dependencies: works out of the box
- ✅ TTL support: entries expire automatically
- ✅ Max size with LRU-like eviction (oldest entries evicted first)
- ✅ Namespace isolation: multiple caches in one instance
- ✅ Hit/miss statistics for monitoring
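The oldest-first eviction described above can be sketched with a plain `Map`, which iterates in insertion order in JavaScript. This is an illustrative sketch of the idea, not MemoryCache's actual implementation:

```ts
// Minimal sketch of oldest-first eviction on top of a Map.
// A Map iterates in insertion order, so its first key is the oldest entry.
class TinyCache<T> {
  private store = new Map<string, T>();
  constructor(private maxSize: number) {}

  set(key: string, value: T): void {
    // Delete-then-set moves an existing key to the newest position.
    this.store.delete(key);
    this.store.set(key, value);
    if (this.store.size > this.maxSize) {
      // Evict the oldest entry (first inserted).
      const oldest = this.store.keys().next().value as string;
      this.store.delete(oldest);
    }
  }

  get(key: string): T | undefined {
    return this.store.get(key);
  }
}

const c = new TinyCache<number>(2);
c.set('a', 1);
c.set('b', 2);
c.set('c', 3);           // exceeds maxSize, evicts 'a'
console.log(c.get('a')); // undefined
console.log(c.get('c')); // 3
```

A real store would also stamp each entry with an expiry time and treat expired entries as misses.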
# RedisCache (Distributed)
For production environments with multiple server instances, Redis provides a shared, persistent cache. Data survives server restarts and is accessible from any instance.
**Installation Required**
RedisCache requires the redis package:
```sh
npm install redis
```

```ts
import { RedisCache } from 'orkajs/cache/redis';

const cache = new RedisCache({
  url: 'redis://localhost:6379',  // Redis connection URL
  keyPrefix: 'orka:',             // Key prefix (default: 'orka:')
  ttlMs: 1000 * 60 * 60,          // TTL: 1 hour (optional)
});

// Connect to Redis (auto-connects on first operation)
await cache.connect();

// Same API as MemoryCache
await cache.set('key', { data: 'value' });
const value = await cache.get('key');

// Disconnect when done
await cache.disconnect();
```

**MemoryCache**
- ✅ No dependencies
- ✅ Fastest (in-process)
- ❌ Lost on restart
- ❌ Not shared between instances
**RedisCache**
- ✅ Persistent across restarts
- ✅ Shared between instances
- ❌ Requires Redis server
- ❌ Network latency (~1ms)
# CachedLLM: LLM Response Caching
CachedLLM wraps any LLMAdapter and transparently caches responses. It implements the LLMAdapter interface, so you can use it as a drop-in replacement anywhere you use an LLM.
```ts
import { createOrka } from 'orkajs';
import { OpenAIAdapter } from 'orkajs/adapters/openai';
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedLLM } from 'orkajs/cache/llm';

// 1. Create your LLM adapter
const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-mini'
});

// 2. Create a cache store
const cache = new MemoryCache({ maxSize: 500, ttlMs: 1000 * 60 * 30 });

// 3. Wrap with CachedLLM
const cachedLLM = new CachedLLM(llm, cache, {
  ttlMs: 1000 * 60 * 60  // Override TTL: cache for 1 hour
});

// First call → hits the API (~800ms)
const result1 = await cachedLLM.generate('What is TypeScript?');
console.log(result1.content); // "TypeScript is a typed superset..."

// Second call → instant from cache (~0ms)
const result2 = await cachedLLM.generate('What is TypeScript?');
console.log(result2.content); // Same response, from cache

// Use in Orka config: transparent replacement
const orka = createOrka({
  llm: cachedLLM,  // ← Drop-in replacement
  vectorDB: /* ... */
});

// All orka.ask(), orka.generate(), etc. now use caching
const answer = await orka.ask({
  question: 'What is TypeScript?',
  knowledge: 'docs'
});
```

**Cache Key Generation**
The cache key is generated from the prompt AND the options (temperature, maxTokens, systemPrompt, etc.). This means:
- Same prompt + same options = cache hit ✅
- Same prompt + different temperature = cache miss ❌ (different key)
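One plausible way to derive such a key is to hash the prompt together with a canonicalized options object. This is an illustrative sketch using Node's built-in `crypto` module; the library's actual key format may differ:

```ts
import { createHash } from 'node:crypto';

// Sketch: derive a cache key from the prompt plus the options object.
// Sorting the option keys makes the key independent of property order.
function cacheKey(prompt: string, options: Record<string, unknown> = {}): string {
  const sorted = Object.keys(options)
    .sort()
    .map((k) => [k, options[k]]);
  const payload = JSON.stringify([prompt, sorted]);
  return createHash('sha256').update(payload).digest('hex');
}

const a = cacheKey('What is TypeScript?', { temperature: 0 });
const b = cacheKey('What is TypeScript?', { temperature: 0 });
const c = cacheKey('What is TypeScript?', { temperature: 0.7 });

console.log(a === b); // true  → same key, cache hit
console.log(a === c); // false → different key, cache miss
```

Hashing keeps keys short and uniform regardless of prompt length, which matters when keys are stored in Redis.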
# CachedEmbeddings: Embedding Caching
Embedding the same text multiple times is wasteful. CachedEmbeddings caches embedding vectors per text input and batches only uncached texts to the API.
```ts
import { OpenAIAdapter } from 'orkajs/adapters/openai';
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedEmbeddings } from 'orkajs/cache/embeddings';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! });
const cache = new MemoryCache({ maxSize: 10000 });

const cachedEmbed = new CachedEmbeddings(llm, cache, {
  ttlMs: 1000 * 60 * 60 * 24  // Cache embeddings for 24 hours
});

// First call → computes all 3 embeddings via API
const embeddings1 = await cachedEmbed.embed([
  'Hello world',
  'TypeScript is great',
  'Orka AI framework'
]);

// Second call → all 3 from cache (0 API calls)
const embeddings2 = await cachedEmbed.embed([
  'Hello world',
  'TypeScript is great',
  'Orka AI framework'
]);

// Mixed call → only "New text here" hits the API, others come from cache
const embeddings3 = await cachedEmbed.embed([
  'Hello world',       // ← from cache
  'New text here',     // ← API call
  'Orka AI framework'  // ← from cache
]);
```

**Smart Batching**
CachedEmbeddings checks the cache for each text individually, then batches only the uncached texts into a single API call. This means if you embed 100 texts and 80 are cached, only 20 are sent to the API in one batch.
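The check-then-batch flow can be sketched as follows. This is a simplified illustration, not the library's code: `embedBatch` stands in for the underlying embeddings API call, and a plain `Map` stands in for the cache store:

```ts
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Sketch of smart batching: look up each text in the cache, send only
// the misses to the API in one batch, then merge the new vectors back
// into their original positions.
async function embedWithCache(
  texts: string[],
  cache: Map<string, number[]>,
  embedBatch: EmbedFn
): Promise<number[][]> {
  const results: number[][] = new Array(texts.length);
  const missIndexes: number[] = [];

  texts.forEach((text, i) => {
    const hit = cache.get(text);
    if (hit) results[i] = hit;
    else missIndexes.push(i);
  });

  if (missIndexes.length > 0) {
    // One batched API call for all uncached texts
    const vectors = await embedBatch(missIndexes.map((i) => texts[i]));
    missIndexes.forEach((i, j) => {
      cache.set(texts[i], vectors[j]);
      results[i] = vectors[j];
    });
  }
  return results;
}

// Fake embedder that counts API calls
let apiCalls = 0;
const fakeEmbed: EmbedFn = async (texts) => {
  apiCalls++;
  return texts.map((t) => [t.length]); // dummy 1-dimensional "vector"
};

const cache = new Map<string, number[]>();
await embedWithCache(['a', 'bb', 'ccc'], cache, fakeEmbed); // 1 API call (3 misses)
await embedWithCache(['a', 'new', 'ccc'], cache, fakeEmbed); // 1 more, only for 'new'
console.log(apiCalls); // 2
```

The key property is that the number of API calls depends on the number of *batches with misses*, not the number of texts.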
# Production Setup with Redis
For production, use RedisCache to share the cache across multiple server instances and persist data across restarts.
```ts
import { createOrka, OpenAIAdapter } from 'orkajs/core';
import { RedisCache } from 'orkajs/cache/redis';
import { CachedLLM } from 'orkajs/cache/llm';
import { CachedEmbeddings } from 'orkajs/cache/embeddings';

// Shared Redis cache
const redisCache = new RedisCache({
  url: process.env.REDIS_URL!,  // e.g. 'redis://redis:6379'
  keyPrefix: 'orka:prod:',
  ttlMs: 1000 * 60 * 60 * 4     // 4 hours
});

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-mini'
});

// Cache both LLM responses and embeddings
const cachedLLM = new CachedLLM(llm, redisCache);
const cachedEmbed = new CachedEmbeddings(llm, redisCache);

const orka = createOrka({
  llm: cachedLLM,
  vectorDB: /* ... */
});

// All operations now use Redis-backed caching
const result = await orka.ask({
  question: 'How do I deploy my app?',
  knowledge: 'documentation'
});

// Monitor cache performance
const stats = redisCache.getStats();
console.log(`Cache hit rate: ${(stats.hitRate * 100).toFixed(1)}%`);

// Cleanup on shutdown
process.on('SIGTERM', async () => {
  await redisCache.disconnect();
});
```

# Custom Cache Store
Implement the CacheStore interface to create your own cache backend (e.g., DynamoDB, Memcached, SQLite).
```ts
import type { CacheStore } from 'orkajs/cache';

class DynamoDBCache implements CacheStore {
  readonly name = 'dynamodb-cache';

  async get<T>(key: string): Promise<T | undefined> {
    // Your DynamoDB get logic
  }

  async set<T>(key: string, value: T, ttlMs?: number): Promise<void> {
    // Your DynamoDB put logic
  }

  async delete(key: string): Promise<boolean> {
    // Your DynamoDB delete logic
  }

  async clear(): Promise<void> {
    // Your DynamoDB scan + delete logic
  }

  async has(key: string): Promise<boolean> {
    // Your DynamoDB exists check
  }
}

// Use with CachedLLM
const cache = new DynamoDBCache();
const cachedLLM = new CachedLLM(llm, cache);
```

# Best Practices
1. Set Appropriate TTLs
Short TTL (5-30 min) for dynamic content. Long TTL (hours/days) for stable knowledge bases. No TTL for immutable data like embeddings.
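As a sketch, these guidelines might translate into TTL choices like the following. The values are illustrative, using the `MemoryCache` options shown earlier:

```ts
import { MemoryCache } from 'orkajs/cache/memory';

// Dynamic content: short TTL so answers refresh quickly
const dynamicCache = new MemoryCache({ ttlMs: 1000 * 60 * 10 });      // 10 minutes

// Stable knowledge base: long TTL
const stableCache = new MemoryCache({ ttlMs: 1000 * 60 * 60 * 24 });  // 24 hours

// Immutable data (e.g. embeddings): no TTL, entries never expire,
// bounded only by maxSize eviction
const embeddingCache = new MemoryCache({ maxSize: 10000 });
```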
2. Monitor Hit Rates
Use getStats() to monitor cache effectiveness. A hit rate below 50% may indicate the cache is too small or TTL too short.
3. Don't Cache Non-Deterministic Calls
If you use a high temperature (>0.8) for creative generation, a cache hit returns the same output on every call, which defeats the purpose of sampling varied responses. Consider disabling the cache for creative tasks.
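One simple pattern is to route by temperature: send high-temperature calls to the raw adapter and everything else through the cached wrapper. A minimal sketch, assuming both clients expose the same `generate` interface (the `Generator` type and stand-in objects below are illustrative, not part of the library):

```ts
interface Generator {
  generate(prompt: string, options?: { temperature?: number }): Promise<string>;
}

// Sketch: route high-temperature (creative) calls past the cache.
function pickClient(raw: Generator, cached: Generator, temperature = 0): Generator {
  // Above 0.8 we expect varied outputs on each call, so skip the cache.
  return temperature > 0.8 ? raw : cached;
}

// Stand-ins for the raw adapter and its cached wrapper
const raw: Generator = { generate: async () => 'fresh' };
const cached: Generator = { generate: async () => 'cached' };

console.log(pickClient(raw, cached, 0.9) === raw);    // true → bypasses cache
console.log(pickClient(raw, cached, 0.2) === cached); // true → uses cache
```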
4. Use Namespaces
Use different namespaces or key prefixes for different environments (dev, staging, prod) to avoid cache pollution.
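The isolation this buys can be seen in a minimal sketch of key-prefix namespacing over a single shared store, independent of the library:

```ts
// Sketch of key-prefix isolation over one shared store: each
// environment reads and writes under its own prefix, so dev and
// prod entries can never collide even in the same backend.
function namespaced(store: Map<string, string>, ns: string) {
  return {
    set: (key: string, value: string) => store.set(`${ns}:${key}`, value),
    get: (key: string) => store.get(`${ns}:${key}`),
  };
}

const shared = new Map<string, string>();
const dev = namespaced(shared, 'orka:dev');
const prod = namespaced(shared, 'orka:prod');

dev.set('answer', 'dev-value');
prod.set('answer', 'prod-value');

console.log(dev.get('answer'));  // 'dev-value'
console.log(prod.get('answer')); // 'prod-value'
```

This is exactly what the `namespace` option on MemoryCache and the `keyPrefix` option on RedisCache provide.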
# Tree-shaking Imports
```ts
// ✅ Import only what you need
import { MemoryCache } from 'orkajs/cache/memory';
import { CachedLLM } from 'orkajs/cache/llm';

// ✅ Or import from the index
import { MemoryCache, RedisCache, CachedLLM, CachedEmbeddings } from 'orkajs/cache';
```