OrkaJS

Real-time Voice Agents

Build voice-enabled AI agents

Create conversational voice agents with a complete audio pipeline: speech-to-text transcription, LLM processing with tool execution, and text-to-speech synthesis. The RealtimeAgent handles the entire flow, allowing you to build voice assistants, phone bots, interactive voice applications, and accessibility features with just a few lines of code.

# Installation

Install the realtime package. It includes adapters for OpenAI Whisper (STT) and OpenAI TTS, with support for custom adapters.

# ─────────────────────────────────────────────────────────────────────────────
# Install @orka-js/realtime for voice-enabled agents
# ─────────────────────────────────────────────────────────────────────────────
 
npm install @orka-js/realtime
 
# Or with pnpm
pnpm add @orka-js/realtime
 
# The package includes:
# - RealtimeAgent: Main voice agent class
# - OpenAISTTAdapter: Speech-to-text with Whisper
# - OpenAITTSAdapter: Text-to-speech with OpenAI voices
# - Types for audio events and streaming

# Key Features

  • Speech-to-Text: transcribe audio with OpenAI Whisper or a custom STT adapter
  • Text-to-Speech: generate natural voice output with OpenAI TTS
  • Streaming pipeline: real-time audio → transcript → LLM → audio
  • Tool execution: agents can call tools during voice conversations
  • Multiple formats: support for WAV, MP3, OGG, WebM audio
  • WebSocket ready: integrate with WebSockets or WebRTC for live conversations
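Because `agent.process()` takes a MIME type alongside the audio buffer (as in the examples below), a small helper can map file extensions to the supported formats. This helper and its name are illustrative, not part of `@orka-js/realtime`:

```typescript
// Hypothetical helper (not part of @orka-js/realtime): map a file
// extension to the MIME type passed to agent.process().
const MIME_BY_EXT: Record<string, string> = {
  wav: 'audio/wav',
  mp3: 'audio/mpeg',
  ogg: 'audio/ogg',
  webm: 'audio/webm',
  m4a: 'audio/mp4',
  flac: 'audio/flac',
};

function mimeForFile(path: string): string {
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  const mime = MIME_BY_EXT[ext];
  if (!mime) throw new Error(`Unsupported audio format: .${ext}`);
  return mime;
}
```

With this in place, `agent.process(buffer, mimeForFile('./clip.webm'))` works for any of the supported formats without hard-coding the MIME type.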

# Basic Usage

Create a voice agent that processes audio input and returns both text and audio responses.

realtime-agent.ts
// ─────────────────────────────────────────────────────────────────────────────
// Basic Voice Agent
//
// This example shows how to:
// 1. Set up the STT (Speech-to-Text) adapter for transcription
// 2. Set up the TTS (Text-to-Speech) adapter for voice output
// 3. Create a RealtimeAgent that processes audio through the full pipeline
// 4. Process an audio file and get both text and audio responses
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent, OpenAISTTAdapter, OpenAITTSAdapter } from '@orka-js/realtime';
import { OpenAIAdapter } from '@orka-js/openai';
import fs from 'fs';
 
// ─── Step 1: Create the LLM adapter ──────────────────────────────────────────
// This is the "brain" of your voice agent - it processes the transcribed text
 
const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // Use a fast model for real-time conversations
});

// ─── Step 2: Create the Speech-to-Text adapter ───────────────────────────────
// Converts audio input to text using OpenAI Whisper

const stt = new OpenAISTTAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'whisper-1', // OpenAI's Whisper model
  language: 'en', // Optional: specify language for better accuracy
});

// ─── Step 3: Create the Text-to-Speech adapter ───────────────────────────────
// Converts the agent's text response back to audio

const tts = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'alloy', // Voice options: alloy, echo, fable, onyx, nova, shimmer
  model: 'tts-1', // Use 'tts-1-hd' for higher quality (slower)
  speed: 1.0, // Speed: 0.25 to 4.0
  responseFormat: 'mp3', // Output format: mp3, opus, aac, flac
});
 
// ─── Step 4: Define tools the agent can use ──────────────────────────────────
// Voice agents can execute tools just like text agents
 
const searchProducts = {
  name: 'searchProducts',
  description: 'Search for products in the catalog',
  parameters: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
    },
    required: ['query'],
  },
  execute: async ({ query }: { query: string }) => {
    // Your search logic here
    return { products: ['Product A', 'Product B'] };
  },
};
 
// ─── Step 5: Create the RealtimeAgent ────────────────────────────────────────
 
const agent = new RealtimeAgent({
  config: {
    // The agent's purpose - guides its responses
    goal: 'Help customers find products and answer questions about our store',

    // System prompt for voice-specific behavior
    systemPrompt: `You are a friendly voice assistant for our online store.
Keep responses concise and conversational - remember this is spoken, not written.
Avoid long lists or complex formatting. Be warm and helpful.`,

    // Enable TTS output (set to false for text-only responses)
    tts: true,
  },

  // The adapters
  llm,
  stt,
  tts,

  // Tools the agent can use during conversations
  tools: [searchProducts],
});
 
// ─── Step 6: Process audio input ─────────────────────────────────────────────
 
// Read audio file (supports WAV, MP3, OGG, WebM, M4A, FLAC)
const audioBuffer = fs.readFileSync('./customer-question.wav');
 
// Process the audio through the full pipeline:
// Audio → Transcription → LLM → Response → Audio
const result = await agent.process(audioBuffer, 'audio/wav');
 
// ─── Step 7: Use the results ─────────────────────────────────────────────────
 
console.log('User said:', result.transcript);
// "Do you have any running shoes on sale?"
 
console.log('Agent response:', result.response);
// "Yes! We have several running shoes on sale right now. Let me search for you..."
 
console.log('Tool calls:', result.toolCalls);
// [{ name: 'searchProducts', args: { query: 'running shoes sale' }, result: {...} }]
 
// Save the audio response
if (result.audio) {
  fs.writeFileSync('./response.mp3', result.audio);
  console.log('Audio response saved to response.mp3');
}

# Streaming Events

Process audio in real-time with streaming events for immediate feedback.

streaming.ts
// ─────────────────────────────────────────────────────────────────────────────
// Streaming Voice Processing
//
// For real-time applications, use streaming to get immediate feedback:
// - Transcript appears as soon as STT finishes
// - LLM tokens stream as they're generated
// - Audio chunks stream as TTS generates them
//
// This enables low-latency voice interactions where the user hears
// the response before the full generation is complete.
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent } from '@orka-js/realtime';
 
const agent = new RealtimeAgent({ /* ... config ... */ });
 
// Process audio with streaming events.
// Note: audioBuffer is your input audio; updateUI, appendToResponse,
// showToolIndicator, hideToolIndicator, showErrorMessage, websocket, and
// audioChunks are placeholders for your own application code.
for await (const event of agent.processStream(audioBuffer, 'audio/wav')) {
  switch (event.type) {
    // ─── Transcription Event ─────────────────────────────────────────────────
    // Fired when STT completes transcription
    case 'transcript':
      console.log('🎤 User said:', event.text);
      // Display the transcript in your UI immediately
      updateUI({ userMessage: event.text });
      break;

    // ─── LLM Token Event ─────────────────────────────────────────────────────
    // Fired for each token as the LLM generates the response
    case 'token':
      process.stdout.write(event.token);
      // Accumulate tokens for display
      appendToResponse(event.token);
      break;

    // ─── Tool Call Events ────────────────────────────────────────────────────
    // Fired when the agent calls a tool
    case 'tool_start':
      console.log(`🔧 Calling tool: ${event.name}`);
      showToolIndicator(event.name);
      break;

    case 'tool_end':
      console.log(`✅ Tool result: ${JSON.stringify(event.result)}`);
      hideToolIndicator();
      break;

    // ─── Audio Chunk Event ───────────────────────────────────────────────────
    // Fired as TTS generates audio chunks
    // Send these to the client for immediate playback
    case 'audio_chunk':
      // Send chunk to WebSocket for real-time playback
      websocket.send(event.chunk);

      // Or accumulate for later
      audioChunks.push(event.chunk);
      break;

    // ─── Completion Event ────────────────────────────────────────────────────
    // Fired when the entire pipeline completes
    case 'done':
      console.log('\n✅ Complete!');
      console.log('Full response:', event.response);
      console.log('Total duration:', event.duration, 'ms');
      break;

    // ─── Error Event ─────────────────────────────────────────────────────────
    case 'error':
      console.error('❌ Error:', event.error);
      showErrorMessage(event.error.message);
      break;
  }
}
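When you don't need to react to each event individually, the stream can be folded into a single result. A minimal sketch, assuming the event shapes shown in the example above (the `accumulate` helper itself is an assumption, not a library API):

```typescript
// Fold a sequence of pipeline events into the final transcript,
// response text, and concatenated audio. Event shapes assumed to
// match the streaming example above.
type PipelineEvent =
  | { type: 'transcript'; text: string }
  | { type: 'token'; token: string }
  | { type: 'audio_chunk'; chunk: Buffer }
  | { type: 'done'; response: string; duration: number }
  | { type: 'error'; error: Error };

interface Accumulated {
  transcript: string;
  response: string;
  audio: Buffer;
}

function accumulate(events: PipelineEvent[]): Accumulated {
  let transcript = '';
  let response = '';
  const chunks: Buffer[] = [];
  for (const event of events) {
    switch (event.type) {
      case 'transcript': transcript = event.text; break;
      case 'token': response += event.token; break;      // tokens concatenate into the full reply
      case 'audio_chunk': chunks.push(event.chunk); break;
      case 'error': throw event.error;                   // surface pipeline failures to the caller
    }
  }
  return { transcript, response, audio: Buffer.concat(chunks) };
}
```

In practice you would collect events from `agent.processStream(...)` into an array (or fold them inline) and then treat the result like the return value of `agent.process(...)`.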

# WebSocket Integration

Build live voice conversations with WebSocket for bidirectional audio streaming.

websocket-server.ts
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Integration for Live Voice Conversations
//
// This example shows how to build a real-time voice chat using WebSocket.
// The client sends audio chunks, and the server streams back audio responses.
// ─────────────────────────────────────────────────────────────────────────────
 
import { WebSocketServer } from 'ws';
import { RealtimeAgent } from '@orka-js/realtime';
 
// Create the voice agent
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural voice conversations',
    tts: true,
  },
  llm, stt, tts,
  tools: [/* your tools */],
});

// Create WebSocket server
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  // Buffer to accumulate audio chunks from client
  let audioBuffer: Buffer[] = [];

  ws.on('message', async (data, isBinary) => {
    if (isBinary) {
      // ─── Receiving Audio from Client ───────────────────────────────────────
      // Client sends audio chunks as binary data
      audioBuffer.push(data as Buffer);
    } else {
      // ─── Control Messages ──────────────────────────────────────────────────
      const message = JSON.parse(data.toString());

      if (message.type === 'end_audio') {
        // Client finished sending audio, process it
        const fullAudio = Buffer.concat(audioBuffer);
        audioBuffer = []; // Reset buffer

        // Process with streaming
        for await (const event of agent.processStream(fullAudio, 'audio/wav')) {
          if (event.type === 'transcript') {
            // Send transcript to client
            ws.send(JSON.stringify({ type: 'transcript', text: event.text }));
          } else if (event.type === 'token') {
            // Send text tokens for display
            ws.send(JSON.stringify({ type: 'token', token: event.token }));
          } else if (event.type === 'audio_chunk') {
            // Send audio chunk as binary for immediate playback
            ws.send(event.chunk);
          } else if (event.type === 'done') {
            ws.send(JSON.stringify({ type: 'done' }));
          }
        }
      }
    }
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});
 
console.log('Voice WebSocket server running on ws://localhost:8080');
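On the client side, each incoming frame is either binary audio or a JSON control message, so the first step is demultiplexing. A minimal sketch of that, assuming the message shapes the server above sends (`parseServerFrame` is a hypothetical helper, not part of any client SDK):

```typescript
// Split incoming WebSocket frames into audio chunks vs. JSON control
// messages, mirroring the server protocol above: binary frames carry
// audio, text frames carry { type: 'transcript' | 'token' | 'done', ... }.
type ServerMessage =
  | { kind: 'audio'; chunk: Buffer }
  | { kind: 'control'; payload: { type: string; [key: string]: unknown } };

function parseServerFrame(data: Buffer, isBinary: boolean): ServerMessage {
  if (isBinary) {
    // Binary frame: a TTS audio chunk for immediate playback
    return { kind: 'audio', chunk: data };
  }
  // Text frame: a JSON control message
  return { kind: 'control', payload: JSON.parse(data.toString('utf8')) };
}
```

A client's `message` handler would call this on every frame, queue `audio` chunks into its playback buffer, and route `control` payloads (transcript, tokens, done) to the UI.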

# Voice Configuration

Customize the voice output with different voices, speeds, and formats.

voices.ts
// ─────────────────────────────────────────────────────────────────────────────
// Voice Configuration
//
// Customize the voice output for different use cases.
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent, OpenAITTSAdapter } from '@orka-js/realtime';
 
// ─── Available OpenAI Voices ─────────────────────────────────────────────────
// Each voice has a distinct personality:
//
// - alloy: Neutral, versatile (good default)
// - echo: Warm, conversational
// - fable: Expressive, storytelling
// - onyx: Deep, authoritative
// - nova: Friendly, upbeat
// - shimmer: Clear, professional
 
// ─── Professional Customer Service Voice ─────────────────────────────────────
const professionalTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'shimmer', // Clear and professional
  model: 'tts-1-hd', // High quality for important interactions
  speed: 1.0, // Normal speed
  responseFormat: 'mp3',
});

// ─── Friendly Assistant Voice ────────────────────────────────────────────────
const friendlyTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'nova', // Friendly and upbeat
  model: 'tts-1', // Standard quality for faster responses
  speed: 1.1, // Slightly faster for energetic feel
  responseFormat: 'opus', // Smaller file size for web
});

// ─── Audiobook / Storytelling Voice ──────────────────────────────────────────
const storytellerTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'fable', // Expressive for storytelling
  model: 'tts-1-hd', // High quality for long-form content
  speed: 0.9, // Slightly slower for clarity
  responseFormat: 'flac', // Lossless for archival
});

// ─── Use Different Voices for Different Agents ───────────────────────────────
// (llm and stt are the adapters created in the Basic Usage example)

const supportAgent = new RealtimeAgent({
  config: { goal: 'Customer support', tts: true },
  llm, stt,
  tts: professionalTTS, // Professional voice for support
});

const salesAgent = new RealtimeAgent({
  config: { goal: 'Sales assistance', tts: true },
  llm, stt,
  tts: friendlyTTS, // Friendly voice for sales
});
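Keeping these presets in one place makes it easy to swap voices per agent. A sketch of a use-case-to-options mapping; the option names mirror the `OpenAITTSAdapter` examples above, but `ttsOptionsFor` and the mapping itself are assumptions for illustration:

```typescript
// Map a use case to TTS settings. Option names follow the
// OpenAITTSAdapter examples above; this table is illustrative.
type UseCase = 'support' | 'sales' | 'storytelling';

interface TTSOptions {
  voice: string;
  model: 'tts-1' | 'tts-1-hd';
  speed: number;
  responseFormat: string;
}

function ttsOptionsFor(useCase: UseCase): TTSOptions {
  switch (useCase) {
    case 'support':
      // Clear and professional, high quality
      return { voice: 'shimmer', model: 'tts-1-hd', speed: 1.0, responseFormat: 'mp3' };
    case 'sales':
      // Friendly and energetic, optimized for the web
      return { voice: 'nova', model: 'tts-1', speed: 1.1, responseFormat: 'opus' };
    case 'storytelling':
      // Expressive and unhurried, lossless output
      return { voice: 'fable', model: 'tts-1-hd', speed: 0.9, responseFormat: 'flac' };
  }
}
```

An adapter is then built as `new OpenAITTSAdapter({ apiKey, ...ttsOptionsFor('support') })`, which keeps voice policy out of the agent wiring.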

# Conversation Memory

Maintain context across multiple voice interactions for natural conversations.

memory.ts
// ─────────────────────────────────────────────────────────────────────────────
// Conversation Memory for Voice Agents
//
// Maintain context across multiple voice interactions for natural conversations.
// The agent remembers what was said earlier in the conversation.
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent } from '@orka-js/realtime';
import { Memory, SessionMemory } from '@orka-js/memory-store';
 
// Create conversation memory
const memory = new Memory({
  maxMessages: 20, // Keep last 20 messages
  strategy: 'sliding', // Sliding window strategy
});

// Create agent with memory
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural multi-turn voice conversations',
    tts: true,
  },
  llm, stt, tts,
  memory, // Attach memory to the agent
});
 
// ─── Multi-Turn Conversation ─────────────────────────────────────────────────
 
// Turn 1: User asks about products
const turn1 = await agent.process(audioBuffer1, 'audio/wav');
console.log('User:', turn1.transcript);
// "What laptops do you have?"
console.log('Agent:', turn1.response);
// "We have several laptops! We have the MacBook Pro, Dell XPS, and ThinkPad..."
 
// Turn 2: User follows up (agent remembers context)
const turn2 = await agent.process(audioBuffer2, 'audio/wav');
console.log('User:', turn2.transcript);
// "Which one is best for programming?"
console.log('Agent:', turn2.response);
// "For programming, I'd recommend the MacBook Pro or ThinkPad.
// The MacBook Pro has excellent build quality and the M3 chip is very fast..."
 
// Turn 3: User asks for comparison (agent still has context)
const turn3 = await agent.process(audioBuffer3, 'audio/wav');
console.log('User:', turn3.transcript);
// "How do they compare in price?"
console.log('Agent:', turn3.response);
// "The MacBook Pro starts at $1,999 while the ThinkPad starts at $1,299..."
 
// ─── Session Management ──────────────────────────────────────────────────────
 
// Clear memory to start a new conversation
memory.clear();
 
// Or use session IDs for multiple concurrent conversations
const sessionMemory = new SessionMemory({
  sessionId: 'user_123',
  ttlMinutes: 30, // Session expires after 30 minutes of inactivity
});

const agentWithSession = new RealtimeAgent({
  config: { goal: 'Multi-user voice support', tts: true },
  llm, stt, tts,
  memory: sessionMemory,
});
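Conceptually, a sliding-window strategy just keeps the most recent `maxMessages` entries and discards the oldest. A minimal sketch of that behavior; this is an illustration, not the `@orka-js/memory-store` implementation:

```typescript
// Minimal sliding-window memory: retains only the newest N messages,
// illustrating what strategy: 'sliding' with maxMessages does.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

class SlidingMemory {
  private messages: Message[] = [];

  constructor(private maxMessages: number) {}

  add(message: Message): void {
    this.messages.push(message);
    // Drop the oldest messages once the window overflows
    if (this.messages.length > this.maxMessages) {
      this.messages = this.messages.slice(-this.maxMessages);
    }
  }

  history(): Message[] {
    return [...this.messages]; // copy, so callers can't mutate the window
  }

  clear(): void {
    this.messages = [];
  }
}
```

The trade-off is that anything older than the window is forgotten, which is why the example above sizes it (20 messages) to roughly cover one conversation.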

# Voice Agent Tips

  • Keep responses short and conversational - users are listening, not reading
  • Use streaming for real-time applications to minimize perceived latency
  • Choose the right voice for your use case - professional for support, friendly for sales
  • Use tts-1 for speed, tts-1-hd for quality in important interactions
  • Always handle errors gracefully - voice users can't see error messages
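For the last tip, one pattern is to map raw errors to short, speakable fallback lines before sending them through TTS, so the user hears an apology instead of silence. A sketch; `speakableError` and its wording are assumptions, not part of the library:

```typescript
// Convert an error into a short, speakable message instead of
// surfacing a stack trace to a listening user.
function speakableError(err: unknown): string {
  const message = err instanceof Error ? err.message : '';
  if (/rate limit/i.test(message)) {
    return "I'm getting a lot of requests right now. Please try again in a moment.";
  }
  if (/network|timeout/i.test(message)) {
    return "I'm having trouble connecting. Could you repeat that?";
  }
  // Generic fallback for anything unexpected
  return "Sorry, something went wrong on my end. Let's try that again.";
}
```

A handler would catch failures from `agent.process(...)`, synthesize `speakableError(err)` with the TTS adapter, and play that back, while still logging the original error server-side.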