Real-time Voice Agents
Build voice-enabled AI agents
Create conversational voice agents with a complete audio pipeline: speech-to-text transcription, LLM processing with tool execution, and text-to-speech synthesis. The RealtimeAgent handles the entire flow, allowing you to build voice assistants, phone bots, interactive voice applications, and accessibility features with just a few lines of code.
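The pipeline described above can be sketched as three composable stages. The interfaces below are illustrative stand-ins for this sketch only, not the package's actual types:

```typescript
// Conceptual sketch of the voice pipeline: audio in → text → LLM reply → audio out.
// These interfaces are illustrative stand-ins, not @orka-js/realtime's real types.
interface STT { transcribe(audio: Uint8Array): Promise<string>; }
interface LLM { respond(text: string): Promise<string>; }
interface TTS { synthesize(text: string): Promise<Uint8Array>; }

async function runPipeline(audio: Uint8Array, stt: STT, llm: LLM, tts: TTS) {
  const transcript = await stt.transcribe(audio); // speech-to-text
  const response = await llm.respond(transcript); // LLM processing
  const speech = await tts.synthesize(response);  // text-to-speech
  return { transcript, response, speech };
}
```

RealtimeAgent wires these stages together for you, adding tool execution and streaming on top.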
Installation
Install the realtime package. It includes adapters for OpenAI Whisper (STT) and OpenAI TTS, with support for custom adapters.
```bash
# ─────────────────────────────────────────────────────────────────────────────
# Install @orka-js/realtime for voice-enabled agents
# ─────────────────────────────────────────────────────────────────────────────

npm install @orka-js/realtime

# Or with pnpm
pnpm add @orka-js/realtime

# The package includes:
# - RealtimeAgent: Main voice agent class
# - OpenAISTTAdapter: Speech-to-text with Whisper
# - OpenAITTSAdapter: Text-to-speech with OpenAI voices
# - Types for audio events and streaming
```

Key Features
- **Speech-to-Text**: Transcribe audio with OpenAI Whisper or a custom STT adapter
- **Text-to-Speech**: Generate natural voice output with OpenAI TTS
- **Streaming Pipeline**: Real-time audio → transcript → LLM → audio
- **Tool Execution**: Agents can call tools during voice conversations
- **Multiple Formats**: Support for WAV, MP3, OGG, and WebM audio
- **WebSocket Ready**: Integrate with WebSockets or WebRTC for live conversations
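Because the processing APIs take raw bytes plus a MIME type, you will often need to map a file extension to the right type string. A small hypothetical helper for the formats listed above (the mapping is an illustrative assumption, not part of the package):

```typescript
// Hypothetical helper: map an audio file extension to the MIME type you would
// pass alongside the raw bytes, e.g. agent.process(buffer, mimeFromPath(path)).
// This mapping is an assumption for illustration, not part of @orka-js/realtime.
const AUDIO_MIME: Record<string, string> = {
  wav: 'audio/wav',
  mp3: 'audio/mpeg',
  ogg: 'audio/ogg',
  webm: 'audio/webm',
  m4a: 'audio/mp4',
  flac: 'audio/flac',
};

function mimeFromPath(path: string): string {
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  const mime = AUDIO_MIME[ext];
  if (!mime) throw new Error(`Unsupported audio format: .${ext}`);
  return mime;
}
```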
Basic Usage
Create a voice agent that processes audio input and returns both text and audio responses.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Basic Voice Agent
//
// This example shows how to:
// 1. Set up the STT (Speech-to-Text) adapter for transcription
// 2. Set up the TTS (Text-to-Speech) adapter for voice output
// 3. Create a RealtimeAgent that processes audio through the full pipeline
// 4. Process an audio file and get both text and audio responses
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent, OpenAISTTAdapter, OpenAITTSAdapter } from '@orka-js/realtime';
import { OpenAIAdapter } from '@orka-js/openai';
import fs from 'fs';

// ─── Step 1: Create the LLM adapter ──────────────────────────────────────────
// This is the "brain" of your voice agent - it processes the transcribed text

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // Use a fast model for real-time conversations
});

// ─── Step 2: Create the Speech-to-Text adapter ───────────────────────────────
// Converts audio input to text using OpenAI Whisper

const stt = new OpenAISTTAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'whisper-1', // OpenAI's Whisper model
  language: 'en', // Optional: specify language for better accuracy
});

// ─── Step 3: Create the Text-to-Speech adapter ───────────────────────────────
// Converts the agent's text response back to audio

const tts = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'alloy', // Voice options: alloy, echo, fable, onyx, nova, shimmer
  model: 'tts-1', // Use 'tts-1-hd' for higher quality (slower)
  speed: 1.0, // Speed: 0.25 to 4.0
  responseFormat: 'mp3', // Output format: mp3, opus, aac, flac
});

// ─── Step 4: Define tools the agent can use ──────────────────────────────────
// Voice agents can execute tools just like text agents

const searchProducts = {
  name: 'searchProducts',
  description: 'Search for products in the catalog',
  parameters: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
    },
    required: ['query'],
  },
  execute: async ({ query }: { query: string }) => {
    // Your search logic here
    return { products: ['Product A', 'Product B'] };
  },
};

// ─── Step 5: Create the RealtimeAgent ────────────────────────────────────────

const agent = new RealtimeAgent({
  config: {
    // The agent's purpose - guides its responses
    goal: 'Help customers find products and answer questions about our store',

    // System prompt for voice-specific behavior
    systemPrompt: `You are a friendly voice assistant for our online store.
Keep responses concise and conversational - remember this is spoken, not written.
Avoid long lists or complex formatting. Be warm and helpful.`,

    // Enable TTS output (set to false for text-only responses)
    tts: true,
  },

  // The adapters
  llm,
  stt,
  tts,

  // Tools the agent can use during conversations
  tools: [searchProducts],
});

// ─── Step 6: Process audio input ─────────────────────────────────────────────

// Read audio file (supports WAV, MP3, OGG, WebM, M4A, FLAC)
const audioBuffer = fs.readFileSync('./customer-question.wav');

// Process the audio through the full pipeline:
// Audio → Transcription → LLM → Response → Audio
const result = await agent.process(audioBuffer, 'audio/wav');

// ─── Step 7: Use the results ─────────────────────────────────────────────────

console.log('User said:', result.transcript);
// "Do you have any running shoes on sale?"

console.log('Agent response:', result.response);
// "Yes! We have several running shoes on sale right now. Let me search for you..."

console.log('Tool calls:', result.toolCalls);
// [{ name: 'searchProducts', args: { query: 'running shoes sale' }, result: {...} }]

// Save the audio response
if (result.audio) {
  fs.writeFileSync('./response.mp3', result.audio);
  console.log('Audio response saved to response.mp3');
}
```

Streaming Events
Process audio in real-time with streaming events for immediate feedback.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Streaming Voice Processing
//
// For real-time applications, use streaming to get immediate feedback:
// - Transcript appears as soon as STT finishes
// - LLM tokens stream as they're generated
// - Audio chunks stream as TTS generates them
//
// This enables low-latency voice interactions where the user hears
// the response before the full generation is complete.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';

const agent = new RealtimeAgent({ /* ... config ... */ });

// Process audio with streaming events
for await (const event of agent.processStream(audioBuffer, 'audio/wav')) {
  switch (event.type) {
    // ─── Transcription Event ─────────────────────────────────────────────────
    // Fired when STT completes transcription
    case 'transcript':
      console.log('🎤 User said:', event.text);
      // Display the transcript in your UI immediately
      updateUI({ userMessage: event.text });
      break;

    // ─── LLM Token Event ─────────────────────────────────────────────────────
    // Fired for each token as the LLM generates the response
    case 'token':
      process.stdout.write(event.token);
      // Accumulate tokens for display
      appendToResponse(event.token);
      break;

    // ─── Tool Call Events ────────────────────────────────────────────────────
    // Fired when the agent calls a tool
    case 'tool_start':
      console.log(`🔧 Calling tool: ${event.name}`);
      showToolIndicator(event.name);
      break;

    case 'tool_end':
      console.log(`✅ Tool result: ${JSON.stringify(event.result)}`);
      hideToolIndicator();
      break;

    // ─── Audio Chunk Event ───────────────────────────────────────────────────
    // Fired as TTS generates audio chunks
    // Send these to the client for immediate playback
    case 'audio_chunk':
      // Send chunk to WebSocket for real-time playback
      websocket.send(event.chunk);
      // Or accumulate for later
      audioChunks.push(event.chunk);
      break;

    // ─── Completion Event ────────────────────────────────────────────────────
    // Fired when the entire pipeline completes
    case 'done':
      console.log('\n✅ Complete!');
      console.log('Full response:', event.response);
      console.log('Total duration:', event.duration, 'ms');
      break;

    // ─── Error Event ─────────────────────────────────────────────────────────
    case 'error':
      console.error('❌ Error:', event.error);
      showErrorMessage(event.error.message);
      break;
  }
}
```

WebSocket Integration
Build live voice conversations with WebSocket for bidirectional audio streaming.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Integration for Live Voice Conversations
//
// This example shows how to build a real-time voice chat using WebSocket.
// The client sends audio chunks, and the server streams back audio responses.
// ─────────────────────────────────────────────────────────────────────────────

import { WebSocketServer } from 'ws';
import { RealtimeAgent } from '@orka-js/realtime';

// Create the voice agent
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  tools: [/* your tools */],
});

// Create WebSocket server
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  // Buffer to accumulate audio chunks from client
  let audioBuffer: Buffer[] = [];

  ws.on('message', async (data, isBinary) => {
    if (isBinary) {
      // ─── Receiving Audio from Client ───────────────────────────────────────
      // Client sends audio chunks as binary data
      audioBuffer.push(data as Buffer);
    } else {
      // ─── Control Messages ──────────────────────────────────────────────────
      const message = JSON.parse(data.toString());

      if (message.type === 'end_audio') {
        // Client finished sending audio, process it
        const fullAudio = Buffer.concat(audioBuffer);
        audioBuffer = []; // Reset buffer

        // Process with streaming
        for await (const event of agent.processStream(fullAudio, 'audio/wav')) {
          if (event.type === 'transcript') {
            // Send transcript to client
            ws.send(JSON.stringify({ type: 'transcript', text: event.text }));
          } else if (event.type === 'token') {
            // Send text tokens for display
            ws.send(JSON.stringify({ type: 'token', token: event.token }));
          } else if (event.type === 'audio_chunk') {
            // Send audio chunk as binary for immediate playback
            ws.send(event.chunk);
          } else if (event.type === 'done') {
            ws.send(JSON.stringify({ type: 'done' }));
          }
        }
      }
    }
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});

console.log('Voice WebSocket server running on ws://localhost:8080');
```

Voice Configuration
Customize the voice output with different voices, speeds, and formats.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Voice Configuration
//
// Customize the voice output for different use cases.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent, OpenAITTSAdapter } from '@orka-js/realtime';

// ─── Available OpenAI Voices ─────────────────────────────────────────────────
// Each voice has a distinct personality:
//
// - alloy: Neutral, versatile (good default)
// - echo: Warm, conversational
// - fable: Expressive, storytelling
// - onyx: Deep, authoritative
// - nova: Friendly, upbeat
// - shimmer: Clear, professional

// ─── Professional Customer Service Voice ─────────────────────────────────────
const professionalTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'shimmer', // Clear and professional
  model: 'tts-1-hd', // High quality for important interactions
  speed: 1.0, // Normal speed
  responseFormat: 'mp3',
});

// ─── Friendly Assistant Voice ────────────────────────────────────────────────
const friendlyTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'nova', // Friendly and upbeat
  model: 'tts-1', // Standard quality for faster responses
  speed: 1.1, // Slightly faster for energetic feel
  responseFormat: 'opus', // Smaller file size for web
});

// ─── Audiobook / Storytelling Voice ──────────────────────────────────────────
const storytellerTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'fable', // Expressive for storytelling
  model: 'tts-1-hd', // High quality for long-form content
  speed: 0.9, // Slightly slower for clarity
  responseFormat: 'flac', // Lossless for archival
});

// ─── Use Different Voices for Different Agents ───────────────────────────────

const supportAgent = new RealtimeAgent({
  config: { goal: 'Customer support', tts: true },
  llm,
  stt,
  tts: professionalTTS, // Professional voice for support
});

const salesAgent = new RealtimeAgent({
  config: { goal: 'Sales assistance', tts: true },
  llm,
  stt,
  tts: friendlyTTS, // Friendly voice for sales
});
```

Conversation Memory
Maintain context across multiple voice interactions for natural conversations.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Conversation Memory for Voice Agents
//
// Maintain context across multiple voice interactions for natural conversations.
// The agent remembers what was said earlier in the conversation.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';
import { Memory, SessionMemory } from '@orka-js/memory-store';

// Create conversation memory
const memory = new Memory({
  maxMessages: 20, // Keep last 20 messages
  strategy: 'sliding', // Sliding window strategy
});

// Create agent with memory
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural multi-turn voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  memory, // Attach memory to the agent
});

// ─── Multi-Turn Conversation ─────────────────────────────────────────────────

// Turn 1: User asks about products
const turn1 = await agent.process(audioBuffer1, 'audio/wav');
console.log('User:', turn1.transcript);
// "What laptops do you have?"
console.log('Agent:', turn1.response);
// "We have several laptops! We have the MacBook Pro, Dell XPS, and ThinkPad..."

// Turn 2: User follows up (agent remembers context)
const turn2 = await agent.process(audioBuffer2, 'audio/wav');
console.log('User:', turn2.transcript);
// "Which one is best for programming?"
console.log('Agent:', turn2.response);
// "For programming, I'd recommend the MacBook Pro or ThinkPad.
//  The MacBook Pro has excellent build quality and the M3 chip is very fast..."

// Turn 3: User asks for comparison (agent still has context)
const turn3 = await agent.process(audioBuffer3, 'audio/wav');
console.log('User:', turn3.transcript);
// "How do they compare in price?"
console.log('Agent:', turn3.response);
// "The MacBook Pro starts at $1,999 while the ThinkPad starts at $1,299..."

// ─── Session Management ──────────────────────────────────────────────────────

// Clear memory to start a new conversation
memory.clear();

// Or use session IDs for multiple concurrent conversations
const sessionMemory = new SessionMemory({
  sessionId: 'user_123',
  ttlMinutes: 30, // Session expires after 30 minutes of inactivity
});

const agentWithSession = new RealtimeAgent({
  config: { goal: 'Multi-user voice support', tts: true },
  llm,
  stt,
  tts,
  memory: sessionMemory,
});
```

Voice Agent Tips
- Keep responses short and conversational - users are listening, not reading
- Use streaming for real-time applications to minimize perceived latency
- Choose the right voice for your use case - professional for support, friendly for sales
- Use tts-1 for speed, tts-1-hd for quality in important interactions
- Always handle errors gracefully - voice users can't see error messages
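The last tip can be sketched as a thin wrapper that turns pipeline failures into a spoken fallback. Here `processAudio` stands in for any RealtimeAgent call, and the fallback message is an illustrative assumption:

```typescript
// Sketch: wrap a voice pipeline call so failures become a spoken fallback
// instead of an error the listening user never sees. `processAudio` stands in
// for agent.process(); the fallback message is an illustrative assumption.
type VoiceResult = { transcript: string; response: string };

async function processWithFallback(
  processAudio: (audio: Uint8Array) => Promise<VoiceResult>,
  audio: Uint8Array,
  fallback = "Sorry, I didn't catch that. Could you say it again?",
): Promise<VoiceResult> {
  try {
    return await processAudio(audio);
  } catch (err) {
    console.error('Voice pipeline failed:', err);
    // Return the fallback as the response so TTS can still speak something.
    return { transcript: '', response: fallback };
  }
}
```

The same idea extends to per-stage fallbacks, for example retrying STT once before giving up, or falling back from tts-1-hd to tts-1 under load.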