Real-time Voice Agents
Build voice-enabled AI agents
Create conversational voice agents with a complete audio pipeline: speech-to-text transcription, LLM processing with tool execution, and text-to-speech synthesis. The RealtimeAgent handles the entire flow, allowing you to build voice assistants, phone bots, interactive voice applications, and accessibility features with just a few lines of code.
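The pipeline described above can be sketched as three composable stages. The interfaces below are illustrative stand-ins for this sketch only, not the package's actual types:

```typescript
// Conceptual sketch of the voice pipeline: audio in → text → LLM reply → audio out.
// These interfaces are illustrative stand-ins, not @orka-js/realtime's real types.
interface STT { transcribe(audio: Uint8Array): Promise<string>; }
interface LLM { respond(text: string): Promise<string>; }
interface TTS { synthesize(text: string): Promise<Uint8Array>; }

async function runPipeline(audio: Uint8Array, stt: STT, llm: LLM, tts: TTS) {
  const transcript = await stt.transcribe(audio); // speech-to-text
  const response = await llm.respond(transcript); // LLM processing
  const speech = await tts.synthesize(response);  // text-to-speech
  return { transcript, response, speech };
}
```

RealtimeAgent wires these stages together for you, adding tool execution and streaming on top.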
Installation
Install the realtime package. It includes adapters for OpenAI Whisper (STT) and OpenAI TTS, with support for custom adapters.
```bash
# ─────────────────────────────────────────────────────────────────────────────
# Install @orka-js/realtime for voice-enabled agents
# ─────────────────────────────────────────────────────────────────────────────

npm install @orka-js/realtime

# Or with pnpm
pnpm add @orka-js/realtime

# The package includes:
# - RealtimeAgent: Main voice agent class
# - OpenAISTTAdapter: Speech-to-text with Whisper
# - OpenAITTSAdapter: Text-to-speech with OpenAI voices
# - Types for audio events and streaming
```

Key Features
- **Speech-to-Text**: Transcribe audio with OpenAI Whisper or a custom STT adapter
- **Text-to-Speech**: Generate natural voice output with OpenAI TTS
- **Streaming Pipeline**: Real-time audio → transcript → LLM → audio
- **Tool Execution**: Agents can call tools during voice conversations
- **Multiple Formats**: Support for WAV, MP3, OGG, and WebM audio
- **WebSocket Ready**: Integrate with WebSockets or WebRTC for live conversations
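Because the processing APIs take raw bytes plus a MIME type, you will often need to map a file extension to the right type string. A small hypothetical helper for the formats listed above (the mapping is an illustrative assumption, not part of the package):

```typescript
// Hypothetical helper: map an audio file extension to the MIME type you would
// pass alongside the raw bytes, e.g. agent.process(buffer, mimeFromPath(path)).
// This mapping is an assumption for illustration, not part of @orka-js/realtime.
const AUDIO_MIME: Record<string, string> = {
  wav: 'audio/wav',
  mp3: 'audio/mpeg',
  ogg: 'audio/ogg',
  webm: 'audio/webm',
  m4a: 'audio/mp4',
  flac: 'audio/flac',
};

function mimeFromPath(path: string): string {
  const ext = path.split('.').pop()?.toLowerCase() ?? '';
  const mime = AUDIO_MIME[ext];
  if (!mime) throw new Error(`Unsupported audio format: .${ext}`);
  return mime;
}
```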
Basic Usage
Create a voice agent that processes audio input and returns both text and audio responses.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Basic Voice Agent
//
// This example shows how to:
// 1. Set up the STT (Speech-to-Text) adapter for transcription
// 2. Set up the TTS (Text-to-Speech) adapter for voice output
// 3. Create a RealtimeAgent that processes audio through the full pipeline
// 4. Process an audio file and get both text and audio responses
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent, OpenAISTTAdapter, OpenAITTSAdapter } from '@orka-js/realtime';
import { OpenAIAdapter } from '@orka-js/openai';
import fs from 'fs';

// ─── Step 1: Create the LLM adapter ──────────────────────────────────────────
// This is the "brain" of your voice agent - it processes the transcribed text

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // Use a fast model for real-time conversations
});

// ─── Step 2: Create the Speech-to-Text adapter ───────────────────────────────
// Converts audio input to text using OpenAI Whisper

const stt = new OpenAISTTAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'whisper-1', // OpenAI's Whisper model
  language: 'en', // Optional: specify language for better accuracy
});

// ─── Step 3: Create the Text-to-Speech adapter ───────────────────────────────
// Converts the agent's text response back to audio

const tts = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'alloy', // Voice options: alloy, echo, fable, onyx, nova, shimmer
  model: 'tts-1', // Use 'tts-1-hd' for higher quality (slower)
  speed: 1.0, // Speed: 0.25 to 4.0
  responseFormat: 'mp3', // Output format: mp3, opus, aac, flac
});

// ─── Step 4: Define tools the agent can use ──────────────────────────────────
// Voice agents can execute tools just like text agents

const searchProducts = {
  name: 'searchProducts',
  description: 'Search for products in the catalog',
  parameters: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
    },
    required: ['query'],
  },
  execute: async ({ query }: { query: string }) => {
    // Your search logic here
    return { products: ['Product A', 'Product B'] };
  },
};

// ─── Step 5: Create the RealtimeAgent ────────────────────────────────────────

const agent = new RealtimeAgent({
  config: {
    // The agent's purpose - guides its responses
    goal: 'Help customers find products and answer questions about our store',

    // System prompt for voice-specific behavior
    systemPrompt: `You are a friendly voice assistant for our online store.
Keep responses concise and conversational - remember this is spoken, not written.
Avoid long lists or complex formatting. Be warm and helpful.`,

    // Enable TTS output (set to false for text-only responses)
    tts: true,
  },

  // The adapters
  llm,
  stt,
  tts,

  // Tools the agent can use during conversations
  tools: [searchProducts],
});

// ─── Step 6: Process audio input ─────────────────────────────────────────────

// Read audio file (supports WAV, MP3, OGG, WebM, M4A, FLAC)
const audioBuffer = fs.readFileSync('./customer-question.wav');

// Process the audio through the full pipeline:
// Audio → Transcription → LLM → Response → Audio
const result = await agent.process(audioBuffer, 'audio/wav');

// ─── Step 7: Use the results ─────────────────────────────────────────────────

console.log('User said:', result.transcript);
// "Do you have any running shoes on sale?"

console.log('Agent response:', result.response);
// "Yes! We have several running shoes on sale right now. Let me search for you..."

console.log('Tool calls:', result.toolCalls);
// [{ name: 'searchProducts', args: { query: 'running shoes sale' }, result: {...} }]

// Save the audio response
if (result.audio) {
  fs.writeFileSync('./response.mp3', result.audio);
  console.log('Audio response saved to response.mp3');
}
```

Streaming Events
Process audio in real-time with streaming events for immediate feedback.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Streaming Voice Processing
//
// For real-time applications, use streaming to get immediate feedback:
// - Transcript appears as soon as STT finishes
// - LLM tokens stream as they're generated
// - Audio chunks stream as TTS generates them
//
// This enables low-latency voice interactions where the user hears
// the response before the full generation is complete.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';

const agent = new RealtimeAgent({ /* ... config ... */ });

// Process audio with streaming events
for await (const event of agent.processStream(audioBuffer, 'audio/wav')) {
  switch (event.type) {
    // ─── Transcription Event ─────────────────────────────────────────────────
    // Fired when STT completes transcription
    case 'transcript':
      console.log('🎤 User said:', event.text);
      // Display the transcript in your UI immediately
      updateUI({ userMessage: event.text });
      break;

    // ─── LLM Token Event ─────────────────────────────────────────────────────
    // Fired for each token as the LLM generates the response
    case 'token':
      process.stdout.write(event.token);
      // Accumulate tokens for display
      appendToResponse(event.token);
      break;

    // ─── Tool Call Events ────────────────────────────────────────────────────
    // Fired when the agent calls a tool
    case 'tool_start':
      console.log(`🔧 Calling tool: ${event.name}`);
      showToolIndicator(event.name);
      break;

    case 'tool_end':
      console.log(`✅ Tool result: ${JSON.stringify(event.result)}`);
      hideToolIndicator();
      break;

    // ─── Audio Chunk Event ───────────────────────────────────────────────────
    // Fired as TTS generates audio chunks
    // Send these to the client for immediate playback
    case 'audio_chunk':
      // Send chunk to WebSocket for real-time playback
      websocket.send(event.chunk);
      // Or accumulate for later
      audioChunks.push(event.chunk);
      break;

    // ─── Completion Event ────────────────────────────────────────────────────
    // Fired when the entire pipeline completes
    case 'done':
      console.log('\n✅ Complete!');
      console.log('Full response:', event.response);
      console.log('Total duration:', event.duration, 'ms');
      break;

    // ─── Error Event ─────────────────────────────────────────────────────────
    case 'error':
      console.error('❌ Error:', event.error);
      showErrorMessage(event.error.message);
      break;
  }
}
```

WebSocket Integration
Build live voice conversations with WebSocket for bidirectional audio streaming.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Integration for Live Voice Conversations
//
// This example shows how to build a real-time voice chat using WebSocket.
// The client sends audio chunks, and the server streams back audio responses.
// ─────────────────────────────────────────────────────────────────────────────

import { WebSocketServer } from 'ws';
import { RealtimeAgent } from '@orka-js/realtime';

// Create the voice agent
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  tools: [/* your tools */],
});

// Create WebSocket server
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  // Buffer to accumulate audio chunks from client
  let audioBuffer: Buffer[] = [];

  ws.on('message', async (data, isBinary) => {
    if (isBinary) {
      // ─── Receiving Audio from Client ───────────────────────────────────────
      // Client sends audio chunks as binary data
      audioBuffer.push(data as Buffer);
    } else {
      // ─── Control Messages ──────────────────────────────────────────────────
      const message = JSON.parse(data.toString());

      if (message.type === 'end_audio') {
        // Client finished sending audio, process it
        const fullAudio = Buffer.concat(audioBuffer);
        audioBuffer = []; // Reset buffer

        // Process with streaming
        for await (const event of agent.processStream(fullAudio, 'audio/wav')) {
          if (event.type === 'transcript') {
            // Send transcript to client
            ws.send(JSON.stringify({ type: 'transcript', text: event.text }));
          } else if (event.type === 'token') {
            // Send text tokens for display
            ws.send(JSON.stringify({ type: 'token', token: event.token }));
          } else if (event.type === 'audio_chunk') {
            // Send audio chunk as binary for immediate playback
            ws.send(event.chunk);
          } else if (event.type === 'done') {
            ws.send(JSON.stringify({ type: 'done' }));
          }
        }
      }
    }
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});

console.log('Voice WebSocket server running on ws://localhost:8080');
```

Voice Configuration
Customize the voice output with different voices, speeds, and formats.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Voice Configuration
//
// Customize the voice output for different use cases.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent, OpenAITTSAdapter } from '@orka-js/realtime';

// ─── Available OpenAI Voices ─────────────────────────────────────────────────
// Each voice has a distinct personality:
//
// - alloy: Neutral, versatile (good default)
// - echo: Warm, conversational
// - fable: Expressive, storytelling
// - onyx: Deep, authoritative
// - nova: Friendly, upbeat
// - shimmer: Clear, professional

// ─── Professional Customer Service Voice ─────────────────────────────────────
const professionalTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'shimmer', // Clear and professional
  model: 'tts-1-hd', // High quality for important interactions
  speed: 1.0, // Normal speed
  responseFormat: 'mp3',
});

// ─── Friendly Assistant Voice ────────────────────────────────────────────────
const friendlyTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'nova', // Friendly and upbeat
  model: 'tts-1', // Standard quality for faster responses
  speed: 1.1, // Slightly faster for energetic feel
  responseFormat: 'opus', // Smaller file size for web
});

// ─── Audiobook / Storytelling Voice ──────────────────────────────────────────
const storytellerTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'fable', // Expressive for storytelling
  model: 'tts-1-hd', // High quality for long-form content
  speed: 0.9, // Slightly slower for clarity
  responseFormat: 'flac', // Lossless for archival
});

// ─── Use Different Voices for Different Agents ───────────────────────────────

const supportAgent = new RealtimeAgent({
  config: { goal: 'Customer support', tts: true },
  llm,
  stt,
  tts: professionalTTS, // Professional voice for support
});

const salesAgent = new RealtimeAgent({
  config: { goal: 'Sales assistance', tts: true },
  llm,
  stt,
  tts: friendlyTTS, // Friendly voice for sales
});
```

Conversation Memory
Maintain context across multiple voice interactions for natural conversations.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Conversation Memory for Voice Agents
//
// Maintain context across multiple voice interactions for natural conversations.
// The agent remembers what was said earlier in the conversation.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';
import { Memory, SessionMemory } from '@orka-js/memory-store';

// Create conversation memory
const memory = new Memory({
  maxMessages: 20, // Keep last 20 messages
  strategy: 'sliding', // Sliding window strategy
});

// Create agent with memory
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural multi-turn voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  memory, // Attach memory to the agent
});

// ─── Multi-Turn Conversation ─────────────────────────────────────────────────

// Turn 1: User asks about products
const turn1 = await agent.process(audioBuffer1, 'audio/wav');
console.log('User:', turn1.transcript);
// "What laptops do you have?"
console.log('Agent:', turn1.response);
// "We have several laptops! We have the MacBook Pro, Dell XPS, and ThinkPad..."

// Turn 2: User follows up (agent remembers context)
const turn2 = await agent.process(audioBuffer2, 'audio/wav');
console.log('User:', turn2.transcript);
// "Which one is best for programming?"
console.log('Agent:', turn2.response);
// "For programming, I'd recommend the MacBook Pro or ThinkPad.
//  The MacBook Pro has excellent build quality and the M3 chip is very fast..."

// Turn 3: User asks for comparison (agent still has context)
const turn3 = await agent.process(audioBuffer3, 'audio/wav');
console.log('User:', turn3.transcript);
// "How do they compare in price?"
console.log('Agent:', turn3.response);
// "The MacBook Pro starts at $1,999 while the ThinkPad starts at $1,299..."

// ─── Session Management ──────────────────────────────────────────────────────

// Clear memory to start a new conversation
memory.clear();

// Or use session IDs for multiple concurrent conversations
const sessionMemory = new SessionMemory({
  sessionId: 'user_123',
  ttlMinutes: 30, // Session expires after 30 minutes of inactivity
});

const agentWithSession = new RealtimeAgent({
  config: { goal: 'Multi-user voice support', tts: true },
  llm,
  stt,
  tts,
  memory: sessionMemory,
});
```

Voice Agent Tips
- Keep responses short and conversational - users are listening, not reading
- Use streaming for real-time applications to minimize perceived latency
- Choose the right voice for your use case - professional for support, friendly for sales
- Use tts-1 for speed, tts-1-hd for quality in important interactions
- Always handle errors gracefully - voice users can't see error messages
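The last tip can be sketched as a thin wrapper that turns pipeline failures into a spoken fallback. Here `processAudio` stands in for any RealtimeAgent call, and the fallback message is an illustrative assumption:

```typescript
// Sketch: wrap a voice pipeline call so failures become a spoken fallback
// instead of an error the listening user never sees. `processAudio` stands in
// for agent.process(); the fallback message is an illustrative assumption.
type VoiceResult = { transcript: string; response: string };

async function processWithFallback(
  processAudio: (audio: Uint8Array) => Promise<VoiceResult>,
  audio: Uint8Array,
  fallback = "Sorry, I didn't catch that. Could you say it again?",
): Promise<VoiceResult> {
  try {
    return await processAudio(audio);
  } catch (err) {
    console.error('Voice pipeline failed:', err);
    // Return the fallback as the response so TTS can still speak something.
    return { transcript: '', response: fallback };
  }
}
```

The same idea extends to per-stage fallbacks, for example retrying STT once before giving up, or falling back from tts-1-hd to tts-1 under load.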