Real-Time Voice Agents
Build voice AI agents
Create conversational voice agents with a complete audio pipeline: speech-to-text transcription, LLM processing with tool execution, and text-to-speech synthesis. The RealtimeAgent manages the entire flow, letting you build voice assistants, phone bots, interactive voice applications, and accessibility features in just a few lines of code.
Installation
Install the realtime package. It includes adapters for OpenAI Whisper (STT) and OpenAI TTS, plus support for custom adapters.
```shell
# ─────────────────────────────────────────────────────────────────────────────
# Install @orka-js/realtime for voice-enabled agents
# ─────────────────────────────────────────────────────────────────────────────

npm install @orka-js/realtime

# Or with pnpm
pnpm add @orka-js/realtime

# The package includes:
# - RealtimeAgent: Main voice agent class
# - OpenAISTTAdapter: Speech-to-text with Whisper
# - OpenAITTSAdapter: Text-to-speech with OpenAI voices
# - Types for audio events and streaming
```
Key Features
# Speech-to-Text
Transcribe audio with OpenAI Whisper or a custom STT adapter
# Text-to-Speech
Generate natural-sounding speech with OpenAI TTS
# Streaming Pipeline
Audio → transcript → LLM → audio, in real time
# Tool Execution
Agents can call tools during voice conversations
# Multiple Formats
Supports WAV, MP3, OGG, and WebM audio
# WebSocket Ready
Integrates with WebRTC for live conversations
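When clients upload audio without a declared format, you can guess the MIME type from the file's magic bytes before handing it to the pipeline. This helper is a hypothetical sketch, not part of `@orka-js/realtime`; it covers the container formats listed above:

```typescript
// Hypothetical helper (not part of @orka-js/realtime): sniff a MIME type
// from well-known magic bytes. Returns null when the format is unrecognized.
function sniffAudioMime(buf: Buffer): string | null {
  // WAV: "RIFF" at offset 0 and "WAVE" at offset 8
  if (buf.length >= 12 && buf.toString('ascii', 0, 4) === 'RIFF' && buf.toString('ascii', 8, 12) === 'WAVE') {
    return 'audio/wav';
  }
  // Ogg container: "OggS"
  if (buf.length >= 4 && buf.toString('ascii', 0, 4) === 'OggS') return 'audio/ogg';
  // MP3: either an ID3 tag or an MPEG frame sync (11 set bits)
  if (buf.length >= 3 && buf.toString('ascii', 0, 3) === 'ID3') return 'audio/mpeg';
  if (buf.length >= 2 && buf[0] === 0xff && (buf[1] & 0xe0) === 0xe0) return 'audio/mpeg';
  // WebM/Matroska: EBML header 1A 45 DF A3
  if (buf.length >= 4 && buf[0] === 0x1a && buf[1] === 0x45 && buf[2] === 0xdf && buf[3] === 0xa3) {
    return 'audio/webm';
  }
  // FLAC: "fLaC"
  if (buf.length >= 4 && buf.toString('ascii', 0, 4) === 'fLaC') return 'audio/flac';
  return null;
}
```

Use the result as the MIME argument you pass alongside the audio buffer, falling back to an error response when it returns `null`.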
Basic Usage
Create a voice agent that processes audio input and returns both text and audio responses.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Basic Voice Agent
//
// This example shows how to:
// 1. Set up the STT (Speech-to-Text) adapter for transcription
// 2. Set up the TTS (Text-to-Speech) adapter for voice output
// 3. Create a RealtimeAgent that processes audio through the full pipeline
// 4. Process an audio file and get both text and audio responses
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent, OpenAISTTAdapter, OpenAITTSAdapter } from '@orka-js/realtime';
import { OpenAIAdapter } from '@orka-js/openai';
import fs from 'fs';

// ─── Step 1: Create the LLM adapter ──────────────────────────────────────────
// This is the "brain" of your voice agent - it processes the transcribed text

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // Use a fast model for real-time conversations
});

// ─── Step 2: Create the Speech-to-Text adapter ───────────────────────────────
// Converts audio input to text using OpenAI Whisper

const stt = new OpenAISTTAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'whisper-1', // OpenAI's Whisper model
  language: 'en', // Optional: specify language for better accuracy
});

// ─── Step 3: Create the Text-to-Speech adapter ───────────────────────────────
// Converts the agent's text response back to audio

const tts = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'alloy', // Voice options: alloy, echo, fable, onyx, nova, shimmer
  model: 'tts-1', // Use 'tts-1-hd' for higher quality (slower)
  speed: 1.0, // Speed: 0.25 to 4.0
  responseFormat: 'mp3', // Output format: mp3, opus, aac, flac
});

// ─── Step 4: Define tools the agent can use ──────────────────────────────────
// Voice agents can execute tools just like text agents

const searchProducts = {
  name: 'searchProducts',
  description: 'Search for products in the catalog',
  parameters: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
    },
    required: ['query'],
  },
  execute: async ({ query }: { query: string }) => {
    // Your search logic here
    return { products: ['Product A', 'Product B'] };
  },
};

// ─── Step 5: Create the RealtimeAgent ────────────────────────────────────────

const agent = new RealtimeAgent({
  config: {
    // The agent's purpose - guides its responses
    goal: 'Help customers find products and answer questions about our store',

    // System prompt for voice-specific behavior
    systemPrompt: `You are a friendly voice assistant for our online store.
Keep responses concise and conversational - remember this is spoken, not written.
Avoid long lists or complex formatting. Be warm and helpful.`,

    // Enable TTS output (set to false for text-only responses)
    tts: true,
  },

  // The adapters
  llm,
  stt,
  tts,

  // Tools the agent can use during conversations
  tools: [searchProducts],
});

// ─── Step 6: Process audio input ─────────────────────────────────────────────

// Read audio file (supports WAV, MP3, OGG, WebM, M4A, FLAC)
const audioBuffer = fs.readFileSync('./customer-question.wav');

// Process the audio through the full pipeline:
// Audio → Transcription → LLM → Response → Audio
const result = await agent.process(audioBuffer, 'audio/wav');

// ─── Step 7: Use the results ─────────────────────────────────────────────────

console.log('User said:', result.transcript);
// "Do you have any running shoes on sale?"

console.log('Agent response:', result.response);
// "Yes! We have several running shoes on sale right now. Let me search for you..."

console.log('Tool calls:', result.toolCalls);
// [{ name: 'searchProducts', args: { query: 'running shoes sale' }, result: {...} }]

// Save the audio response
if (result.audio) {
  fs.writeFileSync('./response.mp3', result.audio);
  console.log('Audio response saved to response.mp3');
}
```
Streaming Events
Process audio in real time with streaming events for immediate feedback.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Streaming Voice Processing
//
// For real-time applications, use streaming to get immediate feedback:
// - Transcript appears as soon as STT finishes
// - LLM tokens stream as they're generated
// - Audio chunks stream as TTS generates them
//
// This enables low-latency voice interactions where the user hears
// the response before the full generation is complete.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';

const agent = new RealtimeAgent({ /* ... config ... */ });

// Process audio with streaming events
for await (const event of agent.processStream(audioBuffer, 'audio/wav')) {
  switch (event.type) {
    // ─── Transcription Event ─────────────────────────────────────────────────
    // Fired when STT completes transcription
    case 'transcript':
      console.log('🎤 User said:', event.text);
      // Display the transcript in your UI immediately
      updateUI({ userMessage: event.text });
      break;

    // ─── LLM Token Event ─────────────────────────────────────────────────────
    // Fired for each token as the LLM generates the response
    case 'token':
      process.stdout.write(event.token);
      // Accumulate tokens for display
      appendToResponse(event.token);
      break;

    // ─── Tool Call Events ────────────────────────────────────────────────────
    // Fired when the agent calls a tool
    case 'tool_start':
      console.log(`🔧 Calling tool: ${event.name}`);
      showToolIndicator(event.name);
      break;

    case 'tool_end':
      console.log(`✅ Tool result: ${JSON.stringify(event.result)}`);
      hideToolIndicator();
      break;

    // ─── Audio Chunk Event ───────────────────────────────────────────────────
    // Fired as TTS generates audio chunks
    // Send these to the client for immediate playback
    case 'audio_chunk':
      // Send chunk to WebSocket for real-time playback
      websocket.send(event.chunk);
      // Or accumulate for later
      audioChunks.push(event.chunk);
      break;

    // ─── Completion Event ────────────────────────────────────────────────────
    // Fired when the entire pipeline completes
    case 'done':
      console.log('\n✅ Complete!');
      console.log('Full response:', event.response);
      console.log('Total duration:', event.duration, 'ms');
      break;

    // ─── Error Event ─────────────────────────────────────────────────────────
    case 'error':
      console.error('❌ Error:', event.error);
      showErrorMessage(event.error.message);
      break;
  }
}
```
WebSocket Integration
Build live voice conversations with WebSocket for bidirectional audio streaming.
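The client side of this protocol is simple: stream raw audio as binary frames, then signal the end of the utterance with a JSON control message. A hedged sketch (the `end_audio` message type mirrors the server example below; the `sendUtterance` helper itself is illustrative, not a package API):

```typescript
// Minimal client-side sketch of the protocol used by the WebSocket server
// example below: binary frames carry audio, a JSON text frame ends the turn.
function sendUtterance(
  ws: { send(data: Buffer | string): void }, // any WebSocket-like object
  chunks: Buffer[]
): void {
  for (const chunk of chunks) {
    ws.send(chunk); // binary audio frames
  }
  ws.send(JSON.stringify({ type: 'end_audio' })); // signal end of audio
}
```

After sending, the client listens for `transcript`, `token`, and `done` JSON messages plus binary audio chunks to play back.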
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Integration for Live Voice Conversations
//
// This example shows how to build a real-time voice chat using WebSocket.
// The client sends audio chunks, and the server streams back audio responses.
// ─────────────────────────────────────────────────────────────────────────────

import { WebSocketServer } from 'ws';
import { RealtimeAgent } from '@orka-js/realtime';

// Create the voice agent
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  tools: [/* your tools */],
});

// Create WebSocket server
const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  console.log('Client connected');

  // Buffer to accumulate audio chunks from client
  let audioBuffer: Buffer[] = [];

  ws.on('message', async (data, isBinary) => {
    if (isBinary) {
      // ─── Receiving Audio from Client ───────────────────────────────────────
      // Client sends audio chunks as binary data
      audioBuffer.push(data as Buffer);
    } else {
      // ─── Control Messages ──────────────────────────────────────────────────
      const message = JSON.parse(data.toString());

      if (message.type === 'end_audio') {
        // Client finished sending audio, process it
        const fullAudio = Buffer.concat(audioBuffer);
        audioBuffer = []; // Reset buffer

        // Process with streaming
        for await (const event of agent.processStream(fullAudio, 'audio/wav')) {
          if (event.type === 'transcript') {
            // Send transcript to client
            ws.send(JSON.stringify({ type: 'transcript', text: event.text }));
          } else if (event.type === 'token') {
            // Send text tokens for display
            ws.send(JSON.stringify({ type: 'token', token: event.token }));
          } else if (event.type === 'audio_chunk') {
            // Send audio chunk as binary for immediate playback
            ws.send(event.chunk);
          } else if (event.type === 'done') {
            ws.send(JSON.stringify({ type: 'done' }));
          }
        }
      }
    }
  });

  ws.on('close', () => {
    console.log('Client disconnected');
  });
});

console.log('Voice WebSocket server running on ws://localhost:8080');
```
Voice Configuration
Customize the voice output with different voices, speeds, and formats.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Voice Configuration
//
// Customize the voice output for different use cases.
// ─────────────────────────────────────────────────────────────────────────────

import { OpenAITTSAdapter } from '@orka-js/realtime';

// ─── Available OpenAI Voices ─────────────────────────────────────────────────
// Each voice has a distinct personality:
//
// - alloy: Neutral, versatile (good default)
// - echo: Warm, conversational
// - fable: Expressive, storytelling
// - onyx: Deep, authoritative
// - nova: Friendly, upbeat
// - shimmer: Clear, professional

// ─── Professional Customer Service Voice ─────────────────────────────────────
const professionalTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'shimmer', // Clear and professional
  model: 'tts-1-hd', // High quality for important interactions
  speed: 1.0, // Normal speed
  responseFormat: 'mp3',
});

// ─── Friendly Assistant Voice ────────────────────────────────────────────────
const friendlyTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'nova', // Friendly and upbeat
  model: 'tts-1', // Standard quality for faster responses
  speed: 1.1, // Slightly faster for energetic feel
  responseFormat: 'opus', // Smaller file size for web
});

// ─── Audiobook / Storytelling Voice ──────────────────────────────────────────
const storytellerTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'fable', // Expressive for storytelling
  model: 'tts-1-hd', // High quality for long-form content
  speed: 0.9, // Slightly slower for clarity
  responseFormat: 'flac', // Lossless for archival
});

// ─── Use Different Voices for Different Agents ───────────────────────────────

const supportAgent = new RealtimeAgent({
  config: { goal: 'Customer support', tts: true },
  llm,
  stt,
  tts: professionalTTS, // Professional voice for support
});

const salesAgent = new RealtimeAgent({
  config: { goal: 'Sales assistance', tts: true },
  llm,
  stt,
  tts: friendlyTTS, // Friendly voice for sales
});
```
Conversation Memory
Maintain context across multiple voice interactions for natural conversations.
```typescript
// ─────────────────────────────────────────────────────────────────────────────
// Conversation Memory for Voice Agents
//
// Maintain context across multiple voice interactions for natural conversations.
// The agent remembers what was said earlier in the conversation.
// ─────────────────────────────────────────────────────────────────────────────

import { RealtimeAgent } from '@orka-js/realtime';
import { Memory, SessionMemory } from '@orka-js/memory-store';

// Create conversation memory
const memory = new Memory({
  maxMessages: 20, // Keep last 20 messages
  strategy: 'sliding', // Sliding window strategy
});

// Create agent with memory
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural multi-turn voice conversations',
    tts: true,
  },
  llm,
  stt,
  tts,
  memory, // Attach memory to the agent
});

// ─── Multi-Turn Conversation ─────────────────────────────────────────────────

// Turn 1: User asks about products
const turn1 = await agent.process(audioBuffer1, 'audio/wav');
console.log('User:', turn1.transcript);
// "What laptops do you have?"
console.log('Agent:', turn1.response);
// "We have several laptops! We have the MacBook Pro, Dell XPS, and ThinkPad..."

// Turn 2: User follows up (agent remembers context)
const turn2 = await agent.process(audioBuffer2, 'audio/wav');
console.log('User:', turn2.transcript);
// "Which one is best for programming?"
console.log('Agent:', turn2.response);
// "For programming, I'd recommend the MacBook Pro or ThinkPad.
//  The MacBook Pro has excellent build quality and the M3 chip is very fast..."

// Turn 3: User asks for comparison (agent still has context)
const turn3 = await agent.process(audioBuffer3, 'audio/wav');
console.log('User:', turn3.transcript);
// "How do they compare in price?"
console.log('Agent:', turn3.response);
// "The MacBook Pro starts at $1,999 while the ThinkPad starts at $1,299..."

// ─── Session Management ──────────────────────────────────────────────────────

// Clear memory to start a new conversation
memory.clear();

// Or use session IDs for multiple concurrent conversations
const sessionMemory = new SessionMemory({
  sessionId: 'user_123',
  ttlMinutes: 30, // Session expires after 30 minutes of inactivity
});

const agentWithSession = new RealtimeAgent({
  config: { goal: 'Multi-user voice support', tts: true },
  llm,
  stt,
  tts,
  memory: sessionMemory,
});
```
Voice Agent Tips
- Keep responses short and conversational: users are listening, not reading
- Use streaming in real-time applications to minimize perceived latency
- Pick the right voice for your use case: professional for support, friendly for sales
- Use tts-1 for speed, and tts-1-hd for quality in important interactions
- Always handle errors gracefully: voice users can't see error messages
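Since voice users can't read an error message, the last tip usually means speaking a fallback instead of failing silently. A hedged sketch of one way to do that (the `processWithFallback` wrapper is illustrative, not a package API; it only assumes the `process(buffer, mime)` method shown in the examples above):

```typescript
// Hypothetical wrapper: run the voice pipeline and return a spoken-friendly
// fallback phrase instead of propagating the error to the user.
async function processWithFallback(
  agent: { process(buf: Buffer, mime: string): Promise<{ response: string }> },
  audio: Buffer,
  mime: string,
  fallback = "Sorry, I didn't catch that. Could you say that again?"
): Promise<{ response: string; failed: boolean }> {
  try {
    const result = await agent.process(audio, mime);
    return { response: result.response, failed: false };
  } catch (err) {
    // Log the real error for operators; speak only the friendly fallback
    console.error('Voice pipeline error:', err);
    return { response: fallback, failed: true };
  }
}
```

You would then run the `response` string through your TTS adapter either way, so the user always hears something.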