OrkaJS

Real-Time Voice Agents

Build voice AI agents

Build conversational voice agents with a complete audio pipeline: speech-to-text transcription, LLM processing with tool execution, and text-to-speech synthesis. The RealtimeAgent manages the entire flow, letting you build voice assistants, phone bots, interactive voice applications, and accessibility features in just a few lines of code.

Installation

Install the realtime package. It includes adapters for OpenAI Whisper (STT) and OpenAI TTS, with support for custom adapters.

# ─────────────────────────────────────────────────────────────────────────────
# Install @orka-js/realtime for voice-enabled agents
# ─────────────────────────────────────────────────────────────────────────────
 
npm install @orka-js/realtime
 
# Or with pnpm
pnpm add @orka-js/realtime
 
# The package includes:
# - RealtimeAgent: Main voice agent class
# - OpenAISTTAdapter: Speech-to-text with Whisper
# - OpenAITTSAdapter: Text-to-speech with OpenAI voices
# - Types for audio events and streaming

Key Features

# Speech-to-Text

Transcribe audio with OpenAI Whisper or a custom STT adapter

# Text-to-Speech

Generate natural-sounding speech with OpenAI TTS

# Streaming Pipeline

Audio → transcript → LLM → audio in real time

# Tool Execution

Agents can call tools during voice conversations

# Multiple Formats

Supports WAV, MP3, OGG, and WebM audio

# WebSocket Ready

WebSocket integration for live conversations

Basic Usage

Create a voice agent that processes audio input and returns both text and audio responses.

realtime-agent.ts
// ─────────────────────────────────────────────────────────────────────────────
// Basic Voice Agent
//
// This example shows how to:
// 1. Set up the STT (Speech-to-Text) adapter for transcription
// 2. Set up the TTS (Text-to-Speech) adapter for voice output
// 3. Create a RealtimeAgent that processes audio through the full pipeline
// 4. Process an audio file and get both text and audio responses
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent, OpenAISTTAdapter, OpenAITTSAdapter } from '@orka-js/realtime';
import { OpenAIAdapter } from '@orka-js/openai';
import fs from 'fs';
 
// ─── Step 1: Create the LLM adapter ──────────────────────────────────────────
// This is the "brain" of your voice agent - it processes the transcribed text
 
const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // Use a fast model for real-time conversations
});

// ─── Step 2: Create the Speech-to-Text adapter ───────────────────────────────
// Converts audio input to text using OpenAI Whisper

const stt = new OpenAISTTAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'whisper-1', // OpenAI's Whisper model
  language: 'en', // Optional: specify language for better accuracy
});

// ─── Step 3: Create the Text-to-Speech adapter ───────────────────────────────
// Converts the agent's text response back to audio

const tts = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'alloy', // Voice options: alloy, echo, fable, onyx, nova, shimmer
  model: 'tts-1', // Use 'tts-1-hd' for higher quality (slower)
  speed: 1.0, // Speed: 0.25 to 4.0
  responseFormat: 'mp3', // Output format: mp3, opus, aac, flac
});

// ─── Step 4: Define tools the agent can use ──────────────────────────────────
// Voice agents can execute tools just like text agents

const searchProducts = {
  name: 'searchProducts',
  description: 'Search for products in the catalog',
  parameters: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
    },
    required: ['query'],
  },
  execute: async ({ query }: { query: string }) => {
    // Your search logic here
    return { products: ['Product A', 'Product B'] };
  },
};

// ─── Step 5: Create the RealtimeAgent ────────────────────────────────────────

const agent = new RealtimeAgent({
  config: {
    // The agent's purpose - guides its responses
    goal: 'Help customers find products and answer questions about our store',

    // System prompt for voice-specific behavior
    systemPrompt: `You are a friendly voice assistant for our online store.
Keep responses concise and conversational - remember this is spoken, not written.
Avoid long lists or complex formatting. Be warm and helpful.`,

    // Enable TTS output (set to false for text-only responses)
    tts: true,
  },

  // The adapters
  llm,
  stt,
  tts,

  // Tools the agent can use during conversations
  tools: [searchProducts],
});

// ─── Step 6: Process audio input ─────────────────────────────────────────────

// Read audio file (supports WAV, MP3, OGG, WebM, M4A, FLAC)
const audioBuffer = fs.readFileSync('./customer-question.wav');

// Process the audio through the full pipeline:
// Audio → Transcription → LLM → Response → Audio
const result = await agent.process(audioBuffer, 'audio/wav');

// ─── Step 7: Use the results ─────────────────────────────────────────────────

console.log('User said:', result.transcript);
// "Do you have any running shoes on sale?"

console.log('Agent response:', result.response);
// "Yes! We have several running shoes on sale right now. Let me search for you..."

console.log('Tool calls:', result.toolCalls);
// [{ name: 'searchProducts', args: { query: 'running shoes sale' }, result: {...} }]

// Save the audio response
if (result.audio) {
  fs.writeFileSync('./response.mp3', result.audio);
  console.log('Audio response saved to response.mp3');
}

# Streaming Events

Process audio in real time with streaming events for immediate feedback.

streaming.ts
// ─────────────────────────────────────────────────────────────────────────────
// Streaming Voice Processing
//
// For real-time applications, use streaming to get immediate feedback:
// - Transcript appears as soon as STT finishes
// - LLM tokens stream as they're generated
// - Audio chunks stream as TTS generates them
//
// This enables low-latency voice interactions where the user hears
// the response before the full generation is complete.
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent } from '@orka-js/realtime';

const agent = new RealtimeAgent({ /* ... config ... */ });

// audioBuffer is the input audio, read as in the basic example.
// updateUI, appendToResponse, showToolIndicator, hideToolIndicator,
// showErrorMessage, and websocket are placeholders for your app's UI
// and transport layer.
const audioChunks: Buffer[] = [];

// Process audio with streaming events
for await (const event of agent.processStream(audioBuffer, 'audio/wav')) {
  switch (event.type) {
    // ─── Transcription Event ─────────────────────────────────────────────────
    // Fired when STT completes transcription
    case 'transcript':
      console.log('🎤 User said:', event.text);
      // Display the transcript in your UI immediately
      updateUI({ userMessage: event.text });
      break;

    // ─── LLM Token Event ─────────────────────────────────────────────────────
    // Fired for each token as the LLM generates the response
    case 'token':
      process.stdout.write(event.token);
      // Accumulate tokens for display
      appendToResponse(event.token);
      break;

    // ─── Tool Call Events ────────────────────────────────────────────────────
    // Fired when the agent calls a tool
    case 'tool_start':
      console.log(`🔧 Calling tool: ${event.name}`);
      showToolIndicator(event.name);
      break;

    case 'tool_end':
      console.log(`✅ Tool result: ${JSON.stringify(event.result)}`);
      hideToolIndicator();
      break;

    // ─── Audio Chunk Event ───────────────────────────────────────────────────
    // Fired as TTS generates audio chunks
    // Send these to the client for immediate playback
    case 'audio_chunk':
      // Send chunk to WebSocket for real-time playback
      websocket.send(event.chunk);

      // Or accumulate for later
      audioChunks.push(event.chunk);
      break;

    // ─── Completion Event ────────────────────────────────────────────────────
    // Fired when the entire pipeline completes
    case 'done':
      console.log('\n✅ Complete!');
      console.log('Full response:', event.response);
      console.log('Total duration:', event.duration, 'ms');
      break;

    // ─── Error Event ─────────────────────────────────────────────────────────
    case 'error':
      console.error('❌ Error:', event.error);
      showErrorMessage(event.error.message);
      break;
  }
}

# WebSocket Integration

Build live voice conversations with WebSocket for bidirectional audio streaming.

websocket-server.ts
// ─────────────────────────────────────────────────────────────────────────────
// WebSocket Integration for Live Voice Conversations
//
// This example shows how to build a real-time voice chat using WebSocket.
// The client sends audio chunks, and the server streams back audio responses.
// ─────────────────────────────────────────────────────────────────────────────
 
import { WebSocketServer } from 'ws';
import { RealtimeAgent } from '@orka-js/realtime';
 
// Create the voice agent
const agent = new RealtimeAgent({
config: {
goal: 'Have natural voice conversations',
tts: true,
},
llm, stt, tts,
tools: [/* your tools */],
});
 
// Create WebSocket server
const wss = new WebSocketServer({ port: 8080 });
 
wss.on('connection', (ws) => {
console.log('Client connected');
 
// Buffer to accumulate audio chunks from client
let audioBuffer: Buffer[] = [];
 
ws.on('message', async (data, isBinary) => {
if (isBinary) {
// ─── Receiving Audio from Client ───────────────────────────────────────
// Client sends audio chunks as binary data
audioBuffer.push(data as Buffer);
} else {
// ─── Control Messages ──────────────────────────────────────────────────
const message = JSON.parse(data.toString());
 
if (message.type === 'end_audio') {
// Client finished sending audio, process it
const fullAudio = Buffer.concat(audioBuffer);
audioBuffer = []; // Reset buffer
 
// Process with streaming
for await (const event of agent.processStream(fullAudio, 'audio/wav')) {
if (event.type === 'transcript') {
// Send transcript to client
ws.send(JSON.stringify({ type: 'transcript', text: event.text }));
} else if (event.type === 'token') {
// Send text tokens for display
ws.send(JSON.stringify({ type: 'token', token: event.token }));
} else if (event.type === 'audio_chunk') {
// Send audio chunk as binary for immediate playback
ws.send(event.chunk);
} else if (event.type === 'done') {
ws.send(JSON.stringify({ type: 'done' }));
}
}
}
}
});
 
ws.on('close', () => {
console.log('Client disconnected');
});
});
 
console.log('Voice WebSocket server running on ws://localhost:8080');
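On the client side, each incoming frame has to be demultiplexed the same way the server mixes them: binary frames carry audio, text frames carry JSON control messages. A minimal sketch of that branching as a pure function (the frame shapes mirror the server code above; the `ServerEvent` type and `parseServerFrame` name are introduced here for illustration):

```typescript
// Discriminated union for the frames the server above emits
type ServerEvent =
  | { kind: 'audio'; chunk: Buffer }
  | { kind: 'transcript'; text: string }
  | { kind: 'token'; token: string }
  | { kind: 'done' };

// Demultiplex one WebSocket frame: binary → audio, text → parsed JSON control
function parseServerFrame(data: Buffer, isBinary: boolean): ServerEvent {
  if (isBinary) return { kind: 'audio', chunk: data };
  const msg = JSON.parse(data.toString());
  switch (msg.type) {
    case 'transcript': return { kind: 'transcript', text: msg.text };
    case 'token': return { kind: 'token', token: msg.token };
    case 'done': return { kind: 'done' };
    default: throw new Error(`Unknown frame type: ${msg.type}`);
  }
}
```

In a real client you would call this from the socket's `message` handler, feeding `audio` chunks to playback and `transcript`/`token` events to the UI.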

# Voice Configuration

Customize voice output with different voices, speeds, and formats.

voices.ts
// ─────────────────────────────────────────────────────────────────────────────
// Voice Configuration
//
// Customize the voice output for different use cases.
// ─────────────────────────────────────────────────────────────────────────────
 
import { OpenAITTSAdapter, RealtimeAgent } from '@orka-js/realtime';

// ─── Available OpenAI Voices ─────────────────────────────────────────────────
// Each voice has a distinct personality:
//
// - alloy: Neutral, versatile (good default)
// - echo: Warm, conversational
// - fable: Expressive, storytelling
// - onyx: Deep, authoritative
// - nova: Friendly, upbeat
// - shimmer: Clear, professional

// ─── Professional Customer Service Voice ─────────────────────────────────────
const professionalTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'shimmer', // Clear and professional
  model: 'tts-1-hd', // High quality for important interactions
  speed: 1.0, // Normal speed
  responseFormat: 'mp3',
});

// ─── Friendly Assistant Voice ────────────────────────────────────────────────
const friendlyTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'nova', // Friendly and upbeat
  model: 'tts-1', // Standard quality for faster responses
  speed: 1.1, // Slightly faster for energetic feel
  responseFormat: 'opus', // Smaller file size for web
});

// ─── Audiobook / Storytelling Voice ──────────────────────────────────────────
const storytellerTTS = new OpenAITTSAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  voice: 'fable', // Expressive for storytelling
  model: 'tts-1-hd', // High quality for long-form content
  speed: 0.9, // Slightly slower for clarity
  responseFormat: 'flac', // Lossless for archival
});

// ─── Use Different Voices for Different Agents ───────────────────────────────
// (llm and stt are the adapters from the basic example)

const supportAgent = new RealtimeAgent({
  config: { goal: 'Customer support', tts: true },
  llm, stt,
  tts: professionalTTS, // Professional voice for support
});

const salesAgent = new RealtimeAgent({
  config: { goal: 'Sales assistance', tts: true },
  llm, stt,
  tts: friendlyTTS, // Friendly voice for sales
});

# Conversation Memory

Maintain context across multiple voice interactions for natural conversations.

memory.ts
// ─────────────────────────────────────────────────────────────────────────────
// Conversation Memory for Voice Agents
//
// Maintain context across multiple voice interactions for natural conversations.
// The agent remembers what was said earlier in the conversation.
// ─────────────────────────────────────────────────────────────────────────────
 
import { RealtimeAgent } from '@orka-js/realtime';
import { Memory, SessionMemory } from '@orka-js/memory-store';

// Create conversation memory
const memory = new Memory({
  maxMessages: 20, // Keep last 20 messages
  strategy: 'sliding', // Sliding window strategy
});

// Create agent with memory (llm, stt, tts as in the basic example)
const agent = new RealtimeAgent({
  config: {
    goal: 'Have natural multi-turn voice conversations',
    tts: true,
  },
  llm, stt, tts,
  memory, // Attach memory to the agent
});

// ─── Multi-Turn Conversation ─────────────────────────────────────────────────
// audioBuffer1..3 hold the user's successive utterances as audio buffers

// Turn 1: User asks about products
const turn1 = await agent.process(audioBuffer1, 'audio/wav');
console.log('User:', turn1.transcript);
// "What laptops do you have?"
console.log('Agent:', turn1.response);
// "We have several laptops! We have the MacBook Pro, Dell XPS, and ThinkPad..."

// Turn 2: User follows up (agent remembers context)
const turn2 = await agent.process(audioBuffer2, 'audio/wav');
console.log('User:', turn2.transcript);
// "Which one is best for programming?"
console.log('Agent:', turn2.response);
// "For programming, I'd recommend the MacBook Pro or ThinkPad.
// The MacBook Pro has excellent build quality and the M3 chip is very fast..."

// Turn 3: User asks for comparison (agent still has context)
const turn3 = await agent.process(audioBuffer3, 'audio/wav');
console.log('User:', turn3.transcript);
// "How do they compare in price?"
console.log('Agent:', turn3.response);
// "The MacBook Pro starts at $1,999 while the ThinkPad starts at $1,299..."

// ─── Session Management ──────────────────────────────────────────────────────

// Clear memory to start a new conversation
memory.clear();

// Or use session IDs for multiple concurrent conversations
const sessionMemory = new SessionMemory({
  sessionId: 'user_123',
  ttlMinutes: 30, // Session expires after 30 minutes of inactivity
});

const agentWithSession = new RealtimeAgent({
  config: { goal: 'Multi-user voice support', tts: true },
  llm, stt, tts,
  memory: sessionMemory,
});
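What the `strategy: 'sliding'` setting implies (our reading of the config, stated as an assumption rather than documented library behavior): once the history exceeds `maxMessages`, the oldest messages are dropped so the window always holds the most recent turns. In pure-function form:

```typescript
// Hypothetical sketch of a sliding-window trim, assuming the 'sliding'
// strategy keeps only the most recent maxMessages entries
type Message = { role: 'user' | 'assistant'; content: string };

function slideWindow(messages: Message[], maxMessages: number): Message[] {
  // slice(-n) keeps the last n elements; shorter histories pass through intact
  return messages.length <= maxMessages ? messages : messages.slice(-maxMessages);
}
```

With `maxMessages: 20`, a 25-message history would keep messages 6 through 25, so the agent still sees the most recent context while older turns age out.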

Voice Agent Tips

  • Keep responses short and conversational: users are listening, not reading
  • Use streaming in real-time applications to minimize perceived latency
  • Choose the right voice for your use case: professional for support, friendly for sales
  • Use tts-1 for speed and tts-1-hd for quality in important interactions
  • Always handle errors gracefully: voice users cannot see error messages
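The last tip can be sketched as a small wrapper: if the pipeline throws, the user should still hear something rather than silence. This is an illustrative pattern, not part of the library; the result shape follows the examples above, and the `processWithFallback` name and fallback phrase are assumptions:

```typescript
// Result shape as used in the examples above
type VoiceResult = { transcript: string; response: string; audio?: Buffer };

const FALLBACK_TEXT = "Sorry, I didn't catch that. Could you say it again?";

// Wrap a voice pipeline call so failures yield a spoken fallback, not silence
async function processWithFallback(
  run: (audio: Buffer, mime: string) => Promise<VoiceResult>,
  audio: Buffer,
  mime: string,
): Promise<VoiceResult> {
  try {
    return await run(audio, mime);
  } catch (err) {
    // Log for the operator; the user only hears the fallback phrase
    console.error('Voice pipeline failed:', err);
    return { transcript: '', response: FALLBACK_TEXT };
  }
}
```

You would pass `(audio, mime) => agent.process(audio, mime)` as `run`, and synthesize `FALLBACK_TEXT` through your TTS adapter when the error path is taken.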