Multimodal Processing
Build applications that understand images, audio, and text together with VisionAgent, AudioAgent, and MultimodalAgent.
Multimodal AI combines vision, audio, and text processing to create rich, context-aware applications. OrkaJS provides specialized agents and utilities for each modality.
extractText()
describeImage()
speak()
Whisper API
🖼️ Vision
Image analysis, OCR, comparison
🎙️ Audio
Whisper transcription, TTS
🔀 Cross-modal
Combined vision + audio workflows
Document Analysis with VisionAgent
Extract text from images and analyze documents with VisionAgent.
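The example below passes images as `{ type: 'base64', data, mimeType }` source objects. Building those by hand for every file gets repetitive; here is a small helper for constructing one from a local path. This is a sketch under assumptions: `imageSourceFromFile`, `mimeForFile`, and the extension-to-MIME map are hypothetical helpers of my own, not part of the OrkaJS API — only the source-object shape comes from the examples in this page.

```typescript
import { readFileSync } from 'fs';
import { extname } from 'path';

// Hypothetical extension → MIME map covering common image formats.
const MIME_BY_EXT: Record<string, string> = {
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.webp': 'image/webp',
  '.gif': 'image/gif',
};

function mimeForFile(path: string): string {
  const mime = MIME_BY_EXT[extname(path).toLowerCase()];
  if (!mime) throw new Error(`Unsupported image type: ${path}`);
  return mime;
}

// Build the base64 image-source object used throughout this page.
function imageSourceFromFile(path: string) {
  return {
    type: 'base64' as const,
    data: readFileSync(path).toString('base64'),
    mimeType: mimeForFile(path),
  };
}
```

With a helper like this, the OCR call below becomes `await visionAgent.extractText(imageSourceFromFile('./invoice.png'))`.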
import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync } from 'fs';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const visionAgent = new VisionAgent({
  llm,
  systemPrompt: 'You are an expert document analyst. Extract information accurately.',
  detail: 'high',
  temperature: 0.1
});

// Process an invoice
const invoiceImage = readFileSync('./invoice.png');
const base64 = invoiceImage.toString('base64');

const ocrResult = await visionAgent.extractText({
  type: 'base64',
  data: base64,
  mimeType: 'image/png'
});

console.log('Extracted text:', ocrResult.result);

// Ask specific questions about the document
const answer = await visionAgent.ask(
  { type: 'base64', data: base64, mimeType: 'image/png' },
  'What is the total amount and due date on this invoice?'
);

console.log('Invoice details:', answer);

// Batch process multiple documents
const results = await visionAgent.runTasks([
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc1.png' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc2.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/chart.png' } }
]);

results.forEach((r, i) => {
  console.log(`Document ${i + 1} (${r.task}):`, r.result);
});

Meeting Transcription with AudioAgent
Transcribe meetings and generate audio responses with AudioAgent.
import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const audioAgent = new AudioAgent({
  adapter,
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe a meeting recording
const meetingAudio = readFileSync('./meeting.mp3');
const transcription = await audioAgent.transcribe({
  type: 'buffer',
  data: meetingAudio.buffer
}, { includeTimestamps: true });

console.log('Meeting transcript:', transcription.result);
console.log('Duration:', transcription.metadata?.duration, 'seconds');

// Generate a voice summary
const summaryText = 'The meeting covered three main topics: Q4 results, 2024 roadmap, and team expansion.';
const voiceSummary = await audioAgent.speak(summaryText, {
  voice: 'onyx',
  speed: 1.1
});

writeFileSync('./meeting-summary.mp3', Buffer.from(voiceSummary.result as ArrayBuffer));

// Transcribe and process in one step
const processed = await audioAgent.transcribeAndProcess(
  { type: 'buffer', data: meetingAudio.buffer },
  async (text) => {
    // You could use an LLM here to summarize
    const textSentences = text.split('. ');
    return `Key points (${textSentences.length} sentences): ${textSentences.slice(0, 3).join('. ')}...`;
  }
);

console.log('Processed:', processed.processed);

Presentation Analysis with MultimodalAgent
Analyze presentations by combining slides (images) with speaker notes (audio).
import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const multimodalAgent = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are an expert presentation analyst. Analyze slides and speaker audio together to provide comprehensive insights.
Focus on: key messages, data points, and recommendations.`,
  maxTokens: 4096
});

// Analyze a presentation with slides and audio
// (speakerAudioBase64 is the base64-encoded speaker recording, loaded elsewhere)
const result = await multimodalAgent.process({
  text: 'Analyze this presentation. What are the key takeaways?',
  images: [
    { type: 'url', url: 'https://example.com/slide1.png' },
    { type: 'url', url: 'https://example.com/slide2.png' },
    { type: 'url', url: 'https://example.com/slide3.png' }
  ],
  audio: [
    { type: 'base64', data: speakerAudioBase64 }
  ]
});

console.log('Analysis:', result.response);
console.log('Transcribed audio:', result.transcriptions);
console.log('Tokens used:', result.usage.totalTokens);

// Follow-up questions
const followUp = await multimodalAgent.ask(
  'What specific metrics were mentioned in the presentation?',
  { images: [{ type: 'url', url: 'https://example.com/slide2.png' }] }
);

console.log('Metrics:', followUp);

// Compare before/after slides
const comparison = await multimodalAgent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/q3-results.png' },
    { type: 'url', url: 'https://example.com/q4-results.png' }
  ],
  'Compare Q3 and Q4 performance. What improved and what declined?'
);

console.log('Comparison:', comparison);

Customer Support Bot
Build a support bot that understands screenshots, voice messages, and text.
import { MultimodalAgent, isVisionCapable, isAudioCapable } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

// Verify capabilities
console.log('Vision support:', isVisionCapable(llm));
console.log('Audio support:', isAudioCapable(llm));

const supportBot = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are a helpful customer support agent for a software product.
When users share screenshots, identify the issue and provide step-by-step solutions.
When users share voice messages, transcribe and respond appropriately.
Be concise, friendly, and solution-oriented.`
});

// Handle a support request with screenshot
async function handleSupportRequest(request: {
  text?: string;
  screenshot?: string;   // base64
  voiceMessage?: string; // base64
}) {
  const images = request.screenshot
    ? [{ type: 'base64' as const, data: request.screenshot, mimeType: 'image/png' as const }]
    : undefined;

  const audio = request.voiceMessage
    ? [{ type: 'base64' as const, data: request.voiceMessage }]
    : undefined;

  const result = await supportBot.process({
    text: request.text || 'Please help me with this issue.',
    images,
    audio
  });

  return {
    response: result.response,
    transcription: result.transcriptions?.[0],
    processingTime: result.latencyMs
  };
}

// Example usage
// (errorScreenshotBase64 is the base64-encoded screenshot, captured elsewhere)
const response = await handleSupportRequest({
  text: 'I keep getting this error when I try to export',
  screenshot: errorScreenshotBase64
});

console.log('Support response:', response.response);

💡 Production Tips
- Use detail: 'low' for simple classification, 'high' for OCR
- Compress images before sending to reduce costs
- Cache transcriptions for repeated audio content
- Use isVisionCapable() and isAudioCapable() to verify support
- Set appropriate timeouts for large audio files
Complete Example: Image-to-Audio Pipeline
import { VisionAgent, AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const visionAgent = new VisionAgent({ llm: adapter });
const audioAgent = new AudioAgent({ adapter });

// Pipeline: Image → Description → Audio
async function imageToAudio(imageUrl: string): Promise<ArrayBuffer> {
  // Step 1: Analyze the image
  const description = await visionAgent.describe({ type: 'url', url: imageUrl });
  console.log('Image description:', description.result);

  // Step 2: Generate audio narration
  const narration = typeof description.result === 'object'
    ? (description.result as { description: string }).description
    : String(description.result);

  const audio = await audioAgent.speak(
    `This image shows: ${narration}`,
    { voice: 'nova', speed: 0.9 }
  );

  return audio.result as ArrayBuffer;
}

// Usage
const audioBuffer = await imageToAudio('https://example.com/landscape.jpg');
writeFileSync('./image-narration.mp3', Buffer.from(audioBuffer));
console.log('Audio narration saved!');