Multimodal
Send images, screenshots, and audio alongside text to your LLM. Orka AI supports multimodal inputs natively through OpenAI and Anthropic adapters.
How It Works
Multimodal support in Orka AI is built on the ChatMessage and ContentPart types. Instead of sending a plain string prompt, you compose messages with mixed content parts: text, images (URL or base64), and audio.
🧩 Content Part Types
- text – Plain text content
- image_url – Image from a URL (with detail level: auto, low, high)
- image_base64 – Image encoded in base64 (PNG, JPEG, GIF, WebP)
- audio – Audio data in WAV or MP3 format (OpenAI only)
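To make the shape concrete, here is a minimal sketch of composing a mixed-content message. It declares local copies of the ChatMessage and ContentPart types (mirroring the "TypeScript Types" section below) so it runs without the orkajs package installed; in real code you would import them from orkajs instead.

```typescript
// Local copies of the orkajs types, so this sketch is self-contained.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// A user message mixing a text part and an image part.
const message: ChatMessage = {
  role: 'user',
  content: [
    { type: 'text', text: 'Describe this photo.' },
    { type: 'image_url', image_url: { url: 'https://example.com/photo.jpg', detail: 'auto' } },
  ],
};

console.log(Array.isArray(message.content) && message.content.length); // 2
```

Because content is a union of string and ContentPart[], plain-text prompts and multimodal prompts use the same message type.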
# Image Analysis (URL)
The simplest way to analyze an image is to pass its URL. The LLM will download and process the image automatically. This works with both OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Claude 3 Opus).
```typescript
import { createOrka } from 'orkajs/core';
import { OpenAIAdapter } from 'orkajs/adapters';

const orka = createOrka({
  llm: new OpenAIAdapter({
    apiKey: process.env.OPENAI_API_KEY!,
    model: 'gpt-4o' // Must use a vision-capable model
  })
});

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image? Describe it in detail.' },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/photo.jpg',
            detail: 'high' // 'auto' | 'low' | 'high'
          }
        }
      ]
    }
  ]
});

console.log(result.content);
// "The image shows a sunset over the ocean with..."
```

- auto – The model decides the detail level based on the image size. Best default choice.
- low – Faster and cheaper. Uses a 512×512 thumbnail. Good for simple classification.
- high – Full resolution analysis. Best for OCR, detailed descriptions, and small text reading.
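The cost difference between detail levels can be estimated before sending. The sketch below follows OpenAI's published token accounting for GPT-4o-class vision models at the time of writing (a flat 85 tokens for low detail; 85 plus 170 per 512×512 tile for high detail, after rescaling); other models and providers use different formulas, so treat the numbers as an assumption to verify against current pricing docs.

```typescript
// Rough token-cost estimate for one image, based on OpenAI's published
// accounting for GPT-4o-class vision models: low detail is a flat 85 tokens;
// high detail scales the image to fit 2048x2048, then scales its short side
// to at most 768, and charges 85 + 170 per 512x512 tile. Other models differ.
function estimateImageTokens(width: number, height: number, detail: 'low' | 'high'): number {
  if (detail === 'low') return 85;

  // Scale to fit within a 2048x2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Scale so the shortest side is at most 768.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024, 'low'));  // 85
console.log(estimateImageTokens(1024, 1024, 'high')); // 765
```

A 1024×1024 image at high detail rescales to 768×768, which is four tiles, so high detail costs roughly nine times as much as low detail here.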
# Image Analysis (Base64)
For local files or dynamically generated images, encode them in base64. This avoids the need for a public URL and works with both OpenAI and Anthropic.
```typescript
import { readFileSync } from 'fs';

// Read local image file
const imageBuffer = readFileSync('./screenshot.png');
const base64Image = imageBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract all text from this screenshot.' },
        {
          type: 'image_base64',
          data: base64Image,
          mimeType: 'image/png' // 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp'
        }
      ]
    }
  ]
});

console.log(result.content);
```

# Multiple Images
You can send multiple images in a single message for comparison, analysis, or multi-page document processing.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two UI designs. Which one is better and why?' },
        { type: 'image_url', image_url: { url: 'https://example.com/design-a.png', detail: 'high' } },
        { type: 'image_url', image_url: { url: 'https://example.com/design-b.png', detail: 'high' } }
      ]
    }
  ]
});
```

# Audio Input (OpenAI)
OpenAI's GPT-4o models support audio input. Send audio data in WAV or MP3 format for transcription, analysis, or voice-based interaction.
```typescript
import { readFileSync } from 'fs';

const audioBuffer = readFileSync('./recording.wav');
const base64Audio = audioBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe this audio and summarize the key points.' },
        {
          type: 'audio',
          data: base64Audio,
          format: 'wav' // 'wav' | 'mp3'
        }
      ]
    }
  ]
});

console.log(result.content);
// "The speaker discusses three main topics: ..."
```

⚠️ Audio Limitations
- Audio input is currently supported only by the OpenAI adapter (audio-capable GPT-4o models)
- The Anthropic, Mistral, and Ollama adapters do not accept audio content parts; transcribe audio separately if you need it with those providers
- Maximum audio length depends on the model and your API plan
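Since maximum audio length varies, it can be useful to check a clip's duration locally before base64-encoding it. The helper below is a sketch, not part of the orkajs API: it reads the byte rate and data size from a canonical 44-byte WAV header, and assumes a plain PCM file whose "fmt " chunk is followed directly by "data".

```typescript
// Estimate a WAV clip's duration from its canonical 44-byte header.
// Assumes a standard PCM layout ("RIFF" ... "WAVE" ... "fmt " ... "data");
// files with extra chunks (LIST, fact, ...) need a real metadata parser.
function wavDurationSeconds(wav: Buffer): number {
  if (wav.toString('ascii', 0, 4) !== 'RIFF' || wav.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a RIFF/WAVE file');
  }
  const byteRate = wav.readUInt32LE(28); // bytes of audio per second
  const dataSize = wav.readUInt32LE(40); // payload size of the "data" chunk
  return dataSize / byteRate;
}
```

For MP3 or non-canonical WAV files, use a dedicated audio metadata library instead of hand-parsing headers.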
# With System Prompt
Combine multimodal content with system prompts for specialized analysis tasks.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'system',
      content: 'You are an expert radiologist. Analyze medical images with precision and provide structured reports.'
    },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Please analyze this X-ray image.' },
        { type: 'image_url', image_url: { url: 'https://example.com/xray.jpg', detail: 'high' } }
      ]
    }
  ]
});
```

# Multi-turn Conversations
Build multi-turn conversations that reference previously shared images.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Here is a photo of my living room.' },
        { type: 'image_url', image_url: { url: 'https://example.com/room.jpg' } }
      ]
    },
    {
      role: 'assistant',
      content: 'I can see a modern living room with a gray sofa, wooden coffee table...'
    },
    {
      role: 'user',
      content: 'What color should I paint the walls to complement the furniture?'
    }
  ]
});
```

Provider Compatibility
| Feature | OpenAI | Anthropic | Mistral | Ollama |
|---|---|---|---|---|
| Image (URL) | ✅ | ✅ | ✅ | ❌ |
| Image (Base64) | ✅ | ✅ | ✅ | ✅ |
| Audio | ✅ | ❌ | ❌ | ❌ |
| Multiple images | ✅ | ✅ | ✅ | ✅ |
Use Cases
📸 Document OCR
Extract text from scanned documents, receipts, invoices, and handwritten notes.
🎨 UI/UX Analysis
Analyze screenshots for accessibility issues, design feedback, and component identification.
📊 Chart & Data Extraction
Extract data from charts, graphs, and tables in images for further processing.
🎙️ Voice Transcription
Transcribe meetings, interviews, and voice memos with context-aware summarization.
TypeScript Types
```typescript
import type { ChatMessage, ContentPart } from 'orkajs';

// ChatMessage
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// ContentPart – union type
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };
```

Best Practices
1. Choose the Right Detail Level
Use 'low' for simple classification tasks to save tokens and cost. Use 'high' for OCR and detailed analysis.
2. Optimize Image Size
Resize large images before sending to reduce token usage. Most models work well with images under 2048×2048.
3. Use Base64 for Sensitive Data
For private or sensitive images, use base64 encoding instead of URLs to avoid exposing data publicly.
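When following practice 3, the mimeType has to match the file you encoded. A small helper can derive it from the file extension, limited to the four formats the image_base64 part accepts. The mimeTypeFor function below is hypothetical, not part of the orkajs API.

```typescript
import { extname } from 'path';

// The four mime types the image_base64 content part accepts.
type ImageMime = 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp';

// Hypothetical helper: map a file extension to its image mime type,
// rejecting anything the adapters do not support.
function mimeTypeFor(file: string): ImageMime {
  const table: Record<string, ImageMime> = {
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.gif': 'image/gif',
    '.webp': 'image/webp',
  };
  const mime = table[extname(file).toLowerCase()];
  if (!mime) throw new Error(`unsupported image type: ${file}`);
  return mime;
}

console.log(mimeTypeFor('./screenshot.PNG')); // image/png
```

Pairing this with readFileSync and toString('base64') gives you a complete local-file path to an image_base64 part without hardcoding the mime type.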