OrkaJS

Multimodal

Send images, screenshots, and audio alongside text to your LLM. Orka AI supports multimodal inputs natively through OpenAI and Anthropic adapters.

How It Works

Multimodal support in Orka AI is built on the ChatMessage and ContentPart types. Instead of sending a plain string prompt, you compose messages with mixed content parts: text, images (URL or base64), and audio.

🧩 Content Part Types

  • text — Plain text content
  • image_url — Image from a URL (with detail level: auto, low, high)
  • image_base64 — Image encoded in base64 (PNG, JPEG, GIF, WebP)
  • audio — Audio data in WAV or MP3 format (OpenAI only)
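For example, a single user message can mix several part kinds in one content array. A minimal sketch; the ContentPart union is repeated inline (it mirrors the TypeScript Types section below) so the snippet stands alone:

```typescript
// ContentPart union, as declared in the TypeScript Types section.
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };

// One user message mixing text, an image, and audio in a single content array.
const parts: ContentPart[] = [
  { type: 'text', text: 'Describe this image and transcribe the audio.' },
  { type: 'image_url', image_url: { url: 'https://example.com/photo.jpg', detail: 'auto' } },
  { type: 'audio', data: 'UklGRg==', format: 'wav' } // base64 placeholder, not real audio
];

console.log(parts.map(p => p.type).join(', '));
// text, image_url, audio
```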

# Image Analysis (URL)

The simplest way to analyze an image is to pass its URL. The LLM will download and process the image automatically. This works with both OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Claude 3 Opus).

```typescript
import { createOrka } from 'orkajs/core';
import { OpenAIAdapter } from 'orkajs/adapters';

const orka = createOrka({
  llm: new OpenAIAdapter({
    apiKey: process.env.OPENAI_API_KEY!,
    model: 'gpt-4o' // Must use a vision-capable model
  })
});

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image? Describe it in detail.' },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/photo.jpg',
            detail: 'high' // 'auto' | 'low' | 'high'
          }
        }
      ]
    }
  ]
});

console.log(result.content);
// "The image shows a sunset over the ocean with..."
```

The detail field controls how much of the image the model processes:

  • auto — The model decides the detail level based on the image size. The best default choice.
  • low — Faster and cheaper. Uses a 512×512 thumbnail. Good for simple classification.
  • high — Full-resolution analysis. Best for OCR, detailed descriptions, and reading small text.

# Image Analysis (Base64)

For local files or dynamically generated images, encode them in base64. This avoids the need for a public URL and works with both OpenAI and Anthropic.

```typescript
import { readFileSync } from 'fs';

// Read local image file
const imageBuffer = readFileSync('./screenshot.png');
const base64Image = imageBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract all text from this screenshot.' },
        {
          type: 'image_base64',
          data: base64Image,
          mimeType: 'image/png' // 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp'
        }
      ]
    }
  ]
});

console.log(result.content);
```

# Multiple Images

You can send multiple images in a single message for comparison, analysis, or multi-page document processing.

```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two UI designs. Which one is better and why?' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/design-a.png', detail: 'high' }
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/design-b.png', detail: 'high' }
        }
      ]
    }
  ]
});
```

# Audio Input (OpenAI)

OpenAI's GPT-4o models support audio input. Send audio data in WAV or MP3 format for transcription, analysis, or voice-based interaction.

```typescript
import { readFileSync } from 'fs';

const audioBuffer = readFileSync('./recording.wav');
const base64Audio = audioBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe this audio and summarize the key points.' },
        {
          type: 'audio',
          data: base64Audio,
          format: 'wav' // 'wav' | 'mp3'
        }
      ]
    }
  ]
});

console.log(result.content);
// "The speaker discusses three main topics: ..."
```

⚠️ Audio Limitations

  • Audio input is currently supported only by the OpenAI adapter (GPT-4o models)
  • The Anthropic (Claude) adapter does not accept audio content parts
  • Maximum audio length depends on the model and your API plan

# With System Prompt

Combine multimodal content with system prompts for specialized analysis tasks.

```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'system',
      content: 'You are an expert radiologist. Analyze medical images with precision and provide structured reports.'
    },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Please analyze this X-ray image.' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/xray.jpg', detail: 'high' }
        }
      ]
    }
  ]
});
```

# Multi-turn Conversations

Build multi-turn conversations that reference previously shared images.

```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Here is a photo of my living room.' },
        { type: 'image_url', image_url: { url: 'https://example.com/room.jpg' } }
      ]
    },
    {
      role: 'assistant',
      content: 'I can see a modern living room with a gray sofa, wooden coffee table...'
    },
    {
      role: 'user',
      content: 'What color should I paint the walls to complement the furniture?'
    }
  ]
});
```

Provider Compatibility

| Feature         | OpenAI | Anthropic | Mistral | Ollama |
| --------------- | ------ | --------- | ------- | ------ |
| Image (URL)     | ✅     | ✅        | ❌      | ❌     |
| Image (Base64)  | ✅     | ✅        | ❌      | ❌     |
| Audio           | ✅     | ❌        | ❌      | ❌     |
| Multiple images | ✅     | ✅        | ❌      | ❌     |

Use Cases

📸 Document OCR

Extract text from scanned documents, receipts, invoices, and handwritten notes.

🎨 UI/UX Analysis

Analyze screenshots for accessibility issues, design feedback, and component identification.

📊 Chart & Data Extraction

Extract data from charts, graphs, and tables in images for further processing.

🎙️ Voice Transcription

Transcribe meetings, interviews, and voice memos with context-aware summarization.

TypeScript Types

```typescript
import type { ChatMessage, ContentPart } from 'orkajs';

// ChatMessage
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// ContentPart — union type
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };
```
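Because ContentPart is a discriminated union on the type field, you can narrow parts with an ordinary type guard. A minimal sketch (isText is an illustrative helper, not part of Orka; the union is repeated inline so the snippet compiles on its own):

```typescript
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };

// Type guard: narrows a ContentPart to its text variant.
function isText(part: ContentPart): part is Extract<ContentPart, { type: 'text' }> {
  return part.type === 'text';
}

const mixed: ContentPart[] = [
  { type: 'text', text: 'Compare these designs.' },
  { type: 'image_url', image_url: { url: 'https://example.com/design-a.png' } }
];

// Collect only the text parts; TypeScript knows `.text` exists here.
const textParts = mixed.filter(isText).map(p => p.text);
console.log(textParts.join(' '));
// Compare these designs.
```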

Best Practices

1. Choose the Right Detail Level

Use 'low' for simple classification tasks to save tokens and cost. Use 'high' for OCR and detailed analysis.

2. Optimize Image Size

Resize large images before sending to reduce token usage. Most models work well with images under 2048×2048.
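The scaling itself is simple arithmetic. As a sketch, fitWithin is a hypothetical helper (not part of Orka) that computes target dimensions inside a bounding square while preserving aspect ratio; pair it with any image library's resize call before base64-encoding:

```typescript
// Compute target dimensions that fit within maxSide x maxSide,
// preserving aspect ratio and never upscaling.
function fitWithin(width: number, height: number, maxSide = 2048): { width: number; height: number } {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

console.log(fitWithin(4032, 3024)); // { width: 2048, height: 1536 }
console.log(fitWithin(800, 600));   // { width: 800, height: 600 } (already small enough)
```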

3. Use Base64 for Sensitive Data

For private or sensitive images, use base64 encoding instead of URLs to avoid exposing data publicly.