Document Loaders
Load data from various sources: PDF, CSV, JSON, Markdown, text files, and directories.
Overview
Document loaders transform raw data from different sources into a unified Document format that Orka AI can process. Each loader handles specific file types and extraction logic.
What is a Document?
All loaders return an array of Document objects with this structure:
```typescript
interface Document {
  id: string;       // Unique identifier
  content: string;  // The actual text content
  metadata: {
    source?: string;         // File path or URL
    loader?: string;         // Loader name
    [key: string]: unknown;  // Your custom metadata fields
  };
}
```

# TextLoader
Load plain text files with custom encoding support. The simplest loader — reads a file and returns its content as a single document.
```typescript
import { TextLoader } from 'orkajs/loaders/text';

const loader = new TextLoader('./document.txt', {
  encoding: 'utf-8',
  metadata: { source: 'documentation' }
});

const documents = await loader.load();
// [{ id: '...', content: '...', metadata: { source: 'documentation', loader: 'TextLoader' } }]
```

Constructor Parameters
- `path: string` - Absolute or relative path to the text file.
- `options.encoding?: string` - Character encoding (default: `'utf-8'`). Supports `'utf-8'`, `'ascii'`, `'latin1'`, etc.
- `options.metadata?: Record<string, unknown>` - Custom metadata to attach to the document. Useful for categorization, filtering, or tracking.
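Conceptually, this loader does very little: it reads the file with the given encoding and wraps the result in a single `Document`. The sketch below is a hypothetical re-implementation (not orka's actual source) showing how the custom metadata would merge with the loader-provided `source` and `loader` fields; the `loadText` name and the merge order are assumptions for illustration.

```typescript
import { readFileSync } from 'node:fs';
import { randomUUID } from 'node:crypto';

interface Doc {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

function loadText(
  path: string,
  options: { encoding?: BufferEncoding; metadata?: Record<string, unknown> } = {}
): Doc[] {
  // Passing an encoding makes readFileSync return a string instead of a Buffer
  const content = readFileSync(path, options.encoding ?? 'utf-8');
  return [{
    id: randomUUID(),
    content,
    // Spread last so custom metadata can override the defaults
    metadata: { source: path, loader: 'TextLoader', ...options.metadata },
  }];
}
```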
# CSVLoader
Parse CSV files with support for custom separators, column selection, and content extraction. Each row becomes a separate document, making it perfect for loading structured data like product catalogs, user lists, or FAQ databases.
```typescript
import { CSVLoader } from 'orkajs/loaders/csv';

// Option 1: Use a specific column as content
const loader = new CSVLoader('./data.csv', {
  separator: ',',
  contentColumn: 'description',      // Use this column as document content
  metadata: { type: 'product_data' }
});

// Option 2: Combine multiple columns
const loader2 = new CSVLoader('./users.csv', {
  columns: ['name', 'bio', 'interests'] // Combine these columns
});

const documents = await loader.load();
// Each row becomes a separate document
```

💡 CSV Features
- ✅ Handles quoted fields with commas
- ✅ Custom separators (comma, semicolon, tab)
- ✅ Automatic metadata extraction from columns
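The first feature is the subtle one: a quoted field may itself contain the separator. A minimal sketch of how such a row can be split (this is an illustration of the technique, not orka's actual parser): a flag tracks whether we are inside quotes so embedded separators are kept, and a doubled quote inside a quoted field escapes a literal quote.

```typescript
function parseCsvRow(line: string, separator = ','): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === '"') {
      if (inQuotes && line[i + 1] === '"') {
        current += '"'; // doubled quote ("") inside a quoted field -> literal quote
        i++;
      } else {
        inQuotes = !inQuotes; // opening or closing quote
      }
    } else if (ch === separator && !inQuotes) {
      fields.push(current); // field boundary only outside quotes
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current); // last field has no trailing separator
  return fields;
}
```

For example, `parseCsvRow('a,"b,c",d')` keeps the quoted comma and yields three fields, not four.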
# JSONLoader
Load JSON files or objects with JSONPath support for nested data extraction. Handles both single objects and arrays, with flexible field mapping for content and metadata.
```typescript
import { JSONLoader } from 'orkajs/loaders/json';

// Load from file
const loader = new JSONLoader('./data.json', {
  contentField: 'text',               // Use this field as content
  metadataFields: ['author', 'date'], // Extract these as metadata
  jsonPath: '$.articles'              // Extract nested array
});

// Load from object
const data = [
  { text: 'Article 1', author: 'Alice' },
  { text: 'Article 2', author: 'Bob' }
];
const loader2 = new JSONLoader(data, {
  contentField: 'text'
});

const documents = await loader.load();
```

# MarkdownLoader
Load Markdown files with frontmatter extraction and header parsing. Perfect for documentation, blog posts, or any content with YAML frontmatter metadata.
```typescript
import { MarkdownLoader } from 'orkajs/loaders/markdown';

const loader = new MarkdownLoader('./README.md', {
  removeFrontmatter: true,  // Strip YAML frontmatter from the content
  includeHeaders: true,     // Extract all headers as metadata
  metadata: { type: 'documentation' }
});

const documents = await loader.load();
// Frontmatter fields are added to metadata
// Headers are available in metadata.headers
```

Example Markdown with Frontmatter
```markdown
---
title: Getting Started
author: Alice
date: 2024-01-15
---

# Introduction

This is the content...
```

# PDFLoader
Extract text from PDF files with page selection and metadata extraction. Each page becomes a separate document with page number tracking, ideal for research papers, reports, or manuals.
📦 Installation Required
PDFLoader requires the pdf-parse package:
```bash
npm install pdf-parse
```

```typescript
import { PDFLoader } from 'orkajs/loaders/pdf';

// Load entire PDF
const loader = new PDFLoader('./document.pdf', {
  metadata: { source: 'research_paper' }
});

// Load specific pages
const loader2 = new PDFLoader('./report.pdf', {
  pages: [1, 2, 3], // Only pages 1, 2, 3
  maxPages: 10      // Or limit to the first 10 pages
});

const documents = await loader.load();
// Each page becomes a separate document with its page number in metadata
```

# DirectoryLoader
Recursively load all supported files from a directory with automatic loader detection. The most powerful loader — scans entire directory trees, detects file types, and applies the appropriate loader automatically.
```typescript
import { DirectoryLoader } from 'orkajs/loaders/directory';

const loader = new DirectoryLoader('./docs', {
  recursive: true,                    // Scan subdirectories
  glob: '*.md',                       // Filter by pattern
  exclude: ['node_modules', '.git'],  // Exclude folders
  metadata: { project: 'orka-docs' }
});

const documents = await loader.load();
// Automatically detects and uses the right loader for each file type
```

| Extension | Loader |
|---|---|
| .txt | TextLoader |
| .md, .mdx | MarkdownLoader |
| .csv | CSVLoader |
| .json, .jsonl | JSONLoader |
| .ts, .js, .py, .html, .css | TextLoader |
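The dispatch in the table above amounts to a lookup from file extension to loader. A hypothetical sketch of that mapping (the `detectLoader` name is an illustration, and the real DirectoryLoader may differ in detail):

```typescript
// Extension-to-loader table mirroring the documentation above
const loaderByExtension: Record<string, string> = {
  '.txt': 'TextLoader',
  '.md': 'MarkdownLoader',
  '.mdx': 'MarkdownLoader',
  '.csv': 'CSVLoader',
  '.json': 'JSONLoader',
  '.jsonl': 'JSONLoader',
  '.ts': 'TextLoader',
  '.js': 'TextLoader',
  '.py': 'TextLoader',
  '.html': 'TextLoader',
  '.css': 'TextLoader',
};

function detectLoader(filePath: string): string | undefined {
  const dot = filePath.lastIndexOf('.');
  // Lowercase the extension so Data.CSV and data.csv dispatch identically
  const ext = dot === -1 ? '' : filePath.slice(dot).toLowerCase();
  return loaderByExtension[ext]; // undefined means the file is unsupported and skipped
}
```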
Using with Knowledge
Loaders integrate seamlessly with Orka's Knowledge system. Here's a complete pipeline from loading documents to creating a searchable knowledge base:
```typescript
import { createOrka } from 'orkajs';
import { DirectoryLoader } from 'orkajs/loaders/directory';
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const orka = createOrka({ /* config */ });

// Load documents
const loader = new DirectoryLoader('./knowledge-base');
const documents = await loader.load();

// Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);

// Create knowledge base from the split chunks
await orka.knowledge.create({
  name: 'my-knowledge',
  source: chunks.map(d => ({ text: d.content, metadata: d.metadata }))
});
```

Tree-shaking Imports
Import only what you need to minimize bundle size:
```typescript
// ❌ Imports everything
import { CSVLoader, PDFLoader } from 'orkajs';

// ✅ Tree-shakeable - only bundles CSVLoader
import { CSVLoader } from 'orkajs/loaders/csv';

// ✅ Import all loaders from the index
import { CSVLoader, PDFLoader, JSONLoader } from 'orkajs/loaders';
```