OrkaJS

Document Loaders

Load data from various sources: PDF, CSV, JSON, Markdown, text files, and directories.

Overview

Document loaders transform raw data from different sources into a unified Document format that Orka AI can process. Each loader handles specific file types and extraction logic.

What is a Document?

All loaders return an array of Document objects with this structure:

interface Document {
  id: string; // Unique identifier
  content: string; // The actual text content
  metadata: {
    source?: string; // File path or URL
    loader?: string; // Loader name
    [key: string]: unknown; // Your custom metadata
  };
}
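
Because loaders return plain objects, a runtime check can be handy when documents arrive from an untyped source such as a cache or an HTTP response. The sketch below is not part of OrkaJS: `Doc` is a local copy of the structure above (renamed so it does not clash with the DOM `Document` type), and `isDoc` is a hypothetical helper.

```typescript
// Local mirror of the Document structure described above.
interface Doc {
  id: string;
  content: string;
  metadata: { source?: string; loader?: string; [key: string]: unknown };
}

// Hypothetical helper: narrows an unknown value to Doc at runtime.
function isDoc(value: unknown): value is Doc {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const metadata = candidate["metadata"];
  return (
    typeof candidate["id"] === "string" &&
    typeof candidate["content"] === "string" &&
    typeof metadata === "object" &&
    metadata !== null
  );
}

// A value shaped like the loader output shown in this section.
const sampleDoc: Doc = {
  id: "doc-1",
  content: "Hello, world",
  metadata: { source: "./document.txt", loader: "TextLoader", team: "docs" },
};
```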

# TextLoader

Load plain text files with custom encoding support. This is the simplest loader: it reads a file and returns its content as a single document.

import { TextLoader } from 'orkajs/loaders/text';
 
const loader = new TextLoader('./document.txt', {
  encoding: 'utf-8',
  metadata: { source: 'documentation' }
});
 
const documents = await loader.load();
// [{ id: '...', content: '...', metadata: { source: 'documentation', loader: 'TextLoader' } }]

Constructor Parameters

path: string

Absolute or relative path to the text file.

options.encoding?: string

Character encoding (default: 'utf-8'). Supports 'utf-8', 'ascii', 'latin1', etc.

options.metadata?: Record<string, unknown>

Custom metadata to attach to the document. Useful for categorization, filtering, or tracking.
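
To illustrate the filtering use case, here is a minimal sketch that selects documents by a metadata field after loading. `filterByMetadata` and the local `Doc` type are assumptions made for this example, not OrkaJS exports.

```typescript
// Local mirror of the Document structure (named Doc to avoid the DOM type).
interface Doc {
  id: string;
  content: string;
  metadata: { source?: string; loader?: string; [key: string]: unknown };
}

// Hypothetical helper: keep documents whose metadata[key] strictly equals value.
function filterByMetadata(docs: Doc[], key: string, value: unknown): Doc[] {
  return docs.filter((doc) => doc.metadata[key] === value);
}

const loadedDocs: Doc[] = [
  { id: "1", content: "Install guide", metadata: { source: "documentation", loader: "TextLoader" } },
  { id: "2", content: "Release notes", metadata: { source: "changelog", loader: "TextLoader" } },
];

// Only the document tagged with source "documentation" remains.
const documentationOnly = filterByMetadata(loadedDocs, "source", "documentation");
```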

# CSVLoader

Parse CSV files with support for custom separators, column selection, and content extraction. Each row becomes a separate document, making it perfect for loading structured data like product catalogs, user lists, or FAQ databases.

import { CSVLoader } from 'orkajs/loaders/csv';
 
// Option 1: Use specific column as content
const loader = new CSVLoader('./data.csv', {
  separator: ',',
  contentColumn: 'description', // Use this column as document content
  metadata: { type: 'product_data' }
});

// Option 2: Combine multiple columns
const loader2 = new CSVLoader('./users.csv', {
  columns: ['name', 'bio', 'interests'], // Combine these columns
});
 
const documents = await loader.load();
// Each row becomes a separate document

💡 CSV Features

  • ✅ Handles quoted fields with commas
  • ✅ Custom separators (comma, semicolon, tab)
  • ✅ Automatic metadata extraction from columns
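
The quoted-field and custom-separator behaviour listed above can be sketched with a small hand-written parser. This is an illustration only, not CSVLoader's implementation: `parseCsvLine` and `csvToRecords` are hypothetical names, and the sketch does not handle line breaks inside quoted fields.

```typescript
// Split one CSV line into fields, honouring double-quoted fields and "" escapes.
function parseCsvLine(line: string, separator = ","): string[] {
  const fields: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (line.charAt(i + 1) === '"') {
          current += '"'; // doubled quote inside a quoted field
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        current += ch; // separators inside quotes are literal text
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === separator) {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Turn CSV text (header row first) into one record per data row.
function csvToRecords(text: string, separator = ","): Record<string, string>[] {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const headerLine = lines[0];
  if (headerLine === undefined) return [];
  const columns = parseCsvLine(headerLine, separator);
  return lines.slice(1).map((line) => {
    const values = parseCsvLine(line, separator);
    const record: Record<string, string> = {};
    columns.forEach((column, index) => {
      record[column] = values[index] ?? "";
    });
    return record;
  });
}

const csvRecords = csvToRecords('id,description,price\n1,"Lamp, desk",9.99');
```

Each record could then become one document, with the `description` value as content and the remaining columns as metadata.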

# JSONLoader

Load JSON files or objects with JSONPath support for nested data extraction. Handles both single objects and arrays, with flexible field mapping for content and metadata.

import { JSONLoader } from 'orkajs/loaders/json';
 
// Load from file
const loader = new JSONLoader('./data.json', {
  contentField: 'text', // Use this field as content
  metadataFields: ['author', 'date'], // Extract these as metadata
  jsonPath: '$.articles' // Extract nested array
});

// Load from object
const data = [
  { text: 'Article 1', author: 'Alice' },
  { text: 'Article 2', author: 'Bob' }
];
const loader2 = new JSONLoader(data, {
  contentField: 'text'
});
 
const documents = await loader.load();
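
As a rough picture of the field mapping described above, the following sketch resolves a simple dotted path and maps each record to content plus metadata. It is not JSONLoader's code: `resolveSimplePath` and `mapRecords` are hypothetical, and only the `$` and `$.a.b` path forms are supported here, which is a small subset of JSONPath.

```typescript
type JsonRecord = Record<string, unknown>;

// Resolve "$" or "$.a.b" against a parsed JSON value (no wildcards or filters).
function resolveSimplePath(root: unknown, path: string): unknown {
  const keys = path.replace(/^\$\.?/, "").split(".").filter((key) => key.length > 0);
  let current: unknown = root;
  for (const key of keys) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as JsonRecord)[key];
  }
  return current;
}

// Map a single object or an array of objects to { content, metadata } pairs.
function mapRecords(
  data: unknown,
  contentField: string,
  metadataFields: string[]
): { content: string; metadata: JsonRecord }[] {
  const items: unknown[] = Array.isArray(data) ? data : [data];
  return items.map((item) => {
    const record = (typeof item === "object" && item !== null ? item : {}) as JsonRecord;
    const metadata: JsonRecord = {};
    for (const field of metadataFields) {
      if (field in record) metadata[field] = record[field];
    }
    return { content: String(record[contentField] ?? ""), metadata };
  });
}

const parsedJson = {
  articles: [{ text: "Article 1", author: "Alice", date: "2024-01-15", draft: true }],
};
const mappedArticles = mapRecords(resolveSimplePath(parsedJson, "$.articles"), "text", ["author", "date"]);
```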

# MarkdownLoader

Load Markdown files with frontmatter extraction and header parsing. Perfect for documentation, blog posts, or any content with YAML frontmatter metadata.

import { MarkdownLoader } from 'orkajs/loaders/markdown';
 
const loader = new MarkdownLoader('./README.md', {
  removeFrontmatter: true, // Strip the YAML frontmatter block from the content
  includeHeaders: true, // Extract all headers as metadata
  metadata: { type: 'documentation' }
});
 
const documents = await loader.load();
// Frontmatter fields are added to metadata
// Headers are available in metadata.headers

Example Markdown with Frontmatter

---
title: Getting Started
author: Alice
date: 2024-01-15
---
 
# Introduction
 
This is the content...
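
To make the frontmatter and header extraction concrete, here is a simplified sketch applied to a file like the one above. It is an assumption-laden illustration, not MarkdownLoader's implementation: `splitFrontmatter` and `extractHeaders` are hypothetical names, only flat `key: value` lines are parsed (not full YAML), and headers inside fenced code blocks are not skipped.

```typescript
// Separate a leading "---" frontmatter block from the Markdown body.
function splitFrontmatter(markdown: string): { fields: Record<string, string>; body: string } {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(markdown);
  if (!match) return { fields: {}, body: markdown };
  const fields: Record<string, string> = {};
  for (const line of (match[1] ?? "").split(/\r?\n/)) {
    const colon = line.indexOf(":");
    if (colon > 0) {
      fields[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
    }
  }
  return { fields, body: markdown.slice(match[0].length) };
}

// Collect the text of ATX headers ("# Title" through "###### Title").
function extractHeaders(body: string): string[] {
  const headers: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.+)$/.exec(line);
    if (match && match[1] !== undefined) headers.push(match[1].trim());
  }
  return headers;
}

const markdownSample =
  "---\ntitle: Getting Started\nauthor: Alice\ndate: 2024-01-15\n---\n\n# Introduction\n\nThis is the content...\n";
const frontmatterResult = splitFrontmatter(markdownSample);
const headerList = extractHeaders(frontmatterResult.body);
```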

# PDFLoader

Extract text from PDF files with page selection and metadata extraction. Each page becomes a separate document with page number tracking, ideal for research papers, reports, or manuals.

📦 Installation Required

PDFLoader requires the pdf-parse package:

npm install pdf-parse

import { PDFLoader } from 'orkajs/loaders/pdf';

// Load entire PDF
const loader = new PDFLoader('./document.pdf', {
  metadata: { source: 'research_paper' }
});

// Load specific pages
const loader2 = new PDFLoader('./report.pdf', {
  pages: [1, 2, 3], // Only pages 1, 2, 3
  maxPages: 10 // Or limit to first 10 pages
});
 
const documents = await loader.load();
// Each page becomes a separate document with page number in metadata
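
The page selection options can be pictured with the sketch below, which works on an array of already-extracted page texts. It makes two assumptions that this section does not confirm: page numbers are 1-based, and `maxPages` caps whatever `pages` selected. `selectPages` is a hypothetical helper, not a PDFLoader method.

```typescript
interface PageSelection {
  pages?: number[]; // assumed 1-based page numbers
  maxPages?: number; // assumed upper bound on the number of returned pages
}

// Pick pages from extracted page texts and keep each page's number alongside it.
function selectPages(
  pageTexts: string[],
  selection: PageSelection = {}
): { pageNumber: number; text: string }[] {
  const total = pageTexts.length;
  let numbers = selection.pages ?? pageTexts.map((_, index) => index + 1);
  // Ignore page numbers that fall outside the document.
  numbers = numbers.filter((n) => Number.isInteger(n) && n >= 1 && n <= total);
  if (selection.maxPages !== undefined) {
    numbers = numbers.slice(0, Math.max(0, selection.maxPages));
  }
  return numbers.map((n) => ({ pageNumber: n, text: pageTexts[n - 1] ?? "" }));
}

const extractedPages = ["Abstract", "Methods", "Results", "References"];
const chosenPages = selectPages(extractedPages, { pages: [2, 4, 9] });
```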

# DirectoryLoader

Recursively load all supported files from a directory with automatic loader detection. DirectoryLoader scans entire directory trees, detects each file's type from its extension, and applies the appropriate loader automatically.

import { DirectoryLoader } from 'orkajs/loaders/directory';
 
const loader = new DirectoryLoader('./docs', {
  recursive: true, // Scan subdirectories
  glob: '*.md', // Filter by pattern
  exclude: ['node_modules', '.git'], // Exclude folders
  metadata: { project: 'orka-docs' }
});
 
const documents = await loader.load();
// Automatically detects and uses the right loader for each file type

| Extension | Loader |
| --- | --- |
| .txt | TextLoader |
| .md, .mdx | MarkdownLoader |
| .csv | CSVLoader |
| .json, .jsonl | JSONLoader |
| .ts, .js, .py, .html, .css | TextLoader |
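
The extension mapping above amounts to a lookup table. The sketch below shows one way such detection and folder exclusion could work; `loaderNameFor` and `isExcluded` are hypothetical helpers, and the case-insensitive matching is an assumption rather than documented DirectoryLoader behaviour.

```typescript
// Extension-to-loader table copied from the mapping above.
const LOADER_BY_EXTENSION: Record<string, string> = {
  ".txt": "TextLoader",
  ".md": "MarkdownLoader",
  ".mdx": "MarkdownLoader",
  ".csv": "CSVLoader",
  ".json": "JSONLoader",
  ".jsonl": "JSONLoader",
  ".ts": "TextLoader",
  ".js": "TextLoader",
  ".py": "TextLoader",
  ".html": "TextLoader",
  ".css": "TextLoader",
};

// Return the loader name for a path, or undefined when the extension is not in the table.
function loaderNameFor(filePath: string): string | undefined {
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".gitignore"
  return LOADER_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}

// True when any path segment matches an excluded folder name.
function isExcluded(filePath: string, exclude: string[]): boolean {
  return filePath.split(/[\\/]/).some((segment) => exclude.indexOf(segment) !== -1);
}

const detectedLoader = loaderNameFor("docs/guide/intro.mdx");
```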

Using with Knowledge

Loaders integrate seamlessly with Orka's Knowledge system. Here's a complete pipeline from loading documents to creating a searchable knowledge base:

create-knowledge-base.ts
import { createOrka } from 'orkajs';
import { DirectoryLoader } from 'orkajs/loaders/directory';
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const orka = createOrka({ /* config */ });

// Load documents
const loader = new DirectoryLoader('./knowledge-base');
const documents = await loader.load();

// Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);

// Create the knowledge base from the split chunks
await orka.knowledge.create({
  name: 'my-knowledge',
  source: chunks.map(c => ({ text: c.content, metadata: c.metadata }))
});
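
To show how `chunkSize` and `chunkOverlap` interact, here is a plain sliding-window splitter. It is deliberately simpler than a recursive character splitter (it never looks for paragraph or sentence boundaries) and is not OrkaJS code; `splitWithOverlap` is a hypothetical name.

```typescript
// Cut text into windows of chunkSize characters, each sharing chunkOverlap
// characters with the previous window.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap; // distance between consecutive chunk starts
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // the last window reached the end
  }
  return chunks;
}

const overlapDemo = splitWithOverlap("abcdefghij", 4, 2);
```

With the values used in the pipeline above (1000 and 200), this scheme would start a new chunk every 800 characters.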

Tree-shaking Imports

Import only what you need to minimize bundle size:

// ❌ Imports everything
import { CSVLoader, PDFLoader } from 'orkajs';
 
// ✅ Tree-shakeable - only bundles CSVLoader
import { CSVLoader } from 'orkajs/loaders/csv';
 
// ✅ Import all loaders from index
import { CSVLoader, PDFLoader, JSONLoader } from 'orkajs/loaders';