# Text Splitters

Split documents into chunks optimized for embeddings and retrieval with intelligent, structure-aware strategies.
# Why Split Text?
Large documents need to be split into smaller chunks for effective semantic search and context injection. Good splitting preserves meaning and respects document structure.
- **Optimal chunk size:** 500-1000 characters for Q&A, 1500-2000 for summarization
- **Chunk overlap:** 10-20% overlap preserves context at chunk boundaries
- **Structure aware:** split on paragraph and sentence boundaries, never mid-word
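The size-plus-overlap idea can be pictured as a simple sliding window. This is an illustration only, not the orkajs implementation:

```typescript
// Minimal sliding-window chunker: each chunk starts `size - overlap`
// characters after the previous one, so adjacent chunks share `overlap`
// characters of context.
function windowChunks(text: string, size: number, overlap: number): string[] {
  const step = size - overlap;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + size));
    if (i + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

With size 1000 and overlap 200 (20%), each chunk repeats the last 200 characters of its predecessor, so a sentence cut at one boundary still appears whole in the neighboring chunk.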
# RecursiveCharacterTextSplitter
The most versatile splitter. Uses hierarchical separators to split text while preserving semantic boundaries. Tries to split on paragraphs first, then sentences, then words, and finally characters as a last resort.
```typescript
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,                           // Target chunk size in characters
  chunkOverlap: 200,                         // Overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ', ''], // Try these in order
  keepSeparator: true,                       // Keep separators in chunks
  trimWhitespace: true                       // Remove leading/trailing whitespace
});

const text = `Long document content here...`;
const chunks = splitter.split(text);

// Or split multiple documents
const documents = [
  { id: '1', content: 'Doc 1...', metadata: {} },
  { id: '2', content: 'Doc 2...', metadata: {} }
];
const allChunks = splitter.splitDocuments(documents);
```

**🎯 How It Works**
- Tries to split on double newlines (paragraphs)
- If chunks are still too large, tries single newlines
- Then sentences (periods), then words (spaces)
- Finally characters as last resort
- Maintains overlap for context continuity
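The fallback chain above can be sketched in a few lines. This is an illustration of the strategy, not the orkajs source; it omits chunk overlap, and separators between emitted chunks are dropped (the real splitter's `keepSeparator` option controls that):

```typescript
// Recursively split `text` so no chunk exceeds `chunkSize`, trying coarse
// separators (paragraphs) before fine ones (sentences, words, characters).
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ', '']
): string[] {
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  // Last resort: no separator left means a hard cut at chunkSize characters.
  if (!sep) {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      out.push(text.slice(i, i + chunkSize));
    }
    return out;
  }
  const out: string[] = [];
  let current = '';
  for (const piece of text.split(sep)) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= chunkSize) {
      current = candidate; // greedily merge small pieces into one chunk
    } else {
      if (current) out.push(current);
      if (piece.length > chunkSize) {
        // Piece is still too big: recurse with the finer separators.
        out.push(...recursiveSplit(piece, chunkSize, rest));
        current = '';
      } else {
        current = piece;
      }
    }
  }
  if (current) out.push(current);
  return out;
}
```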
# MarkdownTextSplitter
Specialized splitter for Markdown that respects document structure: headers, code blocks, and lists. Perfect for documentation where maintaining the hierarchy and code examples is crucial.
```typescript
import { MarkdownTextSplitter } from 'orkajs/splitters/markdown';

const splitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});

const markdown = `## Introduction

This is a paragraph...

### Subsection

More content here...

\`\`\`typescript
const code = 'example';
\`\`\``;

const chunks = splitter.split(markdown);
// Splits at headers, preserving structure
```

| Separator | Priority | Description |
|---|---|---|
| `\n##` | 1 | H2 headers |
| `\n###` | 2 | H3 headers |
| `` \n```\n `` | 3 | Code blocks |
| `\n---\n` | 4 | Horizontal rules |
| `\n\n` | 5 | Paragraphs |
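One way to picture the priority order: scan the list top to bottom and split on the first separator the text actually contains. A sketch only, not the orkajs internals:

```typescript
// Priority-ordered markdown separators, mirroring the table above.
const MD_SEPARATORS = ['\n##', '\n###', '\n```\n', '\n---\n', '\n\n'];

// Return the highest-priority separator present in the text, if any.
function firstSeparator(text: string): string | undefined {
  return MD_SEPARATORS.find(sep => text.includes(sep));
}
```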
# CodeTextSplitter
Language-aware splitter that respects code structure: classes, functions, and blocks. Uses language-specific separators to split at natural boundaries like function definitions, class declarations, and import statements.
```typescript
import { CodeTextSplitter } from 'orkajs/splitters/code';

const splitter = new CodeTextSplitter({
  language: 'typescript', // or 'python', 'javascript', 'java', etc.
  chunkSize: 1000,
  chunkOverlap: 200
});

const code = `export class MyClass {
  constructor() {}

  method1() {
    // implementation
  }

  method2() {
    // implementation
  }
}

export function helperFunction() {
  // implementation
}`;

const chunks = splitter.split(code);
// Splits at class/function boundaries
```

**Supported Languages**

TypeScript, JavaScript, Python, Java, and more via the `language` option.
# TokenTextSplitter
Split text based on estimated token count, useful for staying within LLM context limits. Uses a character-to-token ratio estimation (default: 4 chars per token for English) to ensure chunks fit within model constraints.
```typescript
import { TokenTextSplitter } from 'orkajs/splitters/token';

const splitter = new TokenTextSplitter({
  chunkSize: 500,              // Target tokens per chunk
  chunkOverlap: 50,            // Overlap in tokens
  estimatedTokensPerChar: 0.25 // ~4 chars per token (English)
});

const text = `Long document...`;
const chunks = splitter.split(text);
// Each chunk is approximately 500 tokens
```

**⚠️ Token Estimation**
This splitter uses character-based estimation. For precise token counting, consider using a tokenizer library like tiktoken or gpt-tokenizer.
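The character-based estimate amounts to a one-liner. A sketch under the default ratio; swap in a real tokenizer for exact counts:

```typescript
// Estimate token count from character length. 0.25 tokens per character
// corresponds to the ~4-characters-per-token heuristic for English text.
function estimateTokens(text: string, tokensPerChar = 0.25): number {
  return Math.ceil(text.length * tokensPerChar);
}
```

A 2000-character chunk estimates to 500 tokens, matching `chunkSize: 500` above. Note that non-English or code-heavy text can deviate noticeably from the 4:1 ratio.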
# Comparison
| Splitter | Best For | Preserves |
|---|---|---|
| RecursiveCharacter | General text, articles, books | Paragraphs, sentences |
| Markdown | Documentation, READMEs | Headers, code blocks |
| Code | Source code files | Classes, functions |
| Token | LLM context limits | Token boundaries |
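A common pattern is to dispatch on file type. The helper below is hypothetical (not part of orkajs) and simply encodes the table above:

```typescript
// Map a filename to the splitter suggested by the comparison table.
function splitterFor(filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  if (ext === 'md' || ext === 'mdx') return 'MarkdownTextSplitter';
  if (['ts', 'js', 'py', 'java'].includes(ext)) return 'CodeTextSplitter';
  return 'RecursiveCharacterTextSplitter';
}
```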
# Complete Example
Here's a complete pipeline showing how to load, split, and index documents:
```typescript
import { createOrka } from 'orkajs';
import { MarkdownLoader } from 'orkajs/loaders/markdown';
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const orka = createOrka({ /* config */ });

// 1. Load documents
const loader = new MarkdownLoader('./docs/guide.md');
const documents = await loader.load();

// 2. Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);

// 3. Create knowledge base
await orka.knowledge.create({
  name: 'documentation',
  source: chunks.map(c => ({ text: c.content, metadata: c.metadata }))
});

// 4. Query
const result = await orka.ask({
  knowledge: 'documentation',
  question: 'How do I configure Orka AI?'
});
```

# Tree-shaking Imports
Import only what you need to minimize bundle size:
```typescript
// ✅ Import only what you need
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';
import { MarkdownTextSplitter } from 'orkajs/splitters/markdown';

// ✅ Or import from the index
import { RecursiveCharacterTextSplitter, CodeTextSplitter } from 'orkajs/splitters';
```