OrkaJS

Text Splitters

Split documents into chunks optimized for embeddings and retrieval with intelligent strategies.

Why Split Text?

Large documents need to be split into smaller chunks for effective semantic search and context injection. Good splitting preserves meaning and respects document structure.

- 🎯 **Optimal Chunk Size**: 500-1000 chars for Q&A, 1500-2000 for summarization
- 🔗 **Chunk Overlap**: 10-20% overlap preserves context at boundaries
- 📐 **Structure Aware**: split on paragraphs and sentences, never mid-word
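To make the size/overlap numbers concrete, here is a naive fixed-size splitter in plain TypeScript (not OrkaJS code; a real splitter would also snap to structural boundaries rather than cut at arbitrary offsets):

```typescript
// Naive fixed-size chunking with overlap, to illustrate the numbers above.
// Each chunk starts (chunkSize - overlap) characters after the previous one,
// so consecutive chunks share `overlap` characters at their boundary.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// With chunkSize 1000 and overlap 200 (20%), adjacent chunks share
// their last/first 200 characters, preserving context at the seam.
```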

# RecursiveCharacterTextSplitter

The most versatile splitter. Uses hierarchical separators to split text while preserving semantic boundaries. Tries to split on paragraphs first, then sentences, then words, and finally characters as a last resort.

```typescript
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,        // Target chunk size in characters
  chunkOverlap: 200,      // Overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ', ''], // Try these in order
  keepSeparator: true,    // Keep separators in chunks
  trimWhitespace: true    // Remove leading/trailing whitespace
});

const text = `Long document content here...`;
const chunks = splitter.split(text);

// Or split multiple documents
const documents = [
  { id: '1', content: 'Doc 1...', metadata: {} },
  { id: '2', content: 'Doc 2...', metadata: {} }
];
const allChunks = splitter.splitDocuments(documents);
```

🎯 How It Works

  1. Tries to split on double newlines (paragraphs)
  2. If chunks are still too large, tries single newlines
  3. Then sentences (periods), then words (spaces)
  4. Finally characters as last resort
  5. Maintains overlap for context continuity
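The hierarchy above can be sketched as a small standalone function (a simplified illustration, not OrkaJS internals; it omits the overlap and the merging of small adjacent pieces that a production splitter performs):

```typescript
// Simplified recursive splitting: try the coarsest separator first, and
// recurse with finer separators on any piece that is still too large.
function recursiveSplit(text: string, separators: string[], chunkSize: number): string[] {
  if (text.length <= chunkSize) return [text];
  const [sep, ...finer] = separators;
  if (sep === undefined || sep === '') {
    // Last resort: hard character cut.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) out.push(text.slice(i, i + chunkSize));
    return out;
  }
  const out: string[] = [];
  for (const part of text.split(sep)) {
    if (part.length > chunkSize) out.push(...recursiveSplit(part, finer, chunkSize));
    else if (part.length > 0) out.push(part);
  }
  return out;
}
```

A short paragraph that fits the budget survives intact, while an oversized one is progressively broken down on sentences, then words.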

# MarkdownTextSplitter

Specialized splitter for Markdown that respects document structure: headers, code blocks, and lists. Perfect for documentation where maintaining the hierarchy and code examples is crucial.

```typescript
import { MarkdownTextSplitter } from 'orkajs/splitters/markdown';

const splitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});

const markdown = `
## Introduction

This is a paragraph...

### Subsection

More content here...

\`\`\`typescript
const code = 'example';
\`\`\`
`;

const chunks = splitter.split(markdown);
// Splits at headers, preserving structure
```
| Separator | Priority | Description |
| --- | --- | --- |
| `\n##` | 1 | H2 headers |
| `\n###` | 2 | H3 headers |
| ````\n```\n```` | 3 | Code blocks |
| `\n---\n` | 4 | Horizontal rules |
| `\n\n` | 5 | Paragraphs |
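The header-priority idea can be approximated with a single regex split, independent of OrkaJS (note this naive version would also match header-like lines inside fenced code blocks, which a real Markdown splitter avoids):

```typescript
// Split a Markdown string into sections at H2/H3 headers, keeping the
// header line attached to the section it introduces (via a lookahead).
function splitAtHeaders(markdown: string): string[] {
  return markdown
    .split(/(?=^#{2,3} )/m)
    .map(s => s.trim())
    .filter(s => s.length > 0);
}
```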

# CodeTextSplitter

Language-aware splitter that respects code structure: classes, functions, and blocks. Uses language-specific separators to split at natural boundaries like function definitions, class declarations, and import statements.

```typescript
import { CodeTextSplitter } from 'orkajs/splitters/code';

const splitter = new CodeTextSplitter({
  language: 'typescript', // or 'python', 'javascript', 'java', etc.
  chunkSize: 1000,
  chunkOverlap: 200
});

const code = `
export class MyClass {
  constructor() {}

  method1() {
    // implementation
  }

  method2() {
    // implementation
  }
}

export function helperFunction() {
  // implementation
}
`;

const chunks = splitter.split(code);
// Splits at class/function boundaries
```

Supported Languages

• TypeScript
• JavaScript
• Python
• Java
• Go
• Rust
• C++
• HTML
• CSS
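The per-language separator tables behind such a splitter might look roughly like this (hypothetical values for illustration, shown for two languages only; the actual lists OrkaJS uses may differ):

```typescript
// Hypothetical per-language separator lists: try declaration boundaries
// first, then blank lines, then single newlines, then spaces.
const codeSeparators: Record<string, string[]> = {
  typescript: ['\nclass ', '\nfunction ', '\nexport ', '\n\n', '\n', ' '],
  python:     ['\nclass ', '\ndef ', '\n\n', '\n', ' '],
};
```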

# TokenTextSplitter

Split text based on estimated token count, useful for staying within LLM context limits. Uses a character-to-token ratio estimation (default: 4 chars per token for English) to ensure chunks fit within model constraints.

```typescript
import { TokenTextSplitter } from 'orkajs/splitters/token';

const splitter = new TokenTextSplitter({
  chunkSize: 500,              // Target tokens per chunk
  chunkOverlap: 50,            // Overlap in tokens
  estimatedTokensPerChar: 0.25 // ~4 chars per token (English)
});

const text = `Long document...`;
const chunks = splitter.split(text);

// Each chunk is approximately 500 tokens
```

⚠️ Token Estimation

This splitter uses character-based estimation. For precise token counting, consider using a tokenizer library like tiktoken or gpt-tokenizer.
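The estimation arithmetic itself is simple (a sketch of the heuristic, not the library's internals):

```typescript
// Character-based token estimate: ~4 characters per token for English text.
// A heuristic only; use a real tokenizer for exact counts.
function estimateTokens(text: string, tokensPerChar = 0.25): number {
  return Math.ceil(text.length * tokensPerChar);
}

// The inverse: a token budget expressed as a character budget.
// A 500-token chunk therefore corresponds to roughly 2000 characters.
function maxCharsForTokens(tokenBudget: number, tokensPerChar = 0.25): number {
  return Math.floor(tokenBudget / tokensPerChar);
}
```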

Comparison

| Splitter | Best For | Preserves |
| --- | --- | --- |
| RecursiveCharacter | General text, articles, books | Paragraphs, sentences |
| Markdown | Documentation, READMEs | Headers, code blocks |
| Code | Source code files | Classes, functions |
| Token | LLM context limits | Token boundaries |

Complete Example

Here's a complete pipeline showing how to load, split, and index documents:

```typescript
import { createOrka } from 'orkajs';
import { MarkdownLoader } from 'orkajs/loaders/markdown';
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';

const orka = createOrka({ /* config */ });

// 1. Load documents
const loader = new MarkdownLoader('./docs/guide.md');
const documents = await loader.load();

// 2. Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);

// 3. Create knowledge base
await orka.knowledge.create({
  name: 'documentation',
  source: chunks.map(c => ({
    text: c.content,
    metadata: c.metadata
  }))
});

// 4. Query
const result = await orka.ask({
  knowledge: 'documentation',
  question: 'How do I configure Orka AI?'
});
```

Tree-shaking Imports

Import only what you need to minimize bundle size:

```typescript
// ✅ Import only what you need
import { RecursiveCharacterTextSplitter } from 'orkajs/splitters/recursive';
import { MarkdownTextSplitter } from 'orkajs/splitters/markdown';

// ✅ Or import from index
import { RecursiveCharacterTextSplitter, CodeTextSplitter } from 'orkajs/splitters';
```