Document Chunker
Quick Install
No extra dependencies needed - DocumentChunker is ready to roll right out of the box for plain text! ๐
For document processing (PDFs, DOCX, EPUB, ODT, Excel, etc.), install the structured-document extra:
This installs all the document processing dependencies needed to handle PDFs, DOCX, EPUB, ODT, Excel, and more! ๐
Taming Your Text and Documents with Precision
Got a wall of text that's overwhelming? The DocumentChunker transforms unruly paragraphs into perfectly sized, context-aware chunks. Perfect for RAG systems and document analysis.
It preserves meaning and flow โ no confusing puzzle pieces.
Where DocumentChunker Really Shines
The DocumentChunker comes packed with smart features that make it your go-to text wrangling sidekick:
- Flexible Composable Constraints: Ultimate control over your chunks! Mix and match limits based on sentences, tokens, or section breaks (headings, horizontal rules,
<details>tags). Craft exactly the chunk size you need with precision control! ๐ฏ - Intelligent Overlap: Adds smart overlaps between chunks so your text flows smoothly. No more jarring transitions that leave readers scratching their heads!
- Extensive Multilingual Support: Speaks over 50 languages fluently, thanks to our trusty sentence splitter. Global domination through better text chunking! ๐
- Customizable Token Counting: Plug in your own token counter for perfect alignment with different LLMs. Because one size definitely doesn't fit all models!
- Memory-Conscious Operation: Handles massive documents efficiently by yielding chunks one at a time. Your RAM will thank you later! ๐พ
- Multi-Format Maestro: From corporate DOCX boardrooms to academic PDF libraries, this chunker speaks every file language fluently! Handles
.pdf,.docx,.epub,.txt,.tex,.html,.hml,.md,.rst,.rtf,.odt,.csv, and.xlsxfiles like a pro. ๐ - Metadata Magician: Not just text - it automatically enriches your chunks with valuable metadata. Your chunks come with bonus context! ๐
- Bulk Processing Powerhouse: Got a mountain of documents to conquer? No problem! This powerhouse efficiently processes multiple documents in parallel. ๐โก
- Pluggable Processor Power: Have a mysterious file format that's one-of-a-kind? Plug in your own custom processors -
DocumentChunkeris ready for any challenge you throw at it! ๐๐ ๏ธ
No Scanned PDF Support
Currently, DocumentChunker does not support scanned PDFs (images). It can only process PDFs with selectable/extractable text. For scanned documents, you'll need to OCR them first before chunking! ๐ท
Composable Constraints: Your Text, Your Rules!
DocumentChunker lets you call the shots with composable constraints. Mix and match limits to craft the perfect chunk size for your needs. Here's the constraint menu:
| Constraint | Value Requirement | Description |
|---|---|---|
max_sentences |
int >= 1 |
Sentence power mode! Tell us how many sentences per chunk, and we'll group them thoughtfully so your ideas flow like a well-written story. |
max_tokens |
int >= 12 |
Token budget watcher! We'll carefully pack sentences into chunks while respecting your token limits. If a sentence gets too chatty, we'll politely split it at clause boundaries. ๐ค |
max_section_breaks |
int >= 1 |
Structure superhero! Limits section breaks per chunk โ headings (##), horizontal rules (---, ***, ___), and <details> tags. Your document structure stays intact! |
Quick Note: Constraints Required!
You must specify at least one limit (max_sentences, max_tokens, or max_section_breaks) when using chunking methods. Forget to add one? You'll get an InvalidInputError!
The DocumentChunker has four main methods: chunk_text, chunk_file, chunk_texts, and chunk_files. chunk_text and chunk_file return a list of Box objects, while chunk_texts and chunk_files are memory-friendly generators that yield chunks one by one. Each Box has content (the actual text) and metadata (all the juicy details). Check the Metadata guide for the full scoop!
Single: Chunk One Text! ๐
Chunk a single string of text into manageable pieces using various constraints:
- chunk_text() - accepts raw text as a string
- chunk_file() - accepts a file path as a string or pathlib.Path object
Chunking by Sentences: Sentence Group Guru! ๐
Let's say you have this text:
# Introduction to Chunking
This is the first paragraph of our document. It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization. We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information. It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
### Different Strategies
There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings. Each method has its own advantages depending on the specific use case.
---
## Advanced Chunking Techniques
Ready to go beyond the basics? Let's explore some pro-level techniques!
### Overlap Considerations
Overlap is your secret weapon! It includes a bit of the previous chunk at the start of the next one, ensuring your text doesn't feel choppy or disconnected.
---
# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Now let's see how to chunk it:
- Initialize
DocumentChunker- no extra dependencies needed for plain text! lang="auto"lets us detect the language automatically. Super convenient, but specifying a known language likelang="en"can boost accuracy and speed.overlap_percent=0means no overlap between chunks. By default, we add 20% overlap to keep your text flowing smoothly across chunks.offset=0starts us from the very beginning of the text. (Zero-based indexing - because programmers love starting from zero!)
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (1, 73)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (74, 259)}
Content: It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (261, 370)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
### Different Strategies
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (371, 529)}
Content: There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (531, 748)}
Content: ---
## Advanced Chunking Techniques
--- Chunk 6 ---
Metadata: {'chunk_num': 6, 'span': (750, 787)}
Content: Ready to go beyond the basics?
Let's explore some pro-level techniques!
--- Chunk 7 ---
Metadata: {'chunk_num': 7, 'span': (788, 859)}
Content: ### Overlap Considerations
Overlap is your secret weapon!
--- Chunk 8 ---
Metadata: {'chunk_num': 8, 'span': (861, 919)}
Content: ### Overlap Considerations
Overlap is your secret weapon!
--- Chunk 9 ---
Metadata: {'chunk_num': 9, 'span': (920, 1050)}
Content: It includes a bit of the previous chunk at the start of the next one, ensuring your text doesn't feel choppy or disconnected.
---
--- Chunk 10 ---
Metadata: {'chunk_num': 10, 'span': (1052, 1157)}
Content: # Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Chunking by Tokens: Token Budget Master! ๐ช
Token Counter Requirement
When using the max_tokens constraint, a token_counter function is essential. This function, which you provide, should accept a string and return an integer representing its token count. Failing to provide a token_counter will result in a MissingTokenCounterError.
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (1, 290)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (261, 573)}
Content: ...
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
### Different Strategies
There are several strategies for chunking,
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (531, 859)}
Content: There are several strategies for chunking,
including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
---
## Advanced Chunking Techniques
Ready to go beyond the basics?
Let's explore some pro-level techniques!
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (819, 1080)}
Content: Let's explore some pro-level techniques!
### Overlap Considerations
Overlap is your secret weapon!
It includes a bit of the previous chunk at the start of the next one, ensuring your text doesn't feel choppy or disconnected.
---
# Conclusion
In conclusion,
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1052, 1157)}
Content: ...
# Conclusion
In conclusion,
mastering chunking is key to unlocking the full potential of your text data.
Overrides token_counter
You can also provide the token_counter directly to any chunking method. If provided in both the constructor and the method, the one in the method will be used.
Chunking by Section Breaks: Structure Superhero! ๐ฆธโโ๏ธ
This constraint is useful for documents structured with Markdown headings or thematic breaks.
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (1, 503)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (371, 753)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own,
which is crucial for downstream applications.
### Different Strategies
There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
---
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (678, 859)}
Content: Each method has its own advantages depending on the specific use case.
---
## Advanced Chunking Techniques
Ready to go beyond the basics?
Let's explore some pro-level techniques!
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (819, 1050)}
Content: Let's explore some pro-level techniques!
### Overlap Considerations
Overlap is your secret weapon!
It includes a bit of the previous chunk at the start of the next one, ensuring your text doesn't feel choppy or disconnected.
---
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1047, 1157)}
Content: ...
---
# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Combining Multiple Constraints: Mix and Match Magic! ๐ญ
The real power of DocumentChunker comes from combining multiple constraints. The chunking will stop as soon as any of the limits is reached.
Customizing the Continuation Marker
You can customize the continuation marker, which is prepended to clauses that don't fit in the previous chunk. To do this, pass the continuation_marker parameter to the chunker's constructor.
If you don't want any continuation marker, you can set it to an empty string:
Enable Verbose Logging
To see detailed logging during the chunking process, you can set the verbose parameter to True when initializing the DocumentChunker:
Custom Sentence Splitter
You can provide a custom SentenceSplitter instance to DocumentChunker for specialized sentence splitting behavior. For more details, see the Sentence Splitter documentation.
Adding Base Metadata
You can pass a base_metadata dictionary to chunk_text and chunk_texts. This metadata will be included in each chunk. For example: chunker.chunk_text(..., base_metadata={"source": "my_document.txt"}). For more details, see the Metadata guide.
Single File: Process One Document! ๐
While chunk_text is perfect for plain text, chunk_file handles document files. It supports the same constraints (max_sentences, max_tokens, max_section_breaks, etc.).
Special Handling for Streaming Processors
Some processors work differently due to their streaming nature - they yield content page by page or in blocks rather than all at once. This means they require special care:
Streaming processors (PDF, EPUB, DOCX, ODT): These beauties process content as they go, so they're designed for chunk_files method. Using them with chunk_file will throw a FileProcessingError since chunk_file expects all content upfront.
Regular processors work fine with both chunk_file and chunk_files methods.
Batch: Chunk Multiple Items! ๐
While chunk_text is perfect for single texts and chunk_file for single files, chunk_texts and chunk_files are your power players for processing multiple texts or files in parallel. They use memory-friendly generators so you can handle massive collections with ease.
chunk_texts()- process multiple raw text stringschunk_files()- process multiple file paths
For Texts
- Specifies the number of parallel processes to use for chunking. The default value is
None(use all available CPU cores). - Define how to handle errors during processing. If set to
"raise"(default), an exception will be raised immediately. If set to"break", the process will halt and partial result will be returned. If set to"ignore", errors will be silently ignored. - Display a progress bar during batch processing. The default value is
False.
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 82)}
Content: Este es el primer documento.
Contiene varias frases para la segmentaciรณn de texto.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (83, 196)}
Content: El segundo ejemplo es mรกs extenso para probar el procesamiento por lotes.
La tercera oraciรณn aรฑade mรกs contenido.
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (197, 236)}
Content: Y la cuarta oraciรณn para mayor medida.
--- Chunk 4 ---
Metadata: {'chunk_num': 1, 'span': (0, 96)}
Content: This is the first document.
It has multiple sentences for chunking.
Here is the second document.
--- Chunk 5 ---
Metadata: {'chunk_num': 2, 'span': (97, 201)}
Content: It is a bit longer to test batch processing effectively.
This is the third sentence to add more content.
--- Chunk 6 ---
Metadata: {'chunk_num': 3, 'span': (202, 294)}
Content: And the fourth sentence for good measure.
The fifth sentence makes it even more interesting.
For Files
Generator Cleanup
When using chunk_texts, it's crucial to ensure the generator is properly closed, especially if you don't iterate through all the chunks. This is necessary to release the underlying multiprocessing resources. The recommended way is to use a try...finally block to call close() on the generator. For more details, see the Troubleshooting guide.
Separator: Keeping Your Batches Organized! ๐
The separator parameter works for both chunk_texts and chunk_files. It lets you add a custom marker that gets yielded after all chunks from a single input are processed. Super handy for batch processing when you want to clearly separate chunks from different source texts.
Quick Note
None won't work as a separator - you'll need something more substantial!
- Avoid processing the empty list at the end if stream ends with separator
Click to show output
--- Chunks for Document 1 ---
Content: This is the first document.
Metadata: {'chunk_num': 1, 'span': (0, 27)}
Content: It has two sentences.
Metadata: {'chunk_num': 2, 'span': (28, 49)}
--- Chunks for Document 2 ---
Content: This is the second document.
Metadata: {'chunk_num': 1, 'span': (0, 28)}
Content: It also has two sentences.
Metadata: {'chunk_num': 2, 'span': (29, 55)}
Custom processors: Build Your Own Document Wizards! ๐ ๏ธ๐ฎ
Want to handle exotic file formats that DocumentChunker doesn't know about? Create your own custom processors! This lets you add specialized processing for any file type and prioritize your custom processors over the built-in ones.
Global Registry Alert!
Custom processors get registered globally - once you add one, it's available everywhere in your app. Watch out for side effects if you're registering processors across different parts of your codebase, especially in multi-threaded or long-running applications!
To use a custom processor, you leverage the @custom_processor_registry.register decorator. This decorator allows you to register your function for one or more file extensions directly. Your custom processor function must accept a single file_path parameter (str) and return a tuple[str | list[str], dict] containing extracted text (or list of texts for multi-section documents) and a metadata dictionary.
Custom Processor Rules
- Your function must accept exactly one required parameter (the file path)
- Optional parameters with defaults are totally fine
- File extensions must start with a dot (like
.json,.custom) - Lambda functions are not supported unless you provide a
nameparameter - The metadata dictionary will be merged with common metadata (chunk_num, span, source)
- For multi-section documents, return a list of strings - each will be processed as a separate section
- If an error occurs during the document processing (e.g., an issue with the custom processor function), a
CallbackErrorwill be raised
Click to show output
--- Chunk 1 ---
Content:
This is the first paragraph of our longer JSON sample.
It contains multiple sentences to test the chunking process.
The second paragraph introduces a new topic.
We are exploring the capabilities of custom processors in the chunklet library.
Finally, the third paragraph concludes our sample.
Metadata:
{'document_id': 'doc-12345', 'created_at': '2025-11-05', 'source': '/tmp/tmpdt6xa5rh.json', 'chunk_num': 1, 'span': (0, 292)}
--- Chunk 2 ---
Content:
... the third paragraph concludes our sample.
Metadata:
{'document_id': 'doc-12345', 'created_at': '2025-11-05', 'source': '/tmp/tmpdt6xa5rh.json', 'chunk_num': 2, 'span': (250, 292)}
Registering Without the Decorator
If you prefer not to use decorators, you can directly use the custom_processor_registry.register() method. This is particularly useful when registering processors dynamically.
custom_processor_registry Methods Summary
processors: Returns a shallow copy of the dictionary of registered processors.is_registered(ext: str): Checks if a processor is registered for the given file extension, returningTrueorFalse.register(callback: Callable[[str], ReturnType] | None = None, *exts: str, name: str | None = None): Registers a processor callback for one or more file extensions.unregister(*exts: str): Removes processor(s) from the registry.clear(): Clears all registered processors from the registry.extract_data(file_path: str, ext: str): Processes a file using a registered processor, returning the extracted data and the name of the processor used.
API Reference
For complete technical details on the DocumentChunker class, check out the API documentation.