Plain Text Chunker
Quick Install
No extra dependencies needed - PlainTextChunker is ready to roll right out of the box!
Taming Your Text with Precision
Got a wall of text that's feeling a bit... overwhelming? The PlainTextChunker is your friendly neighborhood text organizer that transforms unruly paragraphs into perfectly sized, context-aware chunks. Perfect for RAG systems, document analysis, and any workflow that needs smart text segmentation with full control over chunk sizes.
Forget dumb splitting - we're talking intelligent segmentation that actually understands context! The PlainTextChunker works hard to preserve meaning and flow, so your chunks don't end up as confusing puzzle pieces.
Ready to bring some order to the chaos? Let's dive in and make your text behave!
Where PlainTextChunker Really Shines
The PlainTextChunker comes packed with smart features that make it your go-to text wrangling sidekick:
- Flexible Constraint-Based Chunking: Ultimate control over your chunks! Mix and match limits based on sentences, tokens, or Markdown section breaks. Craft exactly the chunk size you need with precision control!
- Intelligent Overlap for Context Preservation: Adds smart overlaps between chunks so your text flows smoothly. No more jarring transitions that leave readers scratching their heads!
- Extensive Multilingual Support: Speaks over 50 languages fluently, thanks to our trusty sentence splitter. Global domination through better text chunking!
- Customizable Token Counting: Plug in your own token counter for perfect alignment with different LLMs. Because one size definitely doesn't fit all models!
- Optimized Parallel Processing: Blazes through large texts using multiple processes. Speed demon mode activated!
- Memory-Conscious Operation: Handles massive documents efficiently by yielding chunks one at a time. Your RAM will thank you later!
Constraint-Based Chunking: Your Text, Your Rules!
PlainTextChunker lets you call the shots with constraint-based chunking. Mix and match limits to craft the perfect chunk size for your needs. Here's the constraint menu:
| Constraint | Value Requirement | Description |
|---|---|---|
| `max_sentences` | `int >= 1` | Sentence power mode! Tell us how many sentences per chunk, and we'll group them thoughtfully so your ideas flow like a well-written story. |
| `max_tokens` | `int >= 12` | Token budget watcher! We'll carefully pack sentences into chunks while respecting your token limits. If a sentence gets too chatty, we'll politely split it at clause boundaries. |
| `max_section_breaks` | `int >= 1` | Structure superhero! Limits Markdown section breaks per chunk (headings `##`, rules `---`) to keep your document's organization intact. Your headings stay where they belong! |
The PlainTextChunker has two main methods: `chunk` for single texts and `batch_chunk` for processing multiple texts at once. `chunk` returns a list of handy `Box` objects, while `batch_chunk` is a memory-friendly generator that yields chunks one by one. Each `Box` has `content` (the actual text) and `metadata` (all the juicy details). Check the Metadata guide for the full scoop!
Quick Note: Constraints Required!
You must specify at least one limit (like `max_sentences`, `max_tokens`, or `max_section_breaks`) when using `chunk` or `batch_chunk`. Forget to add one? You'll get an `InvalidInputError` - but don't worry, it's an easy fix!
Single Run:
For our examples, we'll use this sample text:
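The original sample text block appears to have been lost in this page; below is a reconstruction assembled from the chunk outputs shown throughout the examples (the exact whitespace, and therefore the character offsets in the `span` metadata, may differ slightly from the original):

```python
sample_text = """# Introduction to Chunking

This is the first paragraph of our document. It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization. We aim to break down large documents into manageable, context-rich pieces.

## Why is Chunking Important?

Effective chunking helps in maintaining the semantic coherence of information. It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.

### Different Strategies

There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings. Each method has its own advantages depending on the specific use case.

---

## Advanced Chunking Techniques

Beyond basic splitting, advanced techniques involve understanding the document's structure. For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents. This section will delve into such methods.

### Overlap Considerations

To ensure smooth transitions between chunks, an overlap mechanism is often employed. This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.

---

# Conclusion

In conclusion, mastering chunking is key to unlocking the full potential of your text data. Experiment with different constraints to find the optimal strategy for your needs."""
```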
Chunking by Sentences: Sentence Power Mode!
Ready to chunk by sentence count? This is perfect when you want predictable, idea-focused chunks. Let's see it in action:
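A sketch of the call, using the sample text from above bound to a `sample_text` variable (the import path and the `max_sentences` value are assumptions inferred from the output below - adjust them to your setup):

```python
from chunklet import PlainTextChunker  # assumed import path

chunker = PlainTextChunker(verbose=True)

chunks = chunker.chunk(
    sample_text,
    max_sentences=2,   # assumed value; tune to taste
    lang="auto",
    overlap_percent=0,
    offset=0,
)

for chunk in chunks:
    print(f"--- Chunk {chunk.metadata['chunk_num']} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
```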
- `verbose=True` turns on the chatty mode - you'll see detailed logging of internal processes like language detection. (Default is quiet mode!)
- `lang="auto"` lets us detect the language automatically. Super convenient, but specifying a known language like `lang="en"` can boost accuracy and speed.
- `overlap_percent=0` means no overlap between chunks. By default, we add 20% overlap to keep your text flowing smoothly across chunks.
- `offset=0` starts us from the very beginning of the text. (Zero-based indexing - because programmers love starting from zero!)
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 73)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (74, 259)}
Content: It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (260, 370)}
Content: ## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (371, 529)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
### Different Strategies
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (531, 748)}
Content: There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
--- Chunk 6 ---
Metadata: {'chunk_num': 6, 'span': (749, 786)}
Content: ---
## Advanced Chunking Techniques
--- Chunk 7 ---
Metadata: {'chunk_num': 7, 'span': (788, 995)}
Content: Beyond basic splitting, advanced techniques involve understanding the document's structure.
For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
--- Chunk 8 ---
Metadata: {'chunk_num': 8, 'span': (996, 1066)}
Content: This section will delve into such methods.
### Overlap Considerations
--- Chunk 9 ---
Metadata: {'chunk_num': 9, 'span': (1068, 1264)}
Content: To ensure smooth transitions between chunks, an overlap mechanism is often employed.
This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.
--- Chunk 10 ---
Metadata: {'chunk_num': 10, 'span': (749, 763)}
Content: ---
# Conclusion
--- Chunk 11 ---
Metadata: {'chunk_num': 11, 'span': (1285, 1459)}
Content: In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Experiment with different constraints to find the optimal strategy for your needs.
Enable Verbose Logging
To see detailed logging during the chunking process, you can set the `verbose` parameter to `True` when initializing the `PlainTextChunker`:
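For example (import path assumed):

```python
from chunklet import PlainTextChunker  # assumed import path

chunker = PlainTextChunker(verbose=True)  # detailed logs during chunking
```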
Chunking by Tokens: Token Budget Master!
Token Counter Requirement
When using the `max_tokens` constraint, a `token_counter` function is essential. This function, which you provide, should accept a string and return an integer representing its token count. Failing to provide a `token_counter` will result in a `MissingTokenCounterError`.
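Here's what that might look like (the import path and the `max_tokens` value are assumptions; the `word_counter` helper is a deliberately naive stand-in):

```python
from chunklet import PlainTextChunker  # assumed import path

def word_counter(text: str) -> int:
    # Naive whitespace "tokenizer" - swap in your LLM's real tokenizer in practice.
    return len(text.split())

chunker = PlainTextChunker(token_counter=word_counter)

for chunk in chunker.chunk(sample_text, max_tokens=50, lang="en"):  # value assumed
    print(f"--- Chunk {chunk.metadata['chunk_num']} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
```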
- Initializes `PlainTextChunker` with a custom `word_counter` function. This function will be used to count tokens when `max_tokens` is used.
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 291)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (256, 573)}
Content: ...
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
### Different Strategies
There are several strategies for chunking,
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (531, 880)}
Content: There are several strategies for chunking,
including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
---
## Advanced Chunking Techniques
Beyond basic splitting, advanced techniques involve understanding the document's structure.
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (808, 1153)}
Content: ... advanced techniques involve understanding the document's structure.
For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.
### Overlap Considerations
To ensure smooth transitions between chunks, an overlap mechanism is often employed.
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1109, 1377)}
Content: ... an overlap mechanism is often employed.
This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.
---
# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
--- Chunk 6 ---
Metadata: {'chunk_num': 6, 'span': (1296, 1459)}
Content: ... mastering chunking is key to unlocking the full potential of your text data.
Experiment with different constraints to find the optimal strategy for your needs.
Overriding the token_counter
You can also provide the `token_counter` directly within the `chunk` method call (e.g., `chunker.chunk(..., token_counter=my_tokenizer_function)`). If a `token_counter` is provided in both the constructor and the `chunk` method, the one passed to `chunk` takes precedence.
Chunking by Section Breaks: Structure Superhero!
This constraint is useful for documents structured with Markdown headings or thematic breaks.
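A sketch, continuing with the `sample_text` from earlier (the `max_section_breaks` value is an assumption inferred from the output below):

```python
from chunklet import PlainTextChunker  # assumed import path

chunker = PlainTextChunker()

for chunk in chunker.chunk(sample_text, max_section_breaks=2, lang="en"):
    print(f"--- Chunk {chunk.metadata['chunk_num']} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
```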
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 503)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.
## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (371, 753)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own,
which is crucial for downstream applications.
### Different Strategies
There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.
---
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (678, 1038)}
Content: Each method has its own advantages depending on the specific use case.
---
## Advanced Chunking Techniques
Beyond basic splitting, advanced techniques involve understanding the document's structure.
For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (890, 1269)}
Content: ... preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.
### Overlap Considerations
To ensure smooth transitions between chunks, an overlap mechanism is often employed.
This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.
---
--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1239, 1459)}
Content: ... providing continuity.
---
# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Experiment with different constraints to find the optimal strategy for your needs.
Adding Base Metadata
You can pass a base_metadata dictionary to the chunk method. This metadata will be included in the metadata of each chunk. For example: chunker.chunk(..., base_metadata={"source": "my_document.txt"}). For more details on metadata structure and available fields, see the Metadata guide.
Combining Multiple Constraints: Mix and Match Magic!
The real power of PlainTextChunker comes from combining multiple constraints. This allows for highly specific and granular control over how your text is chunked. Here are a few examples of how you can combine different constraints.
Token Counter Requirement
Remember, whenever you use the max_tokens constraint, you must provide a token_counter function.
By Sentences and Tokens
This is useful when you want to limit by both the number of sentences and the overall token count, whichever is reached first.
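A quick sketch (the values are illustrative, and the `chunker` is assumed to have been built with a `token_counter`, as shown earlier):

```python
chunks = chunker.chunk(
    sample_text,
    max_sentences=5,  # stop after 5 sentences...
    max_tokens=80,    # ...or 80 tokens, whichever comes first
    lang="en",
)
```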
By Sentences and Section Breaks
This combination is great for ensuring that chunks don't span across too many sections while also keeping the sentence count in check.
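For example (values illustrative):

```python
chunks = chunker.chunk(
    sample_text,
    max_sentences=5,
    max_section_breaks=1,  # never span more than one heading/rule
    lang="en",
)
```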
By Tokens and Section Breaks
A powerful combination for structured documents where you want to respect section boundaries while adhering to a strict token budget.
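For example (values illustrative; a `token_counter` is still required for `max_tokens`):

```python
chunks = chunker.chunk(
    sample_text,
    max_tokens=80,
    max_section_breaks=1,
    lang="en",
)
```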
By Sentences, Tokens, and Section Breaks
For the ultimate level of control, you can combine all three constraints. The chunking will stop as soon as any of the three limits is reached.
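For example (values illustrative):

```python
chunks = chunker.chunk(
    sample_text,
    max_sentences=5,
    max_tokens=80,
    max_section_breaks=1,
    lang="en",
)
```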
Customizing the Continuation Marker
You can customize the continuation marker, which is prepended to clauses that don't fit in the previous chunk. To do this, pass the continuation_marker parameter to the chunker's constructor.
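For example (the marker value here is illustrative, not a library default):

```python
chunker = PlainTextChunker(continuation_marker="[cont.] ")
```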
If you don't want any continuation marker, you can set it to an empty string:
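```python
chunker = PlainTextChunker(continuation_marker="")
```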
Batch Run: Processing Multiple Texts Like a Pro!
While chunk is perfect for single texts, batch_chunk is your power player for processing multiple texts in parallel. It uses a memory-friendly generator so you can handle massive text collections with ease. It shares most arguments with chunk (like max_sentences, max_tokens, lang, etc.), plus some extra parameters for batch management.
Here's an example of how to use batch_chunk:
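A sketch of such a call. The texts are reconstructed from the output below, and the three keyword names `n_jobs`, `on_errors`, and `show_progress` are hypothetical stand-ins for the parameters described next - check the API reference for their real names:

```python
texts = [
    "This is the first document. It has multiple sentences for chunking. "
    "Here is the second document. It is a bit longer to test batch processing effectively. "
    "And this is the third document. Short and sweet, but still part of the batch. "
    "The fourth document. Another one to add to the collection for testing purposes.",
    "Este es el primer documento. Contiene varias frases para la segmentación de texto. "
    "El segundo ejemplo es más extenso. Queremos probar el procesamiento en diferentes idiomas.",
    "Ceci est le premier document. Il est essentiel pour l'évaluation multilingue. "
    "Le deuxième document est court mais important. La variation est la clé.",
]

batch = chunker.batch_chunk(
    texts,
    max_sentences=3,     # value assumed
    lang="auto",
    n_jobs=None,         # hypothetical name: number of parallel processes
    on_errors="raise",   # hypothetical name: error-handling policy
    show_progress=True,  # hypothetical name: progress-bar toggle
)
try:
    for i, chunk in enumerate(batch, start=1):
        print(f"--- Chunk {i} ---")
        print(f"Metadata: {chunk.metadata}")
        print(f"Content: {chunk.content}")
finally:
    batch.close()  # always release the multiprocessing resources
```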
- Specifies the number of parallel processes to use for chunking. The default value is `None` (use all available CPU cores).
- Determines how errors during chunking are handled. If set to `"raise"` (the default), an exception is raised immediately. If set to `"break"`, processing halts and the partial results are returned. If set to `"ignore"`, errors are silently ignored.
- Displays a progress bar during batch processing. The default value is `False`.
Click to show output
0%| | 0/3 [00:00<?, ?it/s]
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 97)}
Content: This is the first document.
It has multiple sentences for chunking.
Here is the second document.
--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (96, 202)}
Content: It is a bit longer to test batch processing effectively.
And this is the third document.
Short and sweet,
--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (186, 253)}
Content: Short and sweet,
but still part of the batch.
The fourth document.
--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (252, 311)}
Content: Another one to add to the collection for testing purposes.
--- Chunk 5 ---
Metadata: {'chunk_num': 1, 'span': (0, 118)}
Content: Este es el primer documento.
Contiene varias frases para la segmentación de texto.
El segundo ejemplo es más extenso.
--- Chunk 6 ---
Metadata: {'chunk_num': 2, 'span': (117, 173)}
Content: Queremos probar el procesamiento en diferentes idiomas.
Chunking ...:  67%|██████████████████████████▋             | 2/3 [00:00, 10.09it/s]
--- Chunk 7 ---
Metadata: {'chunk_num': 1, 'span': (0, 125)}
Content: Ceci est le premier document.
Il est essentiel pour l'évaluation multilingue.
Le deuxième document est court mais important.
--- Chunk 8 ---
Metadata: {'chunk_num': 2, 'span': (125, 149)}
Content: La variation est la clé.
Chunking ...: 100%|████████████████████████████████████████| 2/2 [00:00, 19.88it/s]
Generator Cleanup
When using batch_chunk, it's crucial to ensure the generator is properly closed, especially if you don't iterate through all the chunks. This is necessary to release the underlying multiprocessing resources. The recommended way is to use a try...finally block to call close() on the generator. For more details, see the Troubleshooting guide.
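The pattern looks like this - shown here with a stand-in generator so the mechanics are clear; with the real `batch_chunk` generator the `try...finally` and `close()` calls are identical:

```python
def fake_batch_chunk():
    # Stand-in for batch_chunk: any generator that holds resources open.
    for i in range(10):
        yield f"chunk {i}"

gen = fake_batch_chunk()
consumed = []
try:
    for chunk in gen:
        consumed.append(chunk)
        if len(consumed) == 3:
            break  # stopping early leaves the generator unexhausted...
finally:
    gen.close()  # ...so close() it to release underlying resources
```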
Adding Base Metadata to Batches
Just like with the chunk method, you can pass a base_metadata dictionary to batch_chunk. This is useful for adding common information, like a source filename, to all chunks processed in the batch. For more details on metadata structure and available fields, see the Metadata guide.
Separator: Keeping Your Batches Organized!
The separator parameter lets you add a custom marker that gets yielded after all chunks from a single text are processed. Super handy for batch processing when you want to clearly separate chunks from different source texts.
Quick Note
None won't work as a separator - you'll need something more substantial!
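Here's a sketch of a consuming loop that groups chunks per document (the sentinel value is illustrative, and the input texts are taken from the output below):

```python
SEPARATOR = "<<DOC-END>>"  # any non-None sentinel works; this value is illustrative

batch = chunker.batch_chunk(
    [
        "This is the first document. It has two sentences.",
        "This is the second document. It also has two sentences.",
    ],
    max_sentences=1,
    overlap_percent=0,
    separator=SEPARATOR,
)

doc_num = 0
group = []
try:
    for item in batch:
        if item != SEPARATOR:
            group.append(item)
            continue
        if group:  # avoid processing the empty list at the end if the stream ends with a separator
            doc_num += 1
            print(f"--- Chunks for Document {doc_num} ---")
            for chunk in group:
                print(f"Content: {chunk.content}")
                print(f"Metadata: {chunk.metadata}")
        group = []
finally:
    batch.close()
```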
- Avoids processing an empty trailing list when the stream ends with a separator.
Click to show output
--- Chunks for Document 1 ---
Content: This is the first document.
Metadata: {'chunk_num': 1, 'span': (0, 27)}
Content: It has two sentences.
Metadata: {'chunk_num': 2, 'span': (28, 49)}
--- Chunks for Document 2 ---
Content: This is the second document.
Metadata: {'chunk_num': 1, 'span': (0, 28)}
Content: It also has two sentences.
Metadata: {'chunk_num': 2, 'span': (29, 55)}
API Reference
For complete technical details on the PlainTextChunker class, check out the API documentation.