Plain Text Chunker

Quick Install

pip install chunklet-py

No extra dependencies needed - PlainTextChunker is ready to roll right out of the box! πŸš€

Taming Your Text with Precision

Got a wall of text that's feeling a bit... overwhelming? The PlainTextChunker is your friendly neighborhood text organizer that transforms unruly paragraphs into perfectly sized, context-aware chunks. Perfect for RAG systems, document analysis, and any workflow that needs smart text segmentation with full control over chunk sizes.

Forget dumb splitting - we're talking intelligent segmentation that actually understands context! The PlainTextChunker works hard to preserve meaning and flow, so your chunks don't end up as confusing puzzle pieces.

Ready to bring some order to the chaos? Let's dive in and make your text behave!

Where PlainTextChunker Really Shines

The PlainTextChunker comes packed with smart features that make it your go-to text wrangling sidekick:

  • Flexible Constraint-Based Chunking: Ultimate control over your chunks! Mix and match limits based on sentences, tokens, or Markdown section breaks. Craft exactly the chunk size you need with precision control! 🎯
  • Intelligent Overlap for Context Preservation: Adds smart overlaps between chunks so your text flows smoothly. No more jarring transitions that leave readers scratching their heads!
  • Extensive Multilingual Support: Speaks over 50 languages fluently, thanks to our trusty sentence splitter. Global domination through better text chunking! 🌍
  • Customizable Token Counting: Plug in your own token counter for perfect alignment with different LLMs. Because one size definitely doesn't fit all models!
  • Optimized Parallel Processing: Turbocharges through large texts using multiple processors. Speed demon mode activated! ⚑
  • Memory-Conscious Operation: Handles massive documents efficiently by yielding chunks one at a time. Your RAM will thank you later! πŸ’Ύ

Constraint-Based Chunking: Your Text, Your Rules!

PlainTextChunker lets you call the shots with constraint-based chunking. Mix and match limits to craft the perfect chunk size for your needs. Here's the constraint menu:

  • max_sentences (int, >= 1): Sentence power mode! Tell us how many sentences per chunk, and we'll group them thoughtfully so your ideas flow like a well-written story.
  • max_tokens (int, >= 12): Token budget watcher! We'll carefully pack sentences into chunks while respecting your token limits. If a sentence gets too chatty, we'll politely split it at clause boundaries. 🀐
  • max_section_breaks (int, >= 1): Structure superhero! Limits the number of Markdown section breaks per chunk (headings like ##, horizontal rules ---) to keep your document's organization intact. Your headings stay where they belong!

The PlainTextChunker has two main methods: chunk for single texts and batch_chunk for processing multiple texts at once. chunk returns a list of handy Box objects, while batch_chunk is a memory-friendly generator that yields chunks one by one. Each Box has content (the actual text) and metadata (all the juicy details). Check the Metadata guide for the full scoop!
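
Here's a minimal sketch of what that looks like in practice (the span value in the comment is illustrative):

from chunklet.plain_text_chunker import PlainTextChunker

chunker = PlainTextChunker()
chunks = chunker.chunk(text="One sentence. Another sentence.", max_sentences=1)

box = chunks[0]
print(box.content)   # the chunk text, e.g. "One sentence."
print(box.metadata)  # e.g. {'chunk_num': 1, 'span': (0, 13)}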

Quick Note: Constraints Required!

You must specify at least one limit (like max_sentences, max_tokens, or max_section_breaks) when using chunk or batch_chunk. Forget to add one? You'll get an InvalidInputError - but don't worry, it's an easy fix!
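
For example, calling chunk with no constraints at all trips the error (a minimal sketch; we catch it generically here rather than importing the exception class):

from chunklet.plain_text_chunker import PlainTextChunker

chunker = PlainTextChunker()

try:
    chunker.chunk(text="Some text with no limits set.")  # no constraint supplied
except Exception as exc:  # raised as InvalidInputError, per the note above
    print(type(exc).__name__)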

Single Run:

For our examples, we'll use this sample text:

text = """
# Introduction to Chunking

This is the first paragraph of our document. It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization. We aim to break down large documents into manageable, context-rich pieces.

## Why is Chunking Important?

Effective chunking helps in maintaining the semantic coherence of information. It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.

### Different Strategies

There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings. Each method has its own advantages depending on the specific use case.

---

## Advanced Chunking Techniques

Beyond basic splitting, advanced techniques involve understanding the document's structure. For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents. This section will delve into such methods.

### Overlap Considerations

To ensure smooth transitions between chunks, an overlap mechanism is often employed. This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.

---

# Conclusion

In conclusion, mastering chunking is key to unlocking the full potential of your text data. Experiment with different constraints to find the optimal strategy for your needs.
"""

Chunking by Sentences: Sentence Power Mode! πŸ“

Ready to chunk by sentence count? This is perfect when you want predictable, idea-focused chunks. Let's see it in action:

from chunklet.plain_text_chunker import PlainTextChunker

chunker = PlainTextChunker()  # (1)!

chunks = chunker.chunk(
    text=text,
    lang="auto",             # (2)!
    max_sentences=2,
    overlap_percent=0,       # (3)!
    offset=0                 # (4)!
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
    print()
  1. Initializes the chunker with default settings (quiet mode). Pass verbose=True to turn on the chatty mode - you'll see detailed logging of internal processes like language detection.
  2. lang="auto" lets us detect the language automatically. Super convenient, but specifying a known language like lang="en" can boost accuracy and speed.
  3. overlap_percent=0 means no overlap between chunks. By default, we add 20% overlap to keep your text flowing smoothly across chunks.
  4. offset=0 starts us from the very beginning of the text. (Zero-based indexing - because programmers love starting from zero!)
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 73)}
Content: # Introduction to Chunking
This is the first paragraph of our document.

--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (74, 259)}
Content: It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.

--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (260, 370)}
Content: ## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.

--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (371, 529)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.

### Different Strategies

--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (531, 748)}
Content: There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.

--- Chunk 6 ---
Metadata: {'chunk_num': 6, 'span': (749, 786)}
Content: ---

## Advanced Chunking Techniques

--- Chunk 7 ---
Metadata: {'chunk_num': 7, 'span': (788, 995)}
Content: Beyond basic splitting, advanced techniques involve understanding the document's structure.
For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.

--- Chunk 8 ---
Metadata: {'chunk_num': 8, 'span': (996, 1066)}
Content: This section will delve into such methods.

### Overlap Considerations

--- Chunk 9 ---
Metadata: {'chunk_num': 9, 'span': (1068, 1264)}
Content: To ensure smooth transitions between chunks, an overlap mechanism is often employed.
This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.

--- Chunk 10 ---
Metadata: {'chunk_num': 10, 'span': (749, 763)}
Content: ---

# Conclusion

--- Chunk 11 ---
Metadata: {'chunk_num': 11, 'span': (1285, 1459)}
Content: In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Experiment with different constraints to find the optimal strategy for your needs.

Enable Verbose Logging

To see detailed logging during the chunking process, you can set the verbose parameter to True when initializing the PlainTextChunker:

chunker = PlainTextChunker(verbose=True)

Chunking by Tokens: Token Budget Master! πŸͺ™

Token Counter Requirement

When using the max_tokens constraint, a token_counter function is essential. This function, which you provide, should accept a string and return an integer representing its token count. Failing to provide a token_counter will result in a MissingTokenCounterError.

from chunklet.plain_text_chunker import PlainTextChunker

# Simple counter for demonstration purposes
def word_counter(text: str) -> int:
    return len(text.split())

chunker = PlainTextChunker(token_counter=word_counter)         # (1)!

chunks = chunker.chunk(
    text=text,
    lang="auto",
    max_tokens=12,
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
    print()
  1. Initializes PlainTextChunker with a custom word_counter function, which is used to count tokens whenever the max_tokens constraint is applied.
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 291)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.

## Why is Chunking Important?

--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (256, 573)}
Content: ...
## Why is Chunking Important?

Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.

### Different Strategies
There are several strategies for chunking,

--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (531, 880)}
Content: There are several strategies for chunking,
including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.

---

## Advanced Chunking Techniques
Beyond basic splitting, advanced techniques involve understanding the document's structure.

--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (808, 1153)}
Content: ... advanced techniques involve understanding the document's structure.

For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.

### Overlap Considerations
To ensure smooth transitions between chunks, an overlap mechanism is often employed.

--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1109, 1377)}
Content: ... an overlap mechanism is often employed.

This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.

---

# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.

--- Chunk 6 ---
Metadata: {'chunk_num': 6, 'span': (1296, 1459)}
Content: ... mastering chunking is key to unlocking the full potential of your text data.

Experiment with different constraints to find the optimal strategy for your needs.

Overriding token_counter

You can also provide a token_counter directly in the chunk method call (e.g., chunker.chunk(..., token_counter=my_tokenizer_function)). If a token_counter is provided in both the constructor and the chunk method, the one passed to chunk takes precedence.
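
For instance (a quick sketch; char_counter and word_counter are just placeholder counters):

from chunklet.plain_text_chunker import PlainTextChunker

def char_counter(text: str) -> int:
    return len(text)

def word_counter(text: str) -> int:
    return len(text.split())

chunker = PlainTextChunker(token_counter=char_counter)

# For this call only, word_counter overrides the constructor's char_counter.
chunks = chunker.chunk(text=text, max_tokens=12, token_counter=word_counter)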

Chunking by Section Breaks: Structure Superhero! πŸ¦Έβ€β™€οΈ

This constraint is useful for documents structured with Markdown headings or thematic breaks.

chunks = chunker.chunk(
    text=text,
    max_section_breaks=2
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
    print()
Click to show output
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 503)}
Content: # Introduction to Chunking
This is the first paragraph of our document.
It discusses the importance of text segmentation for various NLP tasks, such as RAG systems and summarization.
We aim to break down large documents into manageable, context-rich pieces.

## Why is Chunking Important?
Effective chunking helps in maintaining the semantic coherence of information.
It ensures that each piece of text retains enough context to be meaningful on its own, which is crucial for downstream applications.

--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (371, 753)}
Content: It ensures that each piece of text retains enough context to be meaningful on its own,
which is crucial for downstream applications.

### Different Strategies
There are several strategies for chunking, including splitting by sentences, by a fixed number of tokens, or by structural elements like headings.
Each method has its own advantages depending on the specific use case.

---

--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (678, 1038)}
Content: Each method has its own advantages depending on the specific use case.

---

## Advanced Chunking Techniques
Beyond basic splitting, advanced techniques involve understanding the document's structure.
For instance, preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.

--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (890, 1269)}
Content: ... preserving section breaks can significantly improve the quality of chunks for hierarchical documents.
This section will delve into such methods.

### Overlap Considerations
To ensure smooth transitions between chunks, an overlap mechanism is often employed.
This means that a portion of the previous chunk is included in the beginning of the next, providing continuity.

---

--- Chunk 5 ---
Metadata: {'chunk_num': 5, 'span': (1239, 1459)}
Content: ... providing continuity.

---

# Conclusion
In conclusion, mastering chunking is key to unlocking the full potential of your text data.
Experiment with different constraints to find the optimal strategy for your needs.

Adding Base Metadata

You can pass a base_metadata dictionary to the chunk method. This metadata will be included in the metadata of each chunk. For example: chunker.chunk(..., base_metadata={"source": "my_document.txt"}). For more details on metadata structure and available fields, see the Metadata guide.
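
A minimal sketch, assuming the base_metadata keys are merged into each chunk's metadata alongside chunk_num and span:

chunks = chunker.chunk(
    text=text,
    max_sentences=2,
    base_metadata={"source": "my_document.txt"},
)
print(chunks[0].metadata)  # includes 'source' along with the usual fields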

Combining Multiple Constraints: Mix and Match Magic! 🎭

The real power of PlainTextChunker comes from combining multiple constraints. This allows for highly specific and granular control over how your text is chunked. Here are a few examples of how you can combine different constraints.

Token Counter Requirement

Remember, whenever you use the max_tokens constraint, you must provide a token_counter function.

By Sentences and Tokens

This is useful when you want to limit by both the number of sentences and the overall token count, whichever is reached first.

chunks = chunker.chunk(
    text=text,
    max_sentences=5,
    max_tokens=100
)

By Sentences and Section Breaks

This combination is great for ensuring that chunks don't span across too many sections while also keeping the sentence count in check.

chunks = chunker.chunk(
    text=text,
    max_sentences=10,
    max_section_breaks=2
)

By Tokens and Section Breaks

A powerful combination for structured documents where you want to respect section boundaries while adhering to a strict token budget.

chunks = chunker.chunk(
    text=text,
    max_tokens=256,
    max_section_breaks=1
)

By Sentences, Tokens, and Section Breaks

For the ultimate level of control, you can combine all three constraints. The chunking will stop as soon as any of the three limits is reached.

chunks = chunker.chunk(
    text=text,
    max_sentences=8,
    max_tokens=200,
    max_section_breaks=2
)

Customizing the Continuation Marker

You can customize the continuation marker, which is prepended to clauses that don't fit in the previous chunk. To do this, pass the continuation_marker parameter to the chunker's constructor.

chunker = PlainTextChunker(continuation_marker="[...]")

If you don't want any continuation marker, you can set it to an empty string:

chunker = PlainTextChunker(continuation_marker="")
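
Here's a sketch of the marker in action: with a tight token budget and a word-based counter, an over-long sentence is split at a clause boundary and the carried-over clause starts with the marker. Exact split points depend on the clause detection, so treat this as illustrative:

from chunklet.plain_text_chunker import PlainTextChunker

def word_counter(text: str) -> int:
    return len(text.split())

chunker = PlainTextChunker(
    token_counter=word_counter,
    continuation_marker="[...]",
)

long_text = (
    "This opening sentence keeps going with several clauses, "
    "because it wants to overflow a small token budget, "
    "which forces a split at a clause boundary."
)

for chunk in chunker.chunk(text=long_text, max_tokens=12):
    print(chunk.content)  # continuation chunks should begin with the marker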

Batch Run: Processing Multiple Texts Like a Pro! πŸ“š

While chunk is perfect for single texts, batch_chunk is your power player for processing multiple texts in parallel. It uses a memory-friendly generator so you can handle massive text collections with ease. It shares most arguments with chunk (like max_sentences, max_tokens, lang, etc.), plus some extra parameters for batch management.

Here's an example of how to use batch_chunk:

from chunklet.plain_text_chunker import PlainTextChunker

def word_counter(text: str) -> int:
    return len(text.split())

EN_TEXT = "This is the first document. It has multiple sentences for chunking. Here is the second document. It is a bit longer to test batch processing effectively. And this is the third document. Short and sweet, but still part of the batch. The fourth document. Another one to add to the collection for testing purposes."
ES_TEXT = "Este es el primer documento. Contiene varias frases para la segmentaciΓ³n de texto. El segundo ejemplo es mΓ‘s extenso. Queremos probar el procesamiento en diferentes idiomas."
FR_TEXT = "Ceci est le premier document. Il est essentiel pour l'évaluation multilingue. Le deuxième document est court mais important. La variation est la clé."

# Initialize PlainTextChunker
chunker = PlainTextChunker(token_counter=word_counter)

chunks = chunker.batch_chunk(
    texts=[EN_TEXT, ES_TEXT, FR_TEXT],
    max_sentences=5,
    max_tokens=20,
    n_jobs=2,                    # (1)!
    on_errors="raise",           # (2)!
    show_progress=True,          # (3)!
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.content}")
    print()
  1. Specifies the number of parallel processes to use for chunking. The default value is None (use all available CPU cores).
  2. Determines how errors during chunking are handled. If set to "raise" (default), an exception is raised immediately. If set to "break", processing halts and any partial results are returned. If set to "ignore", errors are silently skipped.
  3. Display a progress bar during batch processing. The default value is False.
Click to show output
  0%|                                              | 0/3 [00:00<?, ?it/s]
--- Chunk 1 ---
Metadata: {'chunk_num': 1, 'span': (0, 97)}
Content: This is the first document.
It has multiple sentences for chunking.
Here is the second document.

--- Chunk 2 ---
Metadata: {'chunk_num': 2, 'span': (96, 202)}
Content: It is a bit longer to test batch processing effectively.
And this is the third document.
Short and sweet,

--- Chunk 3 ---
Metadata: {'chunk_num': 3, 'span': (186, 253)}
Content: Short and sweet,
but still part of the batch.
The fourth document.

--- Chunk 4 ---
Metadata: {'chunk_num': 4, 'span': (252, 311)}
Content: Another one to add to the collection for testing purposes.

--- Chunk 5 ---
Metadata: {'chunk_num': 1, 'span': (0, 118)}
Content: Este es el primer documento.
Contiene varias frases para la segmentaciΓ³n de texto.
El segundo ejemplo es mΓ‘s extenso.

--- Chunk 6 ---
Metadata: {'chunk_num': 2, 'span': (117, 173)}
Content: Queremos probar el procesamiento en diferentes idiomas.

Chunking ...: 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž            | 2/3 [00:00, 10.09it/s]
--- Chunk 7 ---
Metadata: {'chunk_num': 1, 'span': (0, 125)}
Content: Ceci est le premier document.
Il est essentiel pour l'Γ©valuation multilingue.
Le deuxième document est court mais important.

--- Chunk 8 ---
Metadata: {'chunk_num': 2, 'span': (125, 149)}
Content: La variation est la clΓ©.

Chunking ...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00, 19.88it/s]

Generator Cleanup

When using batch_chunk, it's crucial to ensure the generator is properly closed, especially if you don't iterate through all the chunks. This is necessary to release the underlying multiprocessing resources. The recommended way is to use a try...finally block to call close() on the generator. For more details, see the Troubleshooting guide.
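
A minimal sketch of that pattern:

gen = chunker.batch_chunk(texts=[EN_TEXT, ES_TEXT], max_sentences=2)
try:
    first_chunk = next(gen)  # consume only part of the stream
    print(first_chunk.content)
finally:
    gen.close()  # release the underlying multiprocessing resources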

Adding Base Metadata to Batches

Just like with the chunk method, you can pass a base_metadata dictionary to batch_chunk. This is useful for adding common information, like a source filename, to all chunks processed in the batch. For more details on metadata structure and available fields, see the Metadata guide.
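
For example (a short sketch; the "source" value is just an illustration):

chunks = chunker.batch_chunk(
    texts=[EN_TEXT, ES_TEXT, FR_TEXT],
    max_sentences=3,
    base_metadata={"source": "docs_corpus"},
)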

Separator: Keeping Your Batches Organized! πŸ“‹

The separator parameter lets you add a custom marker that gets yielded after all chunks from a single text are processed. Super handy for batch processing when you want to clearly separate chunks from different source texts.

Quick Note

None won't work as a separator - you'll need something more substantial!

from chunklet.plain_text_chunker import PlainTextChunker
from more_itertools import split_at

chunker = PlainTextChunker()
texts = [
    "This is the first document. It has two sentences.",
    "This is the second document. It also has two sentences."
]
custom_separator = "---END_OF_DOCUMENT---"

chunks_with_separators = chunker.batch_chunk(
    texts,
    max_sentences=1,
    separator=custom_separator,
    show_progress=False,
)

# Group the stream into per-document chunk lists using split_at
chunk_groups = split_at(chunks_with_separators, lambda x: x == custom_separator)
for i, doc_chunks in enumerate(chunk_groups):
    if doc_chunks: # (1)!
        print(f"--- Chunks for Document {i+1} ---")
        for chunk in doc_chunks:
            print(f"Content: {chunk.content}")
            print(f"Metadata: {chunk.metadata}")
        print()
  1. Skips the empty trailing group that split_at yields when the stream ends with a separator
Click to show output
--- Chunks for Document 1 ---
Content: This is the first document.
Metadata: {'chunk_num': 1, 'span': (0, 27)}
Content: It has two sentences.
Metadata: {'chunk_num': 2, 'span': (28, 49)}

--- Chunks for Document 2 ---
Content: This is the second document.
Metadata: {'chunk_num': 1, 'span': (0, 28)}
Content: It also has two sentences.
Metadata: {'chunk_num': 2, 'span': (29, 55)}

API Reference

For complete technical details on the PlainTextChunker class, check out the API documentation.