chunklet

Chunklet: Advanced Text, Code, and Document Chunking for LLM Applications

A comprehensive library for semantic text segmentation, interactive chunk visualization, and multi-format document processing. Split content intelligently across 50+ languages, visualize chunks in real time, and handle various file types with flexible, context-aware chunking strategies.

Key Features:

Sentence splitting: Multilingual text segmentation across 50+ languages
Semantic chunking: PlainTextChunker, DocumentChunker, and CodeChunker
Interactive visualization: Web-based chunk exploration and parameter tuning
Multi-format support: Text, code, PDF, DOCX, EPUB, and more
Batch processing: Memory-optimized generators with flexible error handling

Note

PlainTextChunker was merged into DocumentChunker in v2.2.0. Use DocumentChunker.chunk_text() or DocumentChunker.chunk_texts() instead.

Modules:

base_chunker – Base Chunker Abstract Class
cli
code_chunker
common
document_chunker
exceptions
sentence_splitter
visualizer

Classes:

CallbackError – Raised when a callback function provided to a chunker or splitter fails during execution.
ChunkletError – Base exception for chunking and splitting operations.
FileProcessingError – Raised when a file cannot be loaded, opened, or accessed.
InvalidInputError – Raised when one or more invalid inputs are encountered.
MissingTokenCounterError – Raised when a token_counter is required but not provided.
TokenLimitError – Raised when max_tokens constraint is exceeded.
UnsupportedFileTypeError – Raised when a file type is not supported for a given operation.

CallbackError

Bases: ChunkletError

Raised when a callback function provided to a chunker or splitter fails during execution.
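
The wrapping behaviour described above could look roughly like the sketch below. It is an illustration only: the classes are local stand-ins that mirror the documented bases, and `call_token_counter` and `broken_counter` are hypothetical helpers, not part of chunklet. How the library actually wraps callback failures is not shown on this page.

```python
class ChunkletError(Exception):
    """Local stand-in mirroring the documented base exception."""


class CallbackError(ChunkletError):
    """Local stand-in mirroring the documented CallbackError."""


def call_token_counter(token_counter, text):
    """Hypothetical helper: run a user callback and wrap any failure."""
    try:
        return token_counter(text)
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise CallbackError(f"token_counter failed: {exc}") from exc


def broken_counter(text):
    """Hypothetical callback that always fails."""
    raise ValueError("tokenizer not loaded")


caught = None
try:
    call_token_counter(broken_counter, "hello world")
except CallbackError as err:
    caught = err  # err.__cause__ holds the original ValueError
```

Chaining with `raise ... from` keeps the callback's own traceback attached, which helps when the failure originates in user code rather than in the chunker.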

ChunkletError

Bases: Exception

Base exception for chunking and splitting operations.
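
Because every exception on this page derives from ChunkletError, handlers can be ordered from most specific to most general. The sketch below rebuilds the hierarchy from the "Bases" entries on this page using local stand-in classes so that it runs without the library installed; `classify` is a hypothetical helper. With chunklet installed, the real classes would presumably be imported from the exceptions module listed above.

```python
# Local stand-ins that mirror the documented "Bases" relationships.
class ChunkletError(Exception):
    """Base exception for chunking and splitting operations."""


class CallbackError(ChunkletError):
    """A user-supplied callback failed."""


class FileProcessingError(ChunkletError):
    """A file could not be loaded, opened, or accessed."""


class UnsupportedFileTypeError(FileProcessingError):
    """A file type is not supported for the operation."""


class InvalidInputError(ChunkletError):
    """One or more inputs were invalid."""


class MissingTokenCounterError(InvalidInputError):
    """A token_counter was required but not provided."""


class TokenLimitError(ChunkletError):
    """The max_tokens constraint was exceeded."""


def classify(exc):
    """Hypothetical helper: show which handler catches a given exception."""
    try:
        raise exc
    except FileProcessingError:
        return "file problem"  # also catches UnsupportedFileTypeError
    except InvalidInputError:
        return "bad input"  # also catches MissingTokenCounterError
    except ChunkletError:
        return "other chunklet failure"
```

Catching ChunkletError last acts as a catch-all for library failures while still letting unrelated exceptions propagate.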

FileProcessingError

Bases: ChunkletError

Raised when a file cannot be loaded, opened, or accessed.

InvalidInputError

Bases: ChunkletError

Raised when one or more invalid inputs are encountered.

MissingTokenCounterError

MissingTokenCounterError(msg: str = '')

Bases: InvalidInputError

Raised when a token_counter is required but not provided.

Source code in src/chunklet/exceptions.py

def __init__(self, msg: str = ""):
    self.msg = msg or (
        "A token_counter is required for token-based chunking.\n"
        "💡 Hint: Pass a token counting function to the chunking method, like `chunker.chunk_text(..., token_counter=tk)`\n"
        "or configure it in the class initialization: `.*Chunker(token_counter=tk)`"
    )
    super().__init__(self.msg)
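
As a quick illustration of the constructor above, the sketch below places it in a local stand-in class and shows the default and custom messages. `InvalidInputError` here is a stand-in base, and `simple_token_counter` is a hypothetical example of the kind of function the hint calls `tk`. The exact callable signature chunklet expects is an assumption (a function from `str` to `int`).

```python
class InvalidInputError(Exception):
    """Local stand-in for the documented base class."""


class MissingTokenCounterError(InvalidInputError):
    """Stand-in reproducing the constructor shown above."""

    def __init__(self, msg: str = ""):
        # An empty msg falls back to the built-in hint text.
        self.msg = msg or (
            "A token_counter is required for token-based chunking.\n"
            "💡 Hint: Pass a token counting function to the chunking method, like `chunker.chunk_text(..., token_counter=tk)`\n"
            "or configure it in the class initialization: `.*Chunker(token_counter=tk)`"
        )
        super().__init__(self.msg)


def simple_token_counter(text: str) -> int:
    """Hypothetical counter: treats whitespace-separated words as tokens."""
    return len(text.split())


default_error = MissingTokenCounterError()
custom_error = MissingTokenCounterError("no counter configured")
```

A word-based counter like this is only a rough proxy. A counter backed by the tokenizer of the target model would give counts that match the model's actual limits.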

TokenLimitError

Bases: ChunkletError

Raised when max_tokens constraint is exceeded.

UnsupportedFileTypeError

Bases: FileProcessingError

Raised when a file type is not supported for a given operation.

This can happen if:
- The file extension is not in the supported list
- The file has no extension
- The processor returns an iterable (requires batch processing)
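
The first two conditions above can be sketched as a simple extension check. This is a minimal illustration, not the library's implementation: the exception classes are local stand-ins, `check_extension` is a hypothetical helper, and `SUPPORTED_EXTENSIONS` is an assumed set based on the formats named in the feature list. The third condition (a processor that returns an iterable) depends on library internals and is not modelled here.

```python
from pathlib import Path


class FileProcessingError(Exception):
    """Local stand-in for the documented base class."""


class UnsupportedFileTypeError(FileProcessingError):
    """Local stand-in for the documented exception."""


# Assumed set for illustration; the library's actual list may differ.
SUPPORTED_EXTENSIONS = {".txt", ".pdf", ".docx", ".epub"}


def check_extension(path: str) -> str:
    """Hypothetical helper: return the normalized extension or raise."""
    suffix = Path(path).suffix.lower()
    if not suffix:
        raise UnsupportedFileTypeError(f"{path!r} has no extension")
    if suffix not in SUPPORTED_EXTENSIONS:
        raise UnsupportedFileTypeError(f"{suffix!r} is not a supported file type")
    return suffix
```

Since UnsupportedFileTypeError derives from FileProcessingError, a caller that already handles FileProcessingError will also catch these cases.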