chunklet

Chunklet: The v2.0.0 Evolution - Multi-strategy, Context-aware, Multilingual Text & Code Chunker

This package provides a robust and flexible solution for splitting large texts and code into smaller, manageable chunks. It is designed for applications such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) pipelines, and other context-aware Natural Language Processing (NLP) tasks.

Version 2.0.0 introduces a revamped architecture with:

- Dedicated chunkers: PlainTextChunker (formerly Chunklet), DocumentChunker, and CodeChunker.
- Expanded language support (50+ languages) and improved error handling.
- Flexible batch processing with on_errors parameter and memory-optimized generators.
- Enhanced modularity, extensibility, and performance.

Classes:

CallbackError

Bases: ChunkletError

Raised when a callback function provided to chunker or splitter fails during execution.

ChunkletError

Bases: Exception

Base exception for chunking and splitting operations.

FileProcessingError

Bases: ChunkletError

Raised when a file cannot be loaded, opened, or accessed.

InvalidInputError

Bases: ChunkletError

Raised when one or more invalid inputs are encountered.

MissingTokenCounterError

MissingTokenCounterError(msg: str = '')

Bases: InvalidInputError

Raised when a token_counter is required but not provided.

Source code in src/chunklet/exceptions.py
def __init__(self, msg: str = ""):
    self.msg = msg or (
        "A token_counter is required for token-based chunking.\n"
        "💡 Hint: Pass a token counting function to the `chunk` method, like `chunker.chunk(..., token_counter=tk)`\n"
        "or configure it in the class initialization: `.*Chunker(token_counter=tk)`"
    )
    super().__init__(self.msg)
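
The fallback in this constructor can be checked in isolation. The sketch below does not import chunklet; it re-declares stand-in classes that mirror the hierarchy documented on this page, and the default message is abbreviated to its first sentence. Treat it as an illustration of the `msg or default` pattern rather than the package's exact code.

```python
# Stand-in classes mirroring the documented hierarchy (assumption: these are
# local re-declarations for illustration, not imports from chunklet).
class ChunkletError(Exception):
    """Base exception for chunking and splitting operations."""


class InvalidInputError(ChunkletError):
    """Raised when invalid input is encountered."""


class MissingTokenCounterError(InvalidInputError):
    """Raised when a token_counter is required but not provided."""

    def __init__(self, msg: str = ""):
        # An empty message falls back to the default hint (abbreviated here).
        self.msg = msg or "A token_counter is required for token-based chunking."
        super().__init__(self.msg)


default_error = MissingTokenCounterError()
custom_error = MissingTokenCounterError("No token counter configured for this run.")

print(default_error.msg)  # the default hint
print(custom_error.msg)   # the caller-supplied message
```

Because the class derives from InvalidInputError, a handler written for invalid input in general also catches a missing token counter.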

TokenLimitError

Bases: ChunkletError

Raised when max_tokens constraint is exceeded.

UnsupportedFileTypeError

Bases: FileProcessingError

Raised when a file type is not supported for a given operation.
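
The "Bases" entries above form a small hierarchy, which determines which `except` clause handles which error. The following sketch re-declares that hierarchy with local stand-in classes (an assumption made so the example runs without the package installed) and uses a hypothetical helper, `classify`, to show that handlers should be ordered from most specific to most general.

```python
# Local stand-ins that mirror the documented "Bases" relationships.
class ChunkletError(Exception): ...
class CallbackError(ChunkletError): ...
class FileProcessingError(ChunkletError): ...
class UnsupportedFileTypeError(FileProcessingError): ...
class InvalidInputError(ChunkletError): ...
class TokenLimitError(ChunkletError): ...


def classify(error: Exception) -> str:
    """Hypothetical helper: report which handler catches the given error."""
    try:
        raise error
    except FileProcessingError:
        # Also catches UnsupportedFileTypeError, its subclass.
        return "file problem"
    except InvalidInputError:
        return "bad input"
    except ChunkletError:
        # Catch-all for the remaining errors in the hierarchy.
        return "other chunking failure"


print(classify(UnsupportedFileTypeError("unsupported extension")))
print(classify(InvalidInputError("empty text")))
print(classify(TokenLimitError("chunk too large")))
```

Placing the `ChunkletError` handler last matters: if it came first, it would swallow every more specific error before the other clauses could run.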