Chunklet-py Docs

“One library to split them all: Sentence, Code, Docs.”

Hey! Welcome to the Chunklet-py docs. Let's make some text chunking magic happen.


Why Smart Chunking? (Or: Why Not Just Split on Character Count?)

You could split your text by character count or random line breaks. But that's like trying to cut a wedding cake with a chainsaw. 🎂

Dumb splitting causes problems:

  • Mid-sentence surprises: Your thoughts get chopped mid-way, losing all meaning
  • Language confusion: Non-English text and code structures get treated the same as plain English prose
  • Lost context: Each chunk forgets what came before
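
To see the first problem in action, here is a tiny sketch of fixed-width splitting in plain Python. The `naive_split` helper is made up for this illustration and is not part of Chunklet-py.

```python
def naive_split(text: str, size: int) -> list[str]:
    """Cut text into fixed-width pieces, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "Chunking matters. Bad splits lose meaning. Good splits keep it."
pieces = naive_split(text, 25)

for piece in pieces:
    # Each piece is exactly 25 characters (except possibly the last),
    # so the cuts land wherever the counter runs out, even mid-word.
    print(repr(piece))
```

Nothing is lost when the pieces are joined back together, but no single piece is a complete thought on its own.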

Smart chunking solves this by:

  • Smart limits: Respects both natural boundaries (sentences, paragraphs, sections) AND configurable limits (tokens, lines, functions)
  • Language-aware: Detects language automatically and applies the right rules (50+ languages supported)
  • Context preservation: Overlap between chunks, rich metadata (source, span, document structure)
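
As a rough illustration of those ideas (not Chunklet-py's actual implementation or API), the sketch below groups sentences into overlapping chunks and records each chunk's character span. The `Chunk` class, `sentence_spans`, and `chunk_sentences` are hypothetical names, and the regex is a naive English-only stand-in for real language-aware sentence splitting.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    span: tuple[int, int]  # (start, end) character offsets in the source text


def sentence_spans(text: str) -> list[tuple[int, int]]:
    """Find sentence offsets with a naive rule: text up to ., ! or ? (or end of input)."""
    return [m.span() for m in re.finditer(r"\S[^.!?]*(?:[.!?]+|$)", text)]


def chunk_sentences(text: str, max_sentences: int = 3, overlap: int = 1) -> list[Chunk]:
    """Group whole sentences into chunks that share `overlap` sentences with their neighbor."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    spans = sentence_spans(text)
    step = max_sentences - overlap  # how far the window advances each time
    chunks = []
    for i in range(0, len(spans), step):
        window = spans[i:i + max_sentences]
        start, end = window[0][0], window[-1][1]
        chunks.append(Chunk(text[start:end], (start, end)))
        if i + max_sentences >= len(spans):
            break  # the last sentence is already covered
    return chunks


source = "One. Two. Three. Four. Five."
chunks = chunk_sentences(source, max_sentences=2, overlap=1)
for chunk in chunks:
    print(chunk.span, repr(chunk.text))
```

Every chunk starts and ends on a sentence boundary, repeats one sentence from its neighbor for context, and can be traced back to the source through its span.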

🤔 So What's Chunklet-py Anyway? (And Why Should You Care?)

Chunklet-py is a developer-friendly text splitting library designed to be the most versatile chunking solution for devs, researchers, and AI engineers. It goes way beyond basic character counting. I built this because I was tired of terrible chunking options. Chunklet-py intelligently chunks text, documents, and code into meaningful, context-aware pieces, perfect for RAG pipelines and LLM applications.

Key features:

  • Composable constraints: Mix and match limits (sentences, tokens, sections) to get exactly the chunks you need
  • Pluggable architecture: Swap in custom tokenizers, sentence splitters, or processors
  • Rich metadata: Every chunk comes with source references, spans, and structural info
  • Multi-format support: PDF, DOCX, EPUB, Markdown, HTML, LaTeX, ODT, CSV, Excel, and plain text
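
To make composable limits and a pluggable token counter a bit more concrete, here is a minimal standalone sketch. It does not use Chunklet-py's API: `group_sentences` and `whitespace_token_counter` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
from typing import Callable


def whitespace_token_counter(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


def group_sentences(
    sentences: list[str],
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = whitespace_token_counter,
) -> list[str]:
    """Pack sentences into chunks, closing a chunk when EITHER limit would be exceeded."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = token_counter(sentence)
        hit_sentence_limit = len(current) >= max_sentences
        hit_token_limit = current_tokens + n > max_tokens
        if current and (hit_sentence_limit or hit_token_limit):
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        # A single sentence larger than max_tokens still gets its own chunk;
        # this sketch never splits inside a sentence.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six.", "Seven eight nine ten."]

by_words = group_sentences(sentences, max_sentences=2, max_tokens=4)
# Swap the counter: here every character counts as one "token".
by_chars = group_sentences(sentences, max_sentences=2, max_tokens=20, token_counter=len)
print(by_words)
print(by_chars)
```

Because the counter is just a callable, any tokenizer that can report a length for a string could be dropped in without touching the grouping logic.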

Available tools:

  • SentenceSplitter: Lightweight sentence tokenization
  • DocumentChunker: Natural language with semantic boundaries
  • CodeChunker: Language-aware code chunking
  • ChunkVisualizer: Interactive web-based exploration

Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.

  • Blazingly Fast

    Leverages efficient parallel processing to chunk large volumes of content with remarkable speed.

  • Featherlight Footprint

    Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead.

  • Rich Metadata for RAG

    Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications.

  • Infinitely Customizable

    Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors.

  • Multilingual Mastery

    Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms.

  • Code-Aware Intelligence

    Language-agnostic code chunking that understands and preserves the structural integrity of your source code.

  • Precision Chunking

    Flexible chunking with configurable limits based on sentences, tokens, sections, lines, and functions.

  • Triple Interface: CLI, Library & Web

    Use it as a command-line tool, import as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning.


How Does Chunklet-py Stack Up?

Wondering how we compare to other chunking tools? Here's the quick comparison:

| Library | Key Differentiator | Focus |
|---|---|---|
| chunklet-py | All-in-one, lightweight, multilingual, language-agnostic with specialized algorithms. | Text, Code, Docs |
| LangChain | Full LLM framework with basic splitters (e.g., RecursiveCharacterTextSplitter, Markdown, HTML, code splitters). Good for prototyping but basic for complex docs or multilingual needs. | Full Stack |
| Chonkie | All-in-one pipeline (chunking + embeddings + vector DB). Uses tree-sitter for code. Multilingual. | Pipelines |
| Semchunk | Text-only, fast semantic splitting. Built-in tiktoken/HuggingFace support. 85% faster than alternatives. | Text |
| CintraAI Code Chunker | Code-specific, uses tree-sitter. Initially supports Python, JS, CSS only. | Code |

Chunklet-py is a specialized, drop-in replacement for the chunking step in any RAG pipeline. It handles text, documents, and code without heavy dependencies, while keeping your project lightweight.

Ready? Let's Go!

Pick your path:

  • Installation: Get Chunklet-py running in minutes
  • CLI Fan? The command line interface is perfect for quick tasks.
  • Code Ninja? Want to integrate chunking into your Python projects?

The Full Tour

Curious about all the features?

Stay in the Loop

Want to keep up with Chunklet-py's latest adventures?

  • What's New: Discover all the exciting new features and improvements in Chunklet 2.2.0.
  • Migration Guide: Learn how to smoothly transition from previous versions to Chunklet 2.x.x.

  • Changelog: See what's new, what's fixed, and what's been improved in recent versions.

Project Details & Join the Fun

For the behind-the-scenes info and if you're thinking of contributing: