
Welcome to the Chunklet-py Documentation!




What is Chunklet, Anyway? (And Why Should You Care?)

So, you've got a mountain of text and a tiny little pickaxe? Fear not! You've stumbled upon the official (and slightly quirky) documentation for Chunklet-py, your new heavy machinery for demolishing text into perfectly sized, context-aware chunks.

At its core, Chunklet is a smart text chunking utility. Whether you're preparing data for Large Language Models (LLMs), building a Retrieval-Augmented Generation (RAG) system, or just need to break down a long document, Chunklet has your back. It handles the messy business of text segmentation so you don't have to.

Chunklet is a Python library for multilingual, context-aware text chunking optimized for large language model (LLM) and retrieval-augmented generation (RAG) pipelines. It splits long documents into manageable segments while preserving semantic boundaries, enabling efficient indexing, embedding, and inference.

Did You Know?

💡 Tip: Chunklet's `overlap_percent` works at the clause level, not just sentence or token boundaries! This means it intelligently preserves semantic flow across chunks, making your LLMs smarter and your RAG pipelines more effective.
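
To make the idea of clause-level overlap concrete, here is a minimal sketch in plain Python. It is not Chunklet's implementation or API: the helper names `split_clauses` and `overlap_tail` are hypothetical, and the clause splitter is a simple punctuation heuristic assumed for illustration only.

```python
import re


def split_clauses(sentence: str) -> list[str]:
    """Split a sentence into rough clauses at commas, semicolons, and colons."""
    # Hypothetical heuristic: break on whitespace that follows clause punctuation,
    # so the punctuation stays attached to the clause it ends.
    parts = re.split(r"(?<=[,;:])\s+", sentence.strip())
    return [p for p in parts if p]


def overlap_tail(chunk_sentences: list[str], overlap_percent: float) -> list[str]:
    """Return the trailing clauses of a chunk that would be carried into the next one."""
    clauses = [c for s in chunk_sentences for c in split_clauses(s)]
    # Carry over a percentage of the clauses, rounded down, never the whole chunk.
    keep = min(int(len(clauses) * overlap_percent / 100), len(clauses) - 1)
    return clauses[len(clauses) - keep:] if keep > 0 else []


previous_chunk = [
    "The model was trained on web text, which is noisy, and then fine-tuned.",
    "Evaluation used held-out data; results were stable.",
]
carried = overlap_tail(previous_chunk, overlap_percent=40)
print(carried)
```

Here the previous chunk holds five clauses, so a 40 percent overlap carries the last two into the next chunk. Working at the clause level means the carried text can be shorter than a full sentence while still ending on a natural boundary.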

Why Bother with Fancy Chunking?

Look, you could just split your text by character count or paragraphs. But let's be honest, that's like performing surgery with a butter knife. Standard splitting methods often:

  • Commit literary butchery: They'll chop sentences right in the middle of a thought.
  • Get lost in translation: They don't care about the rules of non-English languages.
  • Have the memory of a goldfish: They forget the context of the previous chunk, leaving you with a mess of disconnected ideas.

Chunklet is the smart surgeon. It understands the structure of your text, using fancy tricks like clause-level overlapping to keep the meaning intact. It's like a linguistic artist, carefully preserving the masterpiece that is your data.
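
The butter-knife problem is easy to reproduce. The sketch below compares a fixed-width character split with a split that groups whole sentences. Both functions are hypothetical stand-ins written for this comparison, not Chunklet code, and the regex sentence splitter is deliberately simplistic.

```python
import re


def split_by_chars(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str, max_sentences: int) -> list[str]:
    """Group whole sentences so that no chunk ends mid-thought."""
    # Simplistic regex splitter for illustration; it will trip on abbreviations
    # and on languages that do not use these punctuation marks.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


text = "Chunking matters. Bad splits lose meaning. Good splits keep it."
naive_chunks = split_by_chars(text, 25)
sentence_chunks = split_by_sentences(text, 2)
print(naive_chunks)
print(sentence_chunks)
```

The character split ends its first chunk partway through a word, while the sentence-based split always ends on sentence punctuation. Note that even the second version still has no overlap between chunks and no awareness of other languages, which are the gaps the list above describes.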

🤔 Why Chunklet?

| Feature | Why it's elite |
|---|---|
| ⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap, rare even in commercial stacks. |
| 🌐 Multilingual Fallbacks | Pysbd > SentenceSplitter > Regex, with dynamic confidence detection. |
| ➿ Clause-Level Overlap | `overlap_percent` operates at the clause level, preserving semantic flow across chunks. |
| ⚡ Parallel Batch Processing | Efficient parallel processing with `ThreadPoolExecutor`, optimized for low overhead on small batches. |
| ♻️ LRU Caching | Smart memoization via `functools.lru_cache`. |
| 🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
| ✂️ Pluggable Sentence Splitters | Integrate custom splitters for more specific languages. |
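
Three of the features above (pluggable token counters, LRU caching, and thread-based batch processing) rest on standard-library building blocks, so they can be sketched without Chunklet at all. The example below is an assumption-laden illustration, not the library's API: `whitespace_counter`, `make_cached_counter`, and `count_batch` are hypothetical names invented here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Callable

# Any callable that maps text to a token count can be plugged in.
TokenCounter = Callable[[str], int]


def whitespace_counter(text: str) -> int:
    """Stand-in token counter; a real pipeline might plug in a model tokenizer."""
    return len(text.split())


def make_cached_counter(counter: TokenCounter, maxsize: int = 1024) -> TokenCounter:
    """Wrap any counter with functools.lru_cache so repeated texts are memoized."""
    return lru_cache(maxsize=maxsize)(counter)


def count_batch(texts: list[str], counter: TokenCounter, workers: int = 4) -> list[int]:
    """Count tokens for many texts in parallel; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(counter, texts))


cached = make_cached_counter(whitespace_counter)
counts = count_batch(["one two three", "four five", "one two three"], cached)
print(counts)  # [3, 2, 3]
```

Because the counter is just a callable, swapping in a different tokenizer means passing a different function. The cache wrapper and the thread pool do not need to know which one is in use.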

Ready to Dive In?

Here's how to get your hands dirty:

The Grand Tour

Wanna know what's under the hood?

## Keeping Up-to-Date

Stay informed about Chunklet's evolution:

  • Changelog: See what's new, what's fixed, and what's been improved in recent versions.
  • Benchmarks: Curious about performance? Check out how Chunklet stacks up.

## Project Information & Contributing

For the serious stuff (and if you want to join the fun):