Welcome to the Chunklet-py Documentation!

“One library to split them all: Sentence, Code, Docs.”

Hey there! Welcome to the Chunklet-py docs. We're stoked you're here - let's make some text chunking magic happen together! ✨

Why Smart Chunking? (Or: Why Not Just Split on Character Count?)

You might be wondering: "Can't I just split my text by character count or random line breaks?" Well, sure you could... but that's like trying to cut a wedding cake with a chainsaw! 🎂 Standard methods often give you:

Mid-sentence surprises: Your carefully crafted thoughts get chopped right in the middle, losing all meaning
Language confusion: Non-English text and source code get split by the same rules as English prose
Lost context: Each chunk forgets what came before, like a conversation where everyone has amnesia

Smart chunking keeps your content's meaning and structure intact!
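To make the difference concrete, here is a minimal sketch in plain Python. It does not use Chunklet-py at all: `split_by_chars` and `split_by_sentences` are hypothetical helpers written for this illustration, and the regex is a deliberately simplified stand-in for a real sentence splitter.

```python
import re

text = (
    "Chunking matters for retrieval. A split in the wrong place hides the meaning. "
    "Sentence boundaries are a safer place to cut."
)


def split_by_chars(text: str, size: int) -> list[str]:
    """Naive approach: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str) -> list[str]:
    """Simplified stand-in (an assumption, not Chunklet-py's algorithm):
    cut after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


naive_chunks = split_by_chars(text, 40)
sentence_chunks = split_by_sentences(text)

for chunk in naive_chunks:
    print(repr(chunk))      # several chunks stop mid-sentence
for chunk in sentence_chunks:
    print(repr(chunk))      # every chunk is a complete sentence
```

The fixed-width version loses nothing character-wise, but most of its chunks start or stop in the middle of a thought. The sentence-based version keeps each unit readable on its own.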

🤔 So What's Chunklet-py Anyway? (And Why Should You Care?)

Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content - from plain text to PDFs to source code - and breaks them into smart, context-aware chunks. Instead of dumb splitting, we give you specialized tools:

Sentence Splitter
Plain Text Chunker
Document Chunker
Code Chunker
Chunk Visualizer (Interactive web interface)

Each tool is designed to keep your content's meaning and structure intact, plus we've got an interactive visualizer so you can see your chunks in real time.

Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.

Blazingly Fast: Leverages efficient parallel processing to chunk large volumes of content with remarkable speed.
Featherlight Footprint: Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead.
Rich Metadata for RAG: Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications.
Infinitely Customizable: Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors.
Multilingual Mastery: Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms.
Code-Aware Intelligence: Language-agnostic code chunking that understands and preserves the structural integrity of your source code.
Precision Chunking: Flexible constraint-based chunking allows you to combine limits based on sentences, tokens, sections, lines, and functions.
Triple Interface (CLI, Library & Web): Use it as a command-line tool, import it as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning.
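If "constraint-based chunking" and "pluggable token counters" sound abstract, the sketch below shows the general idea in plain Python. It is not Chunklet-py's API or implementation: `Chunk`, `word_count`, and `group_sentences` are hypothetical names invented for this illustration, and only two of the limits (sentences and tokens) are modeled. Each chunk also carries a character span, as a rough picture of what span metadata can look like.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    text: str
    span: tuple[int, int]  # (start, end) character offsets into the source


def word_count(text: str) -> int:
    """Default token counter: whitespace-separated words. Any callable works."""
    return len(text.split())


def group_sentences(
    source: str,
    sentences: list[str],
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = word_count,
) -> list[Chunk]:
    """Greedily pack sentences into chunks until either limit would be exceeded."""
    chunks: list[Chunk] = []
    count = tokens = 0          # sentences and tokens in the chunk being built
    start = end = cursor = 0    # character offsets into `source`
    for sentence in sentences:
        n = token_counter(sentence)
        # Close the current chunk if adding this sentence breaks a limit.
        if count and (count >= max_sentences or tokens + n > max_tokens):
            chunks.append(Chunk(source[start:end], (start, end)))
            count = tokens = 0
        pos = source.index(sentence, cursor)
        if count == 0:
            start = pos
        end = cursor = pos + len(sentence)
        count += 1
        tokens += n
    if count:
        chunks.append(Chunk(source[start:end], (start, end)))
    return chunks


source = "One two three. Four five. Six seven eight nine. Ten."
sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]

chunks = group_sentences(source, sentences, max_sentences=2, max_tokens=6)
for chunk in chunks:
    print(chunk.span, repr(chunk.text))
```

Swapping `word_count` for a tokenizer-backed counter changes how the token limit is measured without touching the grouping logic, which is the point of making the counter pluggable.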

Ready to Get Started? Let's Make Some Chunks! 🚀

Welcome aboard! You're about to turn unruly walls of text into neat, manageable chunks. No more text-wrangling nightmares - Chunklet-py has your back!

Here's your quick start guide:

Installation: Get Chunklet-py running in minutes - seriously, it's that easy!
Pick Your Path: Got a preferred way of working? We've got you covered:
- CLI Fan? Love the terminal and instant results? The command line interface is perfect for quick tasks and scripting.
  - Check out CLI Usage
- Code Ninja? Want to integrate chunking into your Python projects? The library approach gives you full control.
  - Explore Programmatic Usage

Whatever you choose, we're here to make chunking smooth and maybe even a little fun. Let's do this!

How Does Chunklet-py Stack Up?

Wondering how we compare to other chunking tools? Chunklet-py brings a unique mix of versatility, speed, and simplicity. Here's the quick comparison:

| Library | Key Differentiator | Focus |
| --- | --- | --- |
| chunklet-py | All-in-one, lightweight, and language-agnostic with specialized algorithms. | Text, Code, Docs |
| CintraAI Code Chunker | Relies on `tree-sitter`, which can add setup complexity. | Code |
| Chonkie | A feature-rich pipeline tool with cloud/vector integrations, but uses a more basic sentence splitter and `tree-sitter` for code. | Pipelines, Integrations |
| code_chunker (JimAiMoment) | Uses basic regex and rules with limited language support. | Code |
| Semchunk | Primarily for text, using a general-purpose sentence splitter. | Text |

Chunklet-py uses smart rule-based approaches that skip heavy dependencies (looking at you, tree-sitter!) and potential compatibility headaches. Our sentence splitting uses specialized algorithms for better accuracy, and the interactive visualizer lets you tweak settings in real time. Perfect for projects that want power, flexibility, and a lightweight footprint.

The Full Tour

Curious about all the features?

Supported Languages: See which languages Chunklet speaks fluently.
Exceptions and Warnings: Because sometimes, things go wrong. Here's what to do when they do.
Metadata: Understand the rich context Chunklet attaches to your chunks.
Troubleshooting: Solutions to common issues you might encounter.

Stay in the Loop

Want to keep up with Chunklet-py's latest adventures?

What's New: Discover all the exciting new features and improvements in Chunklet 2.1.0.
Migration Guide: Learn how to smoothly transition from previous versions to Chunklet 2.x.x.
Changelog: See what's new, what's fixed, and what's been improved in recent versions.

🗺 What's Working & What's Next

Already rocking these features:
- [x] CLI interface for quick chunking
- [x] Document chunking with rich metadata
- [x] Smart code chunking that respects structure
- [x] Interactive web visualizer
- [x] Bonus file formats: ODT, CSV, Excel

Coming soon (we're excited about these!):
- [ ] Even more document formats

Project Details & Join the Fun

For the behind-the-scenes info and if you're thinking of contributing:

GitHub Repository: The main hub for all things Chunklet.
License Information: All the necessary bits and bobs about Chunklet's license.
Contributing: Want to help make Chunklet even better? Find out how you can contribute!