Welcome to the Chunklet-py Documentation!
“One library to split them all: Sentence, Code, Docs.”
Hey there! Welcome to the Chunklet-py docs. We're stoked you're here - let's make some text chunking magic happen together! ✨
Why Smart Chunking? (Or: Why Not Just Split on Character Count?)
You might be wondering: "Can't I just split my text by character count or random line breaks?" Well, sure you could... but that's like trying to cut a wedding cake with a chainsaw! Standard methods often give you:
- Mid-sentence surprises: Your carefully crafted thoughts get chopped right in the middle, losing all meaning
- Language confusion: Non-English text and code structures get treated like they're all the same
- Lost context: Each chunk forgets what came before, like a conversation where everyone has amnesia
Smart chunking keeps your content's meaning and structure intact!
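To make the "mid-sentence surprises" point concrete, here is a minimal sketch in plain Python that compares fixed-width splitting with a toy sentence-aware split. The helper names `split_by_chars` and `split_by_sentences` are made up for this illustration, and the regex is a deliberately simple stand-in: it is not Chunklet-py's API or its splitting algorithm.

```python
import re


def split_by_chars(text: str, size: int) -> list[str]:
    """Naive fixed-width splitting: cuts wherever the character count lands."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_sentences(text: str) -> list[str]:
    """Toy sentence splitter (illustrative only): break after '.', '!' or '?'
    when followed by whitespace. Real text needs far more careful rules."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Chunking matters. Bad splits lose meaning. Good splits keep it."

naive = split_by_chars(text, 25)
smart = split_by_sentences(text)

print(naive)  # ['Chunking matters. Bad spl', 'its lose meaning. Good sp', 'lits keep it.']
print(smart)  # ['Chunking matters.', 'Bad splits lose meaning.', 'Good splits keep it.']
```

The fixed-width version cuts "splits" in half twice, while the sentence-aware version keeps every sentence whole. Even this toy regex would trip over abbreviations, decimals, and languages with different punctuation, which is the gap a dedicated splitter is meant to close.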
So What's Chunklet-py Anyway? (And Why Should You Care?)
Chunklet-py is your friendly neighborhood text splitter that takes all kinds of content - from plain text to PDFs to source code - and breaks it into smart, context-aware chunks. Instead of dumb splitting, we give you specialized tools:
- Sentence Splitter
- Plain Text Chunker
- Document Chunker
- Code Chunker
- Chunk Visualizer (Interactive web interface)
Each tool is designed to keep your content's meaning and structure intact, plus we've got an interactive visualizer so you can see your chunks in real time.
Perfect for prepping data for LLMs, building RAG systems, or powering AI search - Chunklet-py gives you the precision and flexibility you need across tons of formats and languages.
- Blazingly Fast: Leverages efficient parallel processing to chunk large volumes of content with remarkable speed.
- Featherlight Footprint: Designed to be lightweight and memory-efficient, ensuring optimal performance without unnecessary overhead.
- Rich Metadata for RAG: Enriches chunks with valuable, context-aware metadata (source, span, document properties, code AST details) crucial for advanced RAG and LLM applications.
- Infinitely Customizable: Offers extensive customization options, from pluggable token counters to custom sentence splitters and processors.
- Multilingual Mastery: Supports over 50 natural languages for text and document chunking with intelligent detection and language-specific algorithms.
- Code-Aware Intelligence: Language-agnostic code chunking that understands and preserves the structural integrity of your source code.
- Precision Chunking: Flexible constraint-based chunking allows you to combine limits based on sentences, tokens, sections, lines, and functions.
- Triple Interface: CLI, Library & Web: Use it as a command-line tool, import it as a library for deep integration, or launch the interactive web visualizer for real-time chunk exploration and parameter tuning.
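As a rough picture of what "constraint-based chunking" with a "pluggable token counter" can mean, here is a minimal sketch in plain Python. It greedily packs sentences into a chunk until adding one more would exceed either a sentence limit or a token limit. The function `chunk_sentences`, its parameters, and the whitespace-based counter are all hypothetical names invented for this example; they are assumptions for illustration, not Chunklet-py's actual API or algorithm.

```python
from typing import Callable


def whitespace_token_counter(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_sentences(
    sentences: list[str],
    max_sentences: int,
    max_tokens: int,
    token_counter: Callable[[str], int] = whitespace_token_counter,
) -> list[str]:
    """Greedy packing: start a new chunk when adding the next sentence
    would break either the sentence limit or the token limit."""
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = current + [sentence]
        over_sentences = len(candidate) > max_sentences
        over_tokens = token_counter(" ".join(candidate)) > max_tokens
        if current and (over_sentences or over_tokens):
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            # Fits both limits, or the chunk is empty and must take the sentence.
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]

# Word-based counting: at most 2 sentences and 5 "tokens" per chunk.
by_words = chunk_sentences(sentences, max_sentences=2, max_tokens=5)

# Swap in a different counter (here: character length) without touching the chunker.
by_chars = chunk_sentences(sentences, max_sentences=2, max_tokens=20, token_counter=len)

print(by_words)  # ['One two three. Four five.', 'Six seven eight nine. Ten.']
print(by_chars)  # ['One two three.', 'Four five.', 'Six seven eight nine.', 'Ten.']
```

Passing the counter in as a callable is what makes it "pluggable": the same loop works with a word count, a character count, or a real tokenizer. Note that this sketch emits a single oversized sentence as its own chunk instead of splitting it further, which is a simplification.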
Ready to Get Started? Let's Make Some Chunks!
Welcome aboard! You're about to turn unruly walls of text into neat, manageable chunks. No more text-wrangling nightmares - Chunklet-py has your back!
Here's your quick start guide:
- Installation: Get Chunklet-py running in minutes - seriously, it's that easy!
- Pick Your Path: Got a preferred way of working? We've got you covered:
  - CLI Fan? Love the terminal and instant results? The command line interface is perfect for quick tasks and scripting.
  - Code Ninja? Want to integrate chunking into your Python projects? The library approach gives you full control.
Whatever you choose, we're here to make chunking as smooth and maybe even a little fun. Let's do this!
How Does Chunklet-py Stack Up?
Wondering how we compare to other chunking tools? Chunklet-py brings a unique mix of versatility, speed, and simplicity. Here's the quick comparison:
| Library | Key Differentiator | Focus |
|---|---|---|
| chunklet-py | All-in-one, lightweight, and language-agnostic with specialized algorithms. | Text, Code, Docs |
| CintraAI Code Chunker | Relies on tree-sitter, which can add setup complexity. | Code |
| Chonkie | A feature-rich pipeline tool with cloud/vector integrations, but uses a more basic sentence splitter and tree-sitter for code. | Pipelines, Integrations |
| code_chunker (JimAiMoment) | Uses basic regex and rules with limited language support. | Code |
| Semchunk | Primarily for text, using a general-purpose sentence splitter. | Text |
Chunklet-py uses smart rule-based approaches that skip heavy dependencies (looking at you, tree-sitter!) and the compatibility headaches that can come with them. Our sentence splitting uses specialized algorithms for better accuracy, and the interactive visualizer lets you tweak settings in real time. Perfect for projects that want power, flexibility, and a lightweight footprint.
The Full Tour
Curious about all the features?
- Supported Languages: See which languages Chunklet speaks fluently.
- Exceptions and Warnings: Because sometimes, things go wrong. Here's what to do when they do.
- Metadata: Understand the rich context Chunklet attaches to your chunks.
- Troubleshooting: Solutions to common issues you might encounter.
Stay in the Loop
Want to keep up with Chunklet-py's latest adventures?
- What's New: Discover all the exciting new features and improvements in Chunklet 2.1.0.
- Migration Guide: Learn how to smoothly transition from previous versions to Chunklet 2.x.x.
- Changelog: See what's new, what's fixed, and what's been improved in recent versions.
What's Working & What's Next
Already rocking these features:
- [x] CLI interface for quick chunking
- [x] Document chunking with rich metadata
- [x] Smart code chunking that respects structure
- [x] Interactive web visualizer
- [x] Bonus file formats: ODT, CSV, Excel
Coming soon (we're excited about these!):
- [ ] Even more document formats
Project Details & Join the Fun
For behind-the-scenes info, or if you're thinking of contributing:
- GitHub Repository: The main hub for all things Chunklet.
- License Information: All the necessary bits and bobs about Chunklet's license.
- Contributing: Want to help make Chunklet even better? Find out how you can contribute!