Migrating from v1 to v2: Everything Changed (But You'll Be Fine)
Python Version Bump
v2.x.x dropped Python 3.8 and 3.9. Minimum is now 3.10. Update your env if you're stuck on ancient Python.
So you upgraded to v2 and things broke. That's normal. Let me walk you through what changed and how to fix it.
The Breaking Stuff
Here's what blew up between v1 and v2:
Chunklet is now DocumentChunker
I renamed the main class. Why? Because we now have two chunkers (DocumentChunker and CodeChunker), and calling one of them just "Chunklet" was confusing. Now the names actually make sense.
Fix: Replace Chunklet with DocumentChunker in your imports and wherever you create the chunker.
use_cache is gone
I removed the use_cache flag. It was doing internal stuff that didn't need your attention anyway. Now caching just works without you having to think about it.
Fix: Delete use_cache=False from your calls.
The mode argument is gone
This was confusing. Instead of saying mode="sentence" or mode="hybrid", now you just pass the limits you want. Whatever you pass determines how it chunks. Simple.
What's different:
- No more mode parameter
- No more default values for max_tokens or max_sentences - you have to pick
- New toy: max_section_breaks lets you chunk by headings, horizontal rules (---, ***, ___), and <details> tags
Fix: Stop using mode. Just pass your limits.
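If you have a lot of v1 call sites, the rule above can be sketched as a small keyword-argument rewriter. This is a hypothetical helper (migrate_chunk_kwargs is not part of chunklet). It only assumes what this guide says: mode and use_cache are gone, and the limits no longer have defaults, which I read here as "pass at least one of them".

```python
# Hypothetical helper, not part of chunklet: rewrites v1-style keyword
# arguments into the shape this guide describes for v2.

REMOVED_KWARGS = ("mode", "use_cache")
LIMIT_KWARGS = ("max_tokens", "max_sentences", "max_section_breaks")


def migrate_chunk_kwargs(v1_kwargs: dict) -> dict:
    """Drop removed v1 arguments and insist on at least one explicit limit."""
    v2_kwargs = {k: v for k, v in v1_kwargs.items() if k not in REMOVED_KWARGS}
    if not any(v2_kwargs.get(name) is not None for name in LIMIT_KWARGS):
        raise ValueError(
            "v2 has no default limits: pass at least one of "
            + ", ".join(LIMIT_KWARGS)
        )
    return v2_kwargs


old_call = {"mode": "hybrid", "max_tokens": 256, "max_sentences": 5, "use_cache": False}
new_call = migrate_chunk_kwargs(old_call)
print(new_call)  # {'max_tokens': 256, 'max_sentences': 5}
```

A call that relied on mode alone (say mode="sentence" with no limit) raises here, which is the point: v2 wants you to choose the limit yourself.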
chunk() is now chunk_text()
The method was renamed to be clear about what it takes: strings.
Fix: Rename your chunk() calls to chunk_text().
batch_chunk() is now chunk_texts()
For multiple texts use chunk_texts(). The name actually describes what it does now.
Fix: Rename your batch_chunk() calls to chunk_texts().
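If you can't rename every call site in one go, a thin shim can keep the old method names alive while you migrate. The sketch below is an assumption-heavy illustration, not something chunklet ships: V1CompatChunker and EchoChunker are made-up names, and the shim only assumes the wrapped object has the two v2 methods named in this guide.

```python
import warnings


class V1CompatChunker:
    """Hypothetical shim: exposes the v1 method names on a v2-style chunker.

    It only assumes the wrapped object has chunk_text() and chunk_texts().
    """

    def __init__(self, chunker):
        self._chunker = chunker

    def chunk(self, text, **kwargs):
        warnings.warn("chunk() was renamed to chunk_text()",
                      DeprecationWarning, stacklevel=2)
        return self._chunker.chunk_text(text, **kwargs)

    def batch_chunk(self, texts, **kwargs):
        warnings.warn("batch_chunk() was renamed to chunk_texts()",
                      DeprecationWarning, stacklevel=2)
        return self._chunker.chunk_texts(texts, **kwargs)


class EchoChunker:
    """Stand-in for a v2 chunker so this sketch runs without chunklet installed."""

    def chunk_text(self, text, **kwargs):
        return [text]

    def chunk_texts(self, texts, **kwargs):
        return [[t] for t in texts]


legacy = V1CompatChunker(EchoChunker())
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    single = legacy.chunk("Hello.", max_sentences=1)
    batch = legacy.batch_chunk(["One.", "Two."], max_sentences=1)
```

In real code you would wrap a DocumentChunker instead of EchoChunker. The shim does not touch removed arguments like mode or use_cache, so those still have to be cleaned up at the call site.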
Language detection moved
The old detect_text_language.py file is gone. Language detection now lives inside SentenceSplitter directly. Most people won't notice because it happens automatically, but if you were calling it directly, drop the old import and go through SentenceSplitter instead.
Custom splitters use a registry now
The old custom_splitters parameter in the constructor is gone. Instead, there's a global registry you register with. This means your custom splitters work across all chunker instances, not just one.
Fix: Swap the constructor argument for a registry decorator.

Before (v1):

    import re
    from chunklet import Chunklet
    from typing import List

    # Split on whitespace that follows a period.
    def my_custom_splitter(text: str) -> List[str]:
        return [s.strip() for s in re.split(r'(?<=\.)\s+', text) if s.strip()]

    # v1: splitters were passed to each chunker instance.
    chunker = Chunklet(
        custom_splitters=[
            {
                "name": "MyCustomEnglishSplitter",
                "languages": "en",
                "callback": my_custom_splitter,
            }
        ]
    )

After (v2):

    import re
    from chunklet import DocumentChunker
    from chunklet.sentence_splitter import custom_splitter_registry

    # v2: register once, globally; every chunker instance picks it up.
    @custom_splitter_registry.register("en", name="MyAwesomeEnglishSplitter")
    def my_awesome_splitter(text: str) -> list[str]:
        return [s.strip() for s in re.split(r'[.!?]\s+', text) if s.strip()]

    text = "First sentence. Second sentence! Third one?"
    chunker = DocumentChunker()
    chunks = chunker.chunk_text(text, lang="en", max_sentences=1)
Check the docs for more details.
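To see why a global registry makes a splitter visible to every chunker instance, here is a minimal pure-Python sketch of the decorator-registry pattern. It is not chunklet's implementation: SplitterRegistry, its register() signature, and split() are stand-ins written for illustration, loosely modeled on the decorator call shown above.

```python
import re
from typing import Callable


class SplitterRegistry:
    """Illustrative stand-in, not chunklet's real registry class."""

    def __init__(self) -> None:
        self._splitters: dict[str, tuple[str, Callable[[str], list[str]]]] = {}

    def register(self, *languages: str, name: str):
        """Return a decorator that stores the function under each language code."""
        def decorator(func: Callable[[str], list[str]]):
            for lang in languages:
                self._splitters[lang] = (name, func)
            return func  # the decorated function stays usable on its own
        return decorator

    def split(self, text: str, lang: str) -> list[str]:
        _name, func = self._splitters[lang]
        return func(text)


# One module-level registry, shared by everything that imports it.
registry = SplitterRegistry()


@registry.register("en", name="DemoEnglishSplitter")
def demo_english_splitter(text: str) -> list[str]:
    # Split on whitespace after ., ! or ?, keeping the punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


print(registry.split("Hello there. How are you? Fine!", "en"))
# ['Hello there.', 'How are you?', 'Fine!']
```

Because the registry object lives at module level, anything that looks up "en" gets the same function, no matter which instance asks. That is the design difference from v1, where the splitter list belonged to one constructor call.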
Exception name changes
A couple exceptions got renamed to be less confusing:
- TokenNotProvidedError -> MissingTokenCounterError (clearer about what you forgot)
- Custom callback errors now throw CallbackError instead of generic ChunkletError (so you know it's your code that broke, not ours)
CLI changed
The CLI got a new structure. Instead of just chunklet "text", you now use chunklet chunk "text".
- Text as argument = DocumentChunker
- File with --source = DocumentChunker (handles PDFs, DOCX, etc.)
- Add --code flag = CodeChunker
See CLI docs for the full breakdown.
Automated migration checker
I wrote a script that scans your code for old v1 patterns. It'll point out exactly what needs changing.
curl -O https://raw.githubusercontent.com/speedyk-005/chunklet-py/main/audit_migration.py
python audit_migration.py /path/to/your/project
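For a rough idea of what a check like this involves, here is a minimal pattern-scanner sketch. It is not the audit_migration.py script linked above, whose internals this guide doesn't show. The pattern list is assembled from the changes described in this guide, and V1_PATTERNS and find_v1_patterns are names made up for the example.

```python
import re

# (regex, advice) pairs built from the renames and removals listed in this guide.
V1_PATTERNS = [
    (r"\bChunklet\b", "Chunklet is now DocumentChunker"),
    (r"\buse_cache\s*=", "use_cache was removed"),
    (r"\bmode\s*=", "mode was removed; pass limits instead"),
    (r"\.chunk\s*\(", "chunk() is now chunk_text()"),
    (r"\.batch_chunk\s*\(", "batch_chunk() is now chunk_texts()"),
    (r"\bcustom_splitters\s*=", "use custom_splitter_registry instead"),
    (r"\bdetect_text_language\b", "language detection moved into SentenceSplitter"),
    (r"\bTokenNotProvidedError\b", "renamed to MissingTokenCounterError"),
]


def find_v1_patterns(source: str) -> list[tuple[int, str]]:
    """Return (line_number, advice) for every v1 pattern found in source."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, advice in V1_PATTERNS:
            if re.search(pattern, line):
                hits.append((lineno, advice))
    return hits


sample = '''from chunklet import Chunklet
chunker = Chunklet(use_cache=False)
chunks = chunker.chunk(text, mode="sentence", max_sentences=3)
'''
for lineno, advice in find_v1_patterns(sample):
    print(f"line {lineno}: {advice}")
```

Plain regexes like these will flag unrelated code too (any .chunk( call, any mode= keyword), so treat the hits as hints to review, not as a verdict.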
That's it. Go forth and migrate.