What's on This Page
This page highlights the big features and major changes for each version. For all the nitty-gritty details, bug fixes, and technical improvements, check out our full changelog.
What's New in Chunklet v2.1.0!
Major Features in v2.1.0
- Interactive Chunk Visualizer: Launch a web-based interface for real-time chunk visualization, parameter tuning, and exploring your chunking results interactively!
- CLI Visualize Command: Use chunklet visualize to start the web interface with customizable host, port, and tokenizer options.
- Expanded File Format Support: Added support for ODT files (.odt) and tabular files (.csv and .xlsx) to handle even more document types.
Bug Fixes in v2.1.0
- Code Chunker Issues: Fixed multiple bugs in CodeChunker, including line skipping in oversized blocks, decorator separation, path detection errors, and redundant processing logic.
- CLI Path Validation Bug: Resolved a TypeError raised when len() was called on a PosixPath object. Thanks to @arnoldfranz for reporting this issue.
- Hidden Bugs Uncovered: Adding comprehensive test coverage exposed several previously undetected bugs in the document chunker's batch-processing error handling, all of which are now fixed.
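The CLI path validation bug is an instance of a common pathlib pitfall: path objects define no __len__, so calling len() on one raises TypeError. The snippet below is a minimal sketch of the failure and one possible guard. The has_source helper is hypothetical and is not Chunklet's actual fix.

```python
from pathlib import PurePosixPath

source = PurePosixPath("docs/report.txt")

# Path objects do not implement __len__, so len() raises TypeError.
try:
    len(source)
    len_failed = False
except TypeError:
    len_failed = True


def has_source(value):
    """Hypothetical guard: True when a non-empty str or path was supplied."""
    if value is None:
        return False
    # Convert to str first so the check works for both str and path objects.
    return len(str(value).strip()) > 0
```

Converting to str before measuring keeps a single code path for string and path arguments.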
What's New in Chunklet v2.0.1!
Patch Fixes in v2.0.1
- CLI Bug Fix: Fixed a tricky unpacking bug in the split command that was causing incorrect results. The fix properly separates language detection from sentence splitting for accurate output.
What's New in Chunklet v2.0.3!
Improvements in v2.0.3
- Enhanced Span Detection: Fixed some hardcoded limits and added adaptive calculations for better span detection across different text lengths.
- Improved Regex Performance: Switched from fuzzysearch to optimized regex for faster and more precise span finding.
- Dependency Cleanup: Removed the fuzzysearch dependency to keep things lighter and simpler.
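To illustrate the general idea behind regex-based span finding, here is a minimal sketch that locates a chunk inside its source text while tolerating whitespace differences. It is an assumption-laden illustration, not Chunklet's implementation: the find_span function and its matching strategy are hypothetical.

```python
import re


def find_span(chunk, source):
    """Return the (start, end) offsets of chunk in source, or None if absent.

    Hypothetical helper: words are matched literally, while any run of
    whitespace between them (spaces, newlines) is treated as equivalent.
    """
    words = chunk.split()
    if not words:
        return None
    # Escape each word so punctuation is matched literally, not as regex syntax.
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, source)
    return match.span() if match else None


source_text = "Hello   world.\nThis is\na test."
span = find_span("world. This is a test.", source_text)
```

Unlike fuzzy matching, this approach only succeeds when the words match exactly, which makes the reported offsets precise.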
What's New in Chunklet v2.0.2!
Refinements in v2.0.2
- Code Cleanup: Removed some debug print statements from the SentenceSplitter for cleaner production code.
What's New in Chunklet v2.0.0!
Highlights of v2.0.0
- Class Renaming: The Chunklet class has been renamed to PlainTextChunker for clearer naming. Don't worry about updating your code - our Migration Guide has you covered!
- Continuation marker: Improved the continuation marker logic and exposed its value so you can define your own or disable it entirely.
- Code Chunker Introduction: We're excited to introduce CodeChunker! This new rule-based, language-agnostic chunker provides smart syntax-aware code splitting - perfect for code-related RAG applications.
- Document Chunker Introduction: We're pleased to introduce DocumentChunker! This robust tool handles a wide variety of file formats including PDF, DOCX, TXT, MD, RST, RTF, TEX, HTML, and EPUB files.
- Expanded Language Support: ¡Hola! Bonjour! Namaste! We've expanded from 36+ to over 50 languages thanks to our library integrations and smart fallback mechanisms.
- New Constraint Flags: Added max_section_breaks for PlainTextChunker and DocumentChunker, plus max_lines for CodeChunker - giving you more precise control over chunking.
- Improved Error Handling: Added more specific exception types (like FileProcessingError and CallbackError) and centralized batch error handling for clearer feedback and better control.
- Flexible Batch Error Handling: The new on_errors parameter lets you control what happens when errors occur in batches - you can raise, skip, or break as needed.
- CLI Refactoring: Streamlined the command-line interface with simplified flags and improved batch processing capabilities for a smoother experience.
- Modularity & Extensibility: Made the library more modular with a dedicated SentenceSplitter and a flexible custom splitter registry for easier customization.
- Performance & Memory Optimization: Refactored batch methods to use generators, which drastically reduces memory usage, especially for large documents.
- Caching Strategy Refined: We've gone lean and mean! Removed most in-memory caching to prioritize performance, keeping only count_tokens cached.
- Python 3.8/3.9 Support Dropped: Time marches on, and so do we! Dropped official support for Python 3.8 and 3.9 - minimum version is now 3.10.
- CLI Flags Deprecation (--no-cache, --batch, --mode): Cleaned up the CLI by removing redundant flags for a simpler interface.
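The generator-based batch processing and the on_errors policy described above can be pictured with a small sketch. Everything here is an assumption for illustration: process_batch and its worker argument are hypothetical names, and the exact semantics of raise, skip, and break in Chunklet may differ from what is shown.

```python
def process_batch(items, worker, on_errors="raise"):
    """Lazily yield worker(item) for each item.

    Hypothetical stand-in for a batch method. Because this is a generator,
    results are produced one at a time instead of being collected in a list,
    and the on_errors check runs only once iteration starts.
    """
    if on_errors not in {"raise", "skip", "break"}:
        raise ValueError(f"Unknown on_errors value: {on_errors!r}")
    for item in items:
        try:
            result = worker(item)
        except Exception:
            if on_errors == "raise":
                raise  # propagate the original error to the caller
            if on_errors == "break":
                return  # stop the batch, keeping results already yielded
            continue  # "skip": drop the failing item and keep going
        yield result


def parse(text):
    return int(text)


inputs = ["1", "2", "oops", "4"]
skipped = list(process_batch(inputs, parse, on_errors="skip"))   # [1, 2, 4]
stopped = list(process_batch(inputs, parse, on_errors="break"))  # [1, 2]
```

Yielding results one by one means a caller can start consuming output before the whole batch has been processed, which is where the memory savings come from.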
Curious About Our Journey?
For a complete list of all changes, fixes, and improvements across versions, check out our detailed Changelog - it's got all the technical details!