Skip to content

What's on This Page

This page highlights the big features and major changes for each version. For all the nitty-gritty details, bug fixes, and technical improvements, check out our full changelog.

What's New in Chunklet v2.1.0! πŸŽ‰

✨ Major Features in v2.1.0

  • Interactive Chunk Visualizer: 🌐 Launch a web-based interface for real-time chunk visualization, parameter tuning, and exploring your chunking results interactively!
  • CLI Visualize Command: πŸ’» Use chunklet visualize to start the web interface with customizable host, port, and tokenizer options.
  • Expanded File Format Support: πŸ“ Added support for ODT files (.odt) and tabular files (.csv and .xlsx) to handle even more document types.

πŸ› Bug Fixes in v2.1.0

  • Code Chunker Issues: πŸ”§ Fixed multiple bugs in CodeChunker including line skipping in oversized blocks, decorator separation, path detection errors, and redundant processing logic.
  • CLI Path Validation Bug: Resolved TypeError where len() was called on PosixPath object. Thanks to @arnoldfranz for reporting this issue.
  • Hidden Bugs Uncovered: πŸ•΅οΈβ€β™‚οΈ Adding comprehensive test coverage revealed and fixed multiple hidden bugs in document chunker batch processing error handling that were previously undetected.

What's New in Chunklet v2.0.1! πŸŽ‰

✨ Patch Fixes in v2.0.1

  • CLI Bug Fix: Fixed a tricky unpacking bug in the split command that was causing incorrect results. The fix properly separates language detection from sentence splitting for accurate output.

What's New in Chunklet v2.0.3! πŸŽ‰

✨ Improvements in v2.0.3

  • Enhanced Span Detection: 🧭 Fixed some hardcoded limits and added adaptive calculations for better span detection across different text lengths.
  • Improved Regex Performance: ⚑ Switched from fuzzysearch to optimized regex for faster and more precise span finding.
  • Dependency Cleanup: 🧹 Removed the fuzzysearch dependency to keep things lighter and simpler.

What's New in Chunklet v2.0.2! πŸŽ‰

✨ Refinements in v2.0.2

  • Code Cleanup: 🧹 Removed some debug print statements from the SentenceSplitter for cleaner production code.

What's New in Chunklet v2.0.0! πŸŽ‰

✨ Highlights of v2.0.0

  • Class Renaming: The Chunklet class has been renamed to PlainTextChunker for clearer naming. Don't worry about updating your code - our Migration Guide has you covered!
  • Continuation marker: πŸ“‘ Improved the continuation marker logic and exposed its value so you can define your own or disable it entirely.
  • Code Chunker Introduction: We're excited to introduce CodeChunker! πŸ§‘β€πŸ’» This new rule-based, language-agnostic chunker provides smart syntax-aware code splitting - perfect for code-related RAG applications.
  • Document Chunker Introduction: We're pleased to introduce DocumentChunker! πŸ“„ This robust tool handles a wide variety of file formats including PDF, DOCX, TXT, MD, RST, RTF, TEX, HTML, and EPUB files.
  • Expanded Language Support: Β‘Hola! Bonjour! Namaste! πŸ—£οΈ We've expanded from 36+ to over 50 languages thanks to our library integrations and smart fallback mechanisms.
  • New Constraint Flags: Added max_section_breaks for PlainTextChunker and DocumentChunker, plus max_lines for CodeChunker - giving you more precise control over chunking.
  • Improved Error Handling: Added more specific exception types (like FileProcessingError and CallbackError) and centralized batch error handling for clearer feedback and better control.
  • Flexible Batch Error Handling: The new on_errors parameter lets you control what happens when errors occur in batches - you can raise, skip, or break as needed.
  • CLI Refactoring: Streamlined the command-line interface with simplified flags and improved batch processing capabilities for a smoother experience.
  • Modularity & Extensibility: Made the library more modular with a dedicated SentenceSplitter and flexible custom splitter registry for easier customization.
  • Performance & Memory Optimization: Significant refactoring with generators for batch methods to drastically reduce memory usage, especially for large documents.
  • Caching Strategy Refined: We've gone lean and mean! ♻️ Removed most in-memory caching to prioritize performance, keeping only count_tokens cached.
  • Python 3.8/3.9 Support Dropped: Time marches on, and so do we! πŸ•°οΈ Dropped official support for Python 3.8 and 3.9 - minimum version is now 3.10.
  • CLI Flags Deprecation (--no-cache, --batch, --mode): Cleaned up the CLI by removing redundant flags for a simpler interface.

πŸ—ΊοΈ Curious About Our Journey?

For a complete list of all changes, fixes, and improvements across versions, check out our detailed Changelog - it's got all the technical details!