Exceptions and Warnings: When Chunklet Gets a Bit Grumpy
Even the most robust tools have their moments, and Chunklet is no exception (pun intended!). This page is your guide to understanding the various hiccups and murmurs you might encounter while using Chunklet. Don't worry, most of them are easily fixed, and some are just friendly nudges.
Exceptions: When Things Go Sideways (and Stop)
These are the big ones. When Chunklet throws an exception, it means something went wrong enough to halt the process. But fear not, we'll tell you why!
ChunkletError
This is the grand-daddy of all Chunklet-specific errors. If you see this, it means something fundamental went wrong within Chunklet itself. It's usually a sign that a deeper issue occurred, often related to a custom function you provided.
- Common Scenario: Your custom token_counter function decided to take a coffee break (i.e., it raised an exception).
- What to do: Check your custom token_counter or any other custom callbacks you've provided. Make sure they're robust and handle all possible inputs gracefully.
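One way to keep a custom counter from taking that coffee break is to wrap it defensively. The sketch below is only an illustration: make_safe_counter and fragile_counter are hypothetical names, and falling back to a whitespace word count is our own choice, not something Chunklet prescribes.

```python
from typing import Callable


def make_safe_counter(counter: Callable[[str], int]) -> Callable[[str], int]:
    """Wrap a token counter so that awkward inputs do not raise."""

    def safe_counter(text: str) -> int:
        # Assumption: empty or non-string input counts as zero tokens.
        if not isinstance(text, str) or not text.strip():
            return 0
        try:
            return int(counter(text))
        except Exception:
            # Fallback of our own choosing: a plain whitespace word count.
            return len(text.split())

    return safe_counter


def fragile_counter(text: str) -> int:
    # Deliberately brittle: divides by zero when the text has no period.
    return len(text.split()) // text.count(".")


token_counter = make_safe_counter(fragile_counter)
print(token_counter("One two three. Four five six."))  # 3
print(token_counter("no period here"))  # 3 (fallback path)
```

Whether to swallow the error like this or log it and re-raise is a judgment call; a silent fallback keeps chunking alive but can hide a genuinely broken counter.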
InvalidInputError
Chunklet is a stickler for rules, especially when it comes to your input. This error means you've given Chunklet something it just can't work with.
- Common Scenarios:
  - You tried to initialize Chunklet with some funky custom_splitters that didn't quite fit the mold.
  - Your chunking configuration (like max_tokens or mode) was a bit off.
  - In batch processing, you forgot to provide a list of texts, or you asked for a negative number of parallel jobs (we're good, but not that good).
- What to do: Double-check your input parameters against the documentation. Make sure everything is in the right format and within the expected ranges.
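If you'd rather catch these slip-ups before Chunklet does, a small pre-flight check can help. This is a rough sketch, not Chunklet's own validation: check_chunk_args, the texts and n_jobs parameter names, and the "sentence" entry in allowed_modes are all assumptions made for illustration. Only max_tokens, mode, "token", and "hybrid" appear on this page.

```python
from typing import List, Optional, Sequence


def check_chunk_args(
    texts: object,
    max_tokens: int,
    mode: str,
    n_jobs: Optional[int] = None,
    allowed_modes: Sequence[str] = ("sentence", "token", "hybrid"),  # assumed set
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []

    # Batch input should be a list of strings, not one bare string.
    if not isinstance(texts, (list, tuple)):
        problems.append("texts must be a list of strings")
    elif not all(isinstance(t, str) for t in texts):
        problems.append("every item in texts must be a string")

    # Assumption: a token limit only makes sense as a positive integer.
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        problems.append("max_tokens must be a positive integer")

    if mode not in allowed_modes:
        problems.append(f"mode must be one of {list(allowed_modes)}")

    # The page only says negative job counts are rejected.
    if n_jobs is not None and n_jobs < 0:
        problems.append("n_jobs must not be negative")

    return problems


print(check_chunk_args(["first text", "second text"], 50, "hybrid", 2))  # []
print(check_chunk_args("just one string", -1, "bogus", -3))
```

Collecting every problem into one list, instead of stopping at the first, means you can fix them all in a single pass.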
TokenNotProvidedError
This one's pretty self-explanatory, but we'll explain it anyway. If you're trying to chunk by tokens (or in hybrid mode, which also needs token awareness) but haven't told Chunklet how to count tokens, it'll politely (or not so politely) refuse to proceed.
- Common Scenario: You set mode="token" or mode="hybrid" but didn't provide a token_counter when initializing Chunklet or in your chunk() call.
- What to do: Provide a token_counter function. You can use a simple word counter, or for more accuracy, integrate a library like tiktoken. Check the Programmatic Usage documentation for examples.
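For a rough idea of what such a function can look like, here are two sketches: a regex-based word counter and a tiktoken-backed one. The names word_token_counter and make_tiktoken_counter are hypothetical, and the "cl100k_base" encoding is just one plausible choice. The tiktoken variant assumes the package is installed and imports it lazily, so the word counter works without it.

```python
import re
from typing import Callable


def word_token_counter(text: str) -> int:
    """Approximate tokens as runs of word characters."""
    return len(re.findall(r"\w+", text))


def make_tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a counter backed by tiktoken (third-party, imported on demand)."""
    import tiktoken  # only needed if you call this factory

    encoding = tiktoken.get_encoding(encoding_name)

    def count(text: str) -> int:
        return len(encoding.encode(text))

    return count


print(word_token_counter("Chunklet splits text into tidy chunks."))  # 6
```

Either callable can then be supplied as the token_counter when initializing Chunklet or in your chunk() call, as described above.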
Warnings: Chunklet's Friendly Nudges (and Occasional Grumbles)
Warnings are Chunklet's way of saying, "Hey, I did what you asked, but you might want to know this..." They don't stop the process, but they often indicate something you could optimize or be aware of.
"The language is set to auto. Consider setting the lang parameter to a specific language to improve performance."
- What it means: You've let Chunklet guess the language of your text. While it's pretty good at it, explicitly telling it the language (lang='en', lang='fr', etc.) can sometimes speed things up and improve accuracy, especially for shorter texts.
- What to do: If you know the language of your text, set the lang parameter. If not, no worries, Chunklet will do its best!
"Low confidence in language detected. Detected: '{lang_detected}' with confidence {confidence:.2f}."
- What it means: Chunklet tried its best to detect the language, but it's not super confident about its guess. This often happens with very short texts or texts that mix multiple languages.
- What to do: If you know the language, set the
lang
parameter explicitly. If the text is genuinely ambiguous, just be aware that the sentence splitting might not be perfect.
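If you want to spot these language nudges programmatically, you can record warnings around a call and inspect them afterwards. This sketch assumes the messages are emitted through Python's standard warnings module, which this page does not confirm. noisy_step is a hypothetical stand-in for a Chunklet call and simply reuses the wording quoted above.

```python
import warnings


def noisy_step() -> str:
    # Stand-in for a library call (hypothetical); emits one of the messages above.
    warnings.warn(
        "The language is set to auto. Consider setting the lang parameter "
        "to a specific language to improve performance.",
        UserWarning,
    )
    return "done"


# Record every warning raised inside the block instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = noisy_step()

messages = [str(w.message) for w in caught]
lang_warnings = [m for m in messages if "language" in m.lower()]
print(result, len(lang_warnings))  # done 1
```

If the messages turn out to go through a logger instead, the same idea applies with a logging handler in place of catch_warnings.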
"Language not supported or detected with low confidence. Universal regex splitter was used."
- What it means: Chunklet couldn't find a specialized sentence splitter for your language (or it wasn't confident enough in its detection), so it fell back to its trusty universal regex splitter. This splitter is robust but might not be as linguistically nuanced as the specialized ones.
- What to do: If you need highly accurate sentence splitting for an unsupported language, consider implementing a Custom Splitter. Otherwise, the universal splitter will still get the job done!
"Using --batch with --file is deprecated." (CLI Warning)
- What it means: You're using an older way of batch processing. While it still works, we've introduced a more streamlined approach.
- What to do: For batch processing multiple files, use --input-dir. If you're processing a single file, just provide it directly without --batch. Check the CLI Usage documentation for the latest methods.