Sentence Splitter
The Art of Precise Sentence Splitting ✂️
Splitting text by periods is like trying to perform surgery with a butter knife — it barely works and makes a mess. Abbreviations get misinterpreted, sentences get cut mid-thought, and your NLP models end up confused.
This problem has a name: Sentence Boundary Disambiguation. That's where SentenceSplitter comes in.
Think of it as a skilled linguist who knows where sentences actually end. It handles grammar, context, and those tricky abbreviations (like "Dr." or "U.S.A.") without breaking a sweat. Supports 50+ languages out of the box.
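To make the butter-knife problem concrete, here is a tiny pure-Python sketch (no Chunklet involved, and the sample text is made up) of what happens when every period followed by a space is treated as a sentence boundary:

```python
# Two real sentences, but three abbreviation-style periods along the way.
text = "She met Dr. Smith in the U.S.A. last year. He studies AI."

# Naive approach: treat every ". " as a sentence boundary.
naive = [piece.strip() for piece in text.split(". ") if piece.strip()]

for piece in naive:
    print(repr(piece))
# 'She met Dr'
# 'Smith in the U.S.A'
# 'last year'
# 'He studies AI.'
```

Two sentences come back as four fragments, and the abbreviations lose their trailing periods along the way.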
What's Under the Hood? ⚙️
The SentenceSplitter combines several pieces:
- Multilingual Support 🌍: Handles over 50 languages with intelligent detection. See the full list.
- Custom Splitters 🔧: Plug in your own splitting logic for specialized languages or domains.
- Reliable Fallback 🛡️: For unsupported languages, a rule-based fallback kicks in.
- Error Monitoring 🔍: Reports issues with custom splitters clearly.
- Output Refinement ✨: Removes empty sentences and fixes punctuation.
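To give a feel for what a rule-based fallback with output refinement might look like, here is a minimal sketch. It is not Chunklet's actual fallback: the function name `fallback_split`, the regex, and the tiny abbreviation list are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real splitter needs a far larger, per-language one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "sr.", "u.s.a."}

# Candidate boundary: whitespace that directly follows '.', '!' or '?'.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def fallback_split(text: str) -> list[str]:
    """Split on candidate boundaries, re-joining pieces that end in an abbreviation."""
    sentences: list[str] = []
    buffer = ""
    for piece in BOUNDARY.split(text):
        buffer = f"{buffer} {piece}".strip() if buffer else piece.strip()
        if not buffer:
            continue  # refinement step: never emit empty sentences
        last_word = buffer.split()[-1].lower().lstrip("\"'(")
        if last_word in ABBREVIATIONS:
            continue  # probably not a real boundary, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


print(fallback_split("She met Dr. Smith. He studies AI.  "))
# ['She met Dr. Smith.', 'He studies AI.']
```

A sketch this small has obvious blind spots: a sentence that genuinely ends with an abbreviation ("I live in the U.S.A. It is big.") gets merged with the next one. That kind of ambiguity is why grammar- and context-aware splitting matters.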
Example Usage
Split Text into Sentences
Here's a quick example of how you can use the SentenceSplitter to split a block of text into sentences:
- Auto language detection: Let the splitter automatically detect the language of your text. For best results, specify a language code like "en" or "fr" directly.
Output:
2025-11-02 16:27:29.277 | WARNING | chunklet.sentence_splitter.sentence_splitter:split_text:192 - The language is set to `auto`. Consider setting the `lang` parameter to a specific language to improve reliability.
2025-11-02 16:27:29.316 | INFO | chunklet.sentence_splitter.sentence_splitter:detected_top_language:146 - Language detection: 'en' with confidence 10/10.
2025-11-02 16:27:29.447 | INFO | chunklet.sentence_splitter.sentence_splitter:split_text:166 - Text splitted into sentences. Total sentences detected: 19
She loves cooking.
He studies AI.
"You are a Dr.", she said.
The weather is great.
We play chess.
Books are fun, aren't they?
The Playlist contains:
- two videos
- one image
- one music
Robots are learning.
It's raining.
Let's code.
Mars is red.
Sr. sleep is rare.
Consider item 1.
This is a test.
The year is 2025.
This is a good year since N.A.S.A. reached 123.4 light year more.
Splitting Files: From Document to Sentences 📄
Need to split a file directly into sentences? Use split_file:
Detecting Top Languages 🎯
Here's how you can detect the top language of a given text using the SentenceSplitter:
Output:
Original language: en
Detected language: en with confidence 1.00
--------------------
Original language: fr
Detected language: fr with confidence 1.00
--------------------
Original language: es
Detected language: es with confidence 1.00
--------------------
Original language: de
Detected language: de with confidence 1.00
--------------------
Original language: hi
Detected language: hi with confidence 1.00
--------------------
Custom Sentence Splitter: Your Playground 🎨
Want to bring your own splitting logic? You can plug custom splitter functions into Chunklet! Perfect for specialized languages or domains.
Global Registry Alert!
Custom splitters get registered globally - once you add one, it's available everywhere in your app. Watch out for side effects if you're registering splitters across different parts of your codebase, especially in multi-threaded or long-running applications!
To use a custom splitter, register it with the @registry.register decorator, which lets you attach your function to one or more languages directly. Your custom splitter function must take the text (str) as its only required parameter and return a list[str] of sentences.
Custom Splitter Rules
- Your function must accept exactly one required parameter (the text)
- Optional parameters with defaults are totally fine
- Must return a list of strings
- Empty strings get filtered out automatically
- Lambda functions work if you provide a name parameter
- Errors during splitting will raise a CallbackError
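The rules above can be turned into a quick pre-flight check for your own callbacks. This is only a sketch of how such checks could be written with the standard library: `check_splitter` and `clean_result` are hypothetical helpers, not part of Chunklet, and they raise a plain TypeError rather than the library's CallbackError.

```python
import inspect
from typing import Callable


def check_splitter(callback: Callable) -> None:
    """Raise TypeError unless `callback` has exactly one required parameter."""
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    required = [
        p
        for p in inspect.signature(callback).parameters.values()
        if p.default is inspect.Parameter.empty and p.kind not in variadic
    ]
    if len(required) != 1:
        raise TypeError(
            f"expected exactly one required parameter, found {len(required)}"
        )


def clean_result(result: object) -> list[str]:
    """Validate the return type and drop empty sentences."""
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise TypeError("a splitter must return a list of strings")
    # Assumption: whitespace-only strings are treated as empty too.
    return [s for s in result if s.strip()]


check_splitter(lambda text: [text])                       # one required parameter
check_splitter(lambda text, sep=".": text.split(sep))     # optional extras are fine
print(clean_result(["One.", "", "Two."]))                 # ['One.', 'Two.']
```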
Basic Custom Splitter
Create a custom sentence splitter for a single language using the registry decorator:
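As a rough illustration, here is a splitter function that satisfies the contract: one required text parameter in, a list of strings out. The function name and its splitting rule are invented for this sketch. Registration is only indicated in a comment, because the import path for the registry isn't shown in this section.

```python
import re


# In a real project this function would be registered for a language code,
# e.g. "en", via the @registry.register decorator described above.
def exclamation_splitter(text: str) -> list[str]:
    """Hypothetical splitter: start a new sentence after every '!'."""
    pieces = re.split(r"(?<=!)\s+", text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(exclamation_splitter("Hello! How are you! Fine"))
# ['Hello!', 'How are you!', 'Fine']
```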
Multi-Language Custom Splitter
Register the same splitter function for multiple languages at once:
- This registers the same custom splitter for both French ("fr") and Spanish ("es") languages.
Unregistering Custom Splitters
Remove a registered custom splitter when you no longer need it:
- This will remove the custom splitter associated with the "en" language code. You can also unregister several languages in one call, for example when they were registered with the same function:
registry.unregister("fr", "es")
Skip the Decorator?
Not a fan of decorators? No worries - you can directly use the registry.register() method. Super handy for dynamic registration or when your callback function isn't in the global scope.
Want to Build from Scratch?
Going full custom? Inherit from the BaseSplitter abstract class! It gives you a clear interface (def split(self, text: str, lang: str) -> list[str]) to implement. Your custom splitter will then work seamlessly with DocumentChunker.
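Here is a hedged sketch of what implementing that interface could look like. To keep it runnable on its own, the block defines a local stand-in ABC that mirrors only the split signature quoted above. In real code you would inherit from Chunklet's own BaseSplitter instead. LineSplitter is a made-up example class.

```python
from abc import ABC, abstractmethod


class BaseSplitter(ABC):
    """Local stand-in that mirrors only the interface quoted above."""

    @abstractmethod
    def split(self, text: str, lang: str) -> list[str]:
        ...


class LineSplitter(BaseSplitter):
    """Hypothetical splitter that treats every non-empty line as one sentence."""

    def split(self, text: str, lang: str) -> list[str]:
        # `lang` is accepted to satisfy the interface but not used here.
        return [line.strip() for line in text.splitlines() if line.strip()]


print(LineSplitter().split("two videos\n\n  one image  \none music\n", "en"))
# ['two videos', 'one image', 'one music']
```

A line-per-sentence rule like this could suit logs, subtitles, or poetry, where line breaks carry more meaning than punctuation.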
CustomSplitterRegistry Methods Summary
- splitters: Returns a shallow copy of the dictionary of registered splitters.
- is_registered(lang: str): Checks if a splitter is registered for the given language, returning True or False.
- register(callback: Callable[[str], list[str]] | None = None, *langs: str, name: str | None = None): Registers a splitter callback for one or more languages.
- unregister(*langs: str): Removes splitter(s) from the registry.
- clear(): Clears all registered splitters from the registry.
- split(text: str, lang: str): Processes a text using a splitter registered for the given language, returning a list of sentences and the name of the splitter used.
API Reference
For complete technical details on the SentenceSplitter class, check out the API documentation.