Sentence Splitter
The Art of Precise Sentence Splitting ✂️
Let's be honest, simply splitting text by periods can be a bit like trying to perform delicate surgery with a butter knife – it often leads to more problems than solutions! This approach can result in sentences being cut mid-thought, abbreviations being misinterpreted, and a general lack of clarity that can leave your NLP models scratching their heads.
This common challenge in NLP, known as Sentence Boundary Disambiguation, is precisely what the SentenceSplitter is designed to address.
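To see the problem concretely, here is a small sketch of naive period splitting applied to text that contains an abbreviation and a decimal number. It uses only Python's standard library, and the sample sentence is made up for illustration.

```python
import re

text = "Dr. Smith paid 3.50 dollars. He left at noon."

# Naive approach: treat every period as a sentence boundary.
naive = [part.strip() for part in text.split(".") if part.strip()]
print(naive)
# ['Dr', 'Smith paid 3', '50 dollars', 'He left at noon']

# Slightly better: split only where a period is followed by whitespace
# and an uppercase letter. The decimal survives, but "Dr." is still cut off.
better = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
print(better)
# ['Dr.', 'Smith paid 3.50 dollars.', 'He left at noon.']
```

Even the improved regex misreads the abbreviation, which is the kind of ambiguity a dedicated sentence splitter has to resolve.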
Imagine the SentenceSplitter as a skilled linguistic surgeon. It applies its understanding of grammar and context to make precise cuts, cleanly separating sentences while preserving their original meaning. It's intelligent, multilingual, and essential for preparing clean text data for NLP tasks, LLMs, and any application that needs accurate sentence boundaries.
What's Under the Hood? ⚙️
The SentenceSplitter is more than just a basic rule-based tool; it's a sophisticated system packed with powerful features:
- Multilingual Support 🌍: Handles over 50 languages with intelligent detection and language-specific splitting methods. Check our supported languages for the full list.
- Custom Splitters 🔧: Easily integrate your own custom sentence splitting functions for specialized languages or domains.
- Reliable Fallback 🛡️: For unsupported languages, a robust fallback mechanism ensures effective sentence splitting.
- Error Monitoring 🔍: Actively monitors for issues and provides clear feedback on custom splitter problems.
- Output Refinement ✨: Meticulously cleans the output, removing empty sentences and fixing punctuation issues.
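As an illustration of the fallback and output-refinement ideas in the list above, the sketch below pairs a generic regex split with a cleanup pass. It is not Chunklet's actual implementation: `fallback_split` and `refine` are hypothetical names, and the regex is a deliberately simple stand-in.

```python
import re


def fallback_split(text: str) -> list[str]:
    """Split after '.', '!', '?' or the CJK full stop when whitespace follows."""
    return re.split(r"(?<=[.!?。])\s+", text)


def refine(sentences: list[str]) -> list[str]:
    """Trim whitespace and drop fragments that contain no letters or digits."""
    cleaned = []
    for sentence in sentences:
        sentence = sentence.strip()
        if any(ch.isalnum() for ch in sentence):
            cleaned.append(sentence)
    return cleaned


raw = fallback_split("  First one.   Second one!  ...  Third?  ")
print(refine(raw))
# ['First one.', 'Second one!', 'Third?']
```

The cleanup step removes both the empty trailing fragment and the stray `...`, which is roughly what "removing empty sentences and fixing punctuation issues" means in practice.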
Example Usage
Here's a quick example of how you can use the SentenceSplitter to split a block of text into sentences:
- Auto language detection: Let the splitter automatically detect the language of your text. For best results, specify a language code like "en" or "fr" directly.
Output:
2025-11-02 16:27:29.277 | WARNING | chunklet.sentence_splitter.sentence_splitter:split:136 - The language is set to `auto`. Consider setting the `lang` parameter to a specific language to improve reliability.
2025-11-02 16:27:29.316 | INFO | chunklet.sentence_splitter.sentence_splitter:detected_top_language:109 - Language detection: 'en' with confidence 10/10.
2025-11-02 16:27:29.447 | INFO | chunklet.sentence_splitter.sentence_splitter:split:167 - Text splitted into sentences. Total sentences detected: 19
She loves cooking.
He studies AI.
"You are a Dr.", she said.
The weather is great.
We play chess.
Books are fun, aren't they?
The Playlist contains:
- two videos
- one image
- one music
Robots are learning.
It's raining.
Let's code.
Mars is red.
Sr. sleep is rare.
Consider item 1.
This is a test.
The year is 2025.
This is a good year since N.A.S.A. reached 123.4 light year more.
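A call that yields output of this kind could look like the sketch below. The import path is inferred from the module name in the log lines (`chunklet.sentence_splitter.sentence_splitter`), and the `split(text, lang)` call follows the `BaseSplitter` interface quoted further down this page. The no-argument constructor and the helper name `split_into_sentences` are assumptions, so check the API reference before relying on them.

```python
def split_into_sentences(text: str, lang: str = "en") -> list[str]:
    """Split `text` with Chunklet's SentenceSplitter (hypothetical helper)."""
    # Imported inside the function so the dependency is only needed when
    # the helper is actually called. The path is inferred from the log
    # output above and is an assumption, not a confirmed API.
    from chunklet.sentence_splitter.sentence_splitter import SentenceSplitter

    splitter = SentenceSplitter()  # assumed: works without arguments
    # Passing an explicit language code avoids the "auto" warning.
    return splitter.split(text, lang)


# Usage (requires chunklet to be installed):
# for sentence in split_into_sentences("She loves cooking. He studies AI."):
#     print(sentence)
```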
Detecting Top Languages 🎯
Here's how you can detect the top language of a given text using the SentenceSplitter:
Output:
Original language: en
Detected language: en with confidence 1.00
--------------------
Original language: fr
Detected language: fr with confidence 1.00
--------------------
Original language: es
Detected language: es with confidence 1.00
--------------------
Original language: de
Detected language: de with confidence 1.00
--------------------
Original language: hi
Detected language: hi with confidence 1.00
--------------------
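Output in this format could be produced by a loop built around the sketch below. The method name `detected_top_language` is taken from the log lines shown earlier. That it accepts the text and returns a `(language, confidence)` pair is an assumption based on the printed output, and `report_language` is a hypothetical helper.

```python
def report_language(splitter, expected_lang: str, text: str) -> str:
    """Format one detection report for `text`.

    `splitter` is any object exposing `detected_top_language(text)`.
    Assumed return shape: (language_code, confidence between 0 and 1).
    """
    detected, confidence = splitter.detected_top_language(text)
    return (
        f"Original language: {expected_lang}\n"
        f"Detected language: {detected} with confidence {confidence:.2f}\n"
        + "-" * 20
    )


# Usage with the real class (import path inferred from the logs, not confirmed):
# from chunklet.sentence_splitter.sentence_splitter import SentenceSplitter
# print(report_language(SentenceSplitter(), "fr", "Bonjour tout le monde."))
```

Passing the splitter in as an argument keeps the helper easy to exercise with a stub object in tests.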
Custom Sentence Splitter: Your Sentence Splitting Playground 🎨
Want to bring your own sentence splitting magic? You can plug your custom splitter functions into Chunklet! This is perfect for specialized languages or domains where you want to prioritize your custom logic over our built-in splitters.
Global Registry Alert!
Custom splitters get registered globally - once you add one, it's available everywhere in your app. Watch out for side effects if you're registering splitters across different parts of your codebase, especially in multi-threaded or long-running applications!
To use a custom splitter, you leverage the @registry.register decorator. This decorator allows you to register your function for one or more languages directly. Your custom splitter function must accept a single text parameter (str) and return a list[str] of sentences.
Custom Splitter Rules
- Your function must accept exactly one required parameter (the text)
- Optional parameters with defaults are totally fine
- Must return a list of strings
- Empty strings get filtered out automatically
- Lambda functions work if you provide a name parameter
- Errors during splitting will raise a CallbackError
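The first three rules can be illustrated with a small sketch. The splitter below takes one required argument plus an optional one, and `has_single_required_param` shows one way such a signature rule could be checked with the standard `inspect` module. Both names are hypothetical, and this is not necessarily how Chunklet validates callbacks.

```python
import inspect
from typing import Callable


def line_splitter(text: str, strip: bool = True) -> list[str]:
    """Treat each line as a sentence: one required parameter, one optional."""
    lines = text.splitlines()
    return [line.strip() for line in lines] if strip else lines


def has_single_required_param(func: Callable) -> bool:
    """Return True if `func` has exactly one parameter without a default."""
    required = [
        p
        for p in inspect.signature(func).parameters.values()
        if p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    ]
    return len(required) == 1


print(has_single_required_param(line_splitter))        # True
print(has_single_required_param(lambda a, b: [a, b]))  # False
print(line_splitter("First line. \n Second line."))    # ['First line.', 'Second line.']
```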
Basic Custom Splitter
Multi-Language Custom Splitter
- This registers the same custom splitter for both French ("fr") and Spanish ("es") languages.
Unregistering Custom Splitters
- This will remove the custom splitter associated with the "en" language code. Note that you can unregister multiple languages if you had registered them with the same function:
registry.unregister("fr", "es")
Skip the Decorator?
Not a fan of decorators? No worries - you can directly use the registry.register() method. Super handy for dynamic registration or when your callback function isn't in the global scope.
Want to Build from Scratch?
Going full custom? Inherit from the BaseSplitter abstract class! It gives you a clear interface (def split(self, text: str, lang: str) -> list[str]) to implement. Your custom splitter will then work seamlessly with PlainTextChunker (docs) or DocumentChunker (docs).
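A subclass might be sketched as follows. To keep the example runnable without Chunklet installed, it defines a local stand-in `BaseSplitter` with the same `split` signature quoted above; in real code you would import the class from Chunklet instead. `ParagraphSplitter` is a hypothetical example.

```python
import re
from abc import ABC, abstractmethod


class BaseSplitter(ABC):
    """Local stand-in mirroring the interface quoted above."""

    @abstractmethod
    def split(self, text: str, lang: str) -> list[str]:
        ...


class ParagraphSplitter(BaseSplitter):
    """Treats each block separated by a blank line as one sentence."""

    def split(self, text: str, lang: str) -> list[str]:
        # `lang` is accepted to satisfy the interface but unused here.
        blocks = re.split(r"\n\s*\n", text)
        # Collapse internal whitespace and skip blocks that are empty.
        return [" ".join(block.split()) for block in blocks if block.strip()]


splitter = ParagraphSplitter()
print(splitter.split("First block,\nstill first.\n\nSecond block.", lang="en"))
# ['First block, still first.', 'Second block.']
```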
CustomSplitterRegistry Methods Summary
- splitters: Returns a shallow copy of the dictionary of registered splitters.
- is_registered(lang: str): Checks if a splitter is registered for the given language, returning True or False.
- register(callback: Callable[[str], list[str]] | None = None, *langs: str, name: str | None = None): Registers a splitter callback for one or more languages.
- unregister(*langs: str): Removes splitter(s) from the registry.
- clear(): Clears all registered splitters from the registry.
- split(text: str, lang: str): Processes a text using a splitter registered for the given language, returning a list of sentences and the name of the splitter used.
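To make the method summary concrete, here is a self-contained stand-in that mirrors the listed signatures. `MiniSplitterRegistry` is hypothetical and is not Chunklet's implementation: the way it stores names, tells the decorator form apart from a direct call, and handles missing languages are all assumptions made for illustration.

```python
from typing import Callable, Optional

Splitter = Callable[[str], list[str]]


class MiniSplitterRegistry:
    """Stand-in mirroring the method list above (hypothetical)."""

    def __init__(self) -> None:
        self._splitters: dict[str, tuple[str, Splitter]] = {}

    @property
    def splitters(self) -> dict[str, tuple[str, Splitter]]:
        return dict(self._splitters)  # shallow copy

    def is_registered(self, lang: str) -> bool:
        return lang in self._splitters

    def register(self, callback=None, *langs: str, name: Optional[str] = None):
        # Decorator form: register("fr", "es") passes a language code first.
        if isinstance(callback, str):
            langs = (callback, *langs)
            callback = None

        def decorator(func: Splitter) -> Splitter:
            label = name or func.__name__
            for lang in langs:
                self._splitters[lang] = (label, func)
            return func

        return decorator if callback is None else decorator(callback)

    def unregister(self, *langs: str) -> None:
        for lang in langs:
            self._splitters.pop(lang, None)

    def clear(self) -> None:
        self._splitters.clear()

    def split(self, text: str, lang: str) -> tuple[list[str], str]:
        label, func = self._splitters[lang]
        # Drop empty strings, as the custom splitter rules describe.
        return [s for s in func(text) if s.strip()], label


registry = MiniSplitterRegistry()


@registry.register("fr", "es", name="BangSplitter")
def split_on_bang(text: str) -> list[str]:
    return [part.strip() for part in text.split("!")]


# Direct call instead of the decorator; a lambda needs an explicit name.
registry.register(lambda text: text.split("|"), "xx", name="PipeSplitter")

print(registry.split("Hola! Adios!", "es"))  # (['Hola', 'Adios'], 'BangSplitter')
print(registry.split("a|b||c", "xx"))        # (['a', 'b', 'c'], 'PipeSplitter')
registry.unregister("fr", "es")
print(registry.is_registered("fr"))          # False
print(sorted(registry.splitters))            # ['xx']
```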
API Reference
For complete technical details on the SentenceSplitter class, check out the API documentation.