chunklet.sentence_splitter
Modules:

- languages – Contains the language sets for the supported sentence splitters.
- registry – Provides CustomSplitterRegistry for registering custom splitter callbacks.
- sentence_splitter – Provides BaseSplitter and SentenceSplitter.
Classes:

- BaseSplitter – Abstract base class for sentence splitting.
- CallbackError – Raised when a callback function provided to chunker or splitter fails during execution.
- CustomSplitterRegistry – Registry of custom splitter callbacks, keyed by language code.
- FallbackSplitter – Rule-based, language-agnostic sentence boundary detector.
- SentenceSplitter – A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

Functions:

- pretty_errors – Formats Pydantic validation errors into a human-readable string.
- validate_input – A decorator that validates function inputs and outputs.
BaseSplitter
Bases: ABC
Abstract base class for sentence splitting. Defines the interface that all sentence splitter implementations must adhere to.
Methods:

- split – Splits the given text into a list of sentences.
split
abstractmethod
Splits the given text into a list of sentences.
Parameters:

- text (str) – The input text to be split.
- lang (str) – The language of the text (e.g., 'en', 'fr', 'auto').

Returns:

- list[str] – A list of sentences extracted from the text.
Examples:
>>> class MySplitter(BaseSplitter):
...     def split(self, text: str, lang: str) -> list[str]:
...         return [s for s in text.split(".") if s]
>>> splitter = MySplitter()
>>> splitter.split("Hello. World.", "en")
['Hello', ' World']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
CallbackError
Bases: ChunkletError
Raised when a callback function provided to chunker or splitter fails during execution.
CustomSplitterRegistry
Methods:

- clear – Clears all registered splitters from the registry.
- is_registered – Check if a splitter is registered for the given language.
- register – Register a splitter callback for one or more languages.
- split – Processes a text using a splitter registered for the given language.
- unregister – Remove splitter(s) from the registry.

Attributes:

- splitters – Returns a shallow copy of the dictionary of registered splitters.
splitters
property
Returns a shallow copy of the dictionary of registered splitters.
This prevents external modification of the internal registry state.
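The shallow-copy pattern behind this property can be sketched as follows (a minimal illustrative class; `MiniRegistry` is a hypothetical stand-in, not chunklet's actual implementation):

```python
from typing import Callable

class MiniRegistry:
    """Illustrative stand-in for CustomSplitterRegistry."""

    def __init__(self) -> None:
        self._splitters: dict[str, Callable[[str], list[str]]] = {}

    @property
    def splitters(self) -> dict[str, Callable[[str], list[str]]]:
        # Return a shallow copy so callers cannot mutate internal state.
        return dict(self._splitters)

registry = MiniRegistry()
registry._splitters["en"] = str.split

snapshot = registry.splitters
snapshot["fr"] = str.split  # mutates only the copy

assert "fr" not in registry.splitters
assert "en" in registry.splitters
```

Because each access returns a fresh copy, callers can inspect the registered splitters freely without risking corruption of the registry itself.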
clear
is_registered
register
Register a splitter callback for one or more languages.
This method can be used in two ways:

1. As a decorator:
   @registry.register("en", "fr", name="my_splitter")
   def my_splitter(text): ...
2. As a direct function call:
   registry.register(my_splitter, "en", "fr", name="my_splitter")
Parameters:

- *args (Any, default: ()) – The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.
- name (str, default: None) – The name of the splitter. If None, attempts to use the callback's name.
Source code in src/chunklet/sentence_splitter/registry.py
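The dual decorator/direct-call dispatch can be sketched with a minimal stand-in class (illustrative only; `MiniRegistry` is hypothetical and omits the `name` bookkeeping of the real registry):

```python
from typing import Any, Callable, Optional

class MiniRegistry:
    """Illustrative stand-in for CustomSplitterRegistry."""

    def __init__(self) -> None:
        self._splitters: dict[str, Callable[[str], list[str]]] = {}

    def register(self, *args: Any, name: Optional[str] = None):
        if args and callable(args[0]):
            # Direct call: register(callback, lang1, lang2, ...)
            callback, langs = args[0], args[1:]
            for lang in langs:
                self._splitters[lang] = callback
            return callback

        # Decorator form: @register(lang1, lang2, ...) returns a decorator.
        def decorator(callback: Callable[[str], list[str]]):
            for lang in args:
                self._splitters[lang] = callback
            return callback

        return decorator

registry = MiniRegistry()

@registry.register("en", "fr")
def dot_splitter(text: str) -> list[str]:
    return [s.strip() for s in text.split(".") if s.strip()]

registry.register(dot_splitter, "es")  # direct-call form
```

Checking whether the first positional argument is callable is what lets one method serve both calling conventions.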
split
Processes a text using a splitter registered for the given language.
Parameters:

- text (str) – The text to split.
- lang (str) – The language of the text.

Returns:

- tuple[list[str], str] – A tuple containing a list of sentences and the name of the splitter used.
Raises:

- CallbackError – If the splitter callback fails.
- TypeError – If the splitter returns the wrong type.
Examples:
>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
... return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
Source code in src/chunklet/sentence_splitter/registry.py
unregister
Remove splitter(s) from the registry.
Parameters:

- *langs (str, default: ()) – Language codes to remove.
FallbackSplitter
Rule-based, language-agnostic sentence boundary detector.
A rule-based sentence boundary detection tool that doesn't rely on hardcoded lists of abbreviations or sentence terminators, making it adaptable to various text formats and domains.
FallbackSplitter uses regex patterns to split text into sentences, handling:

- Common sentence-ending punctuation (., !, ?)
- Abbreviations and acronyms (e.g., Dr., Ph.D., U.S.)
- Numbered lists and headings
- Multi-punctuation sequences (e.g., !!!, ?!)
- Line breaks and whitespace normalization
- Decimal numbers and inline numbers
Sentences are conservatively segmented, prioritizing context over aggressive splitting, which reduces false splits inside abbreviations, multi-punctuation sequences, or numeric constructs.
Initializes regex patterns for sentence splitting.
Methods:
-
split–Splits text into sentences using rule-based regex patterns.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
split
Splits text into sentences using rule-based regex patterns.
Parameters:

- text (str) – The input text to be segmented into sentences.

Returns:

- List[str] – A list of sentences after segmentation.
Notes
- Normalizes numbered lists during splitting and restores them afterward.
- Handles punctuation, newlines, and common edge cases.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
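To see why conservative segmentation matters, consider a much simpler regex rule (illustrative code, not chunklet's implementation): split on sentence-ending punctuation followed by whitespace and an uppercase letter. It handles the easy case but mis-splits after abbreviations like "Dr." — exactly the kind of false split FallbackSplitter's more elaborate patterns avoid.

```python
import re

# Naive rule: split on ., !, or ? followed by whitespace and a capital.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def naive_split(text: str) -> list[str]:
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]

naive_split("Hello world. How are you?")
# ['Hello world.', 'How are you?']

naive_split("Dr. Smith arrived late.")
# Wrongly splits after the abbreviation:
# ['Dr.', 'Smith arrived late.']
```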
SentenceSplitter
Bases: BaseSplitter
A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
Key Features:

- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Custom Splitters: Uses a centralized registry for custom splitting logic.
- Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
- Robust Error Handling: Provides clear error reporting for issues with custom splitters.
- Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.
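The fallback mechanism amounts to a dispatch order over available splitters. A minimal sketch, assuming custom registrations take priority over built-in language support (an assumption about ordering; all names here are illustrative, not chunklet's actual code):

```python
from typing import Callable

Splitter = Callable[[str], list[str]]

def choose_splitter(
    lang: str,
    custom: dict[str, Splitter],
    supported: dict[str, Splitter],
    fallback: Splitter,
) -> Splitter:
    """Pick a splitter: custom registry first, then a language-specific
    algorithm, then the universal rule-based fallback."""
    if lang in custom:
        return custom[lang]
    if lang in supported:
        return supported[lang]
    return fallback

# Hypothetical usage: an unsupported language falls through to the fallback.
rule_based = lambda text: [text]
chosen = choose_splitter("xx", custom={}, supported={}, fallback=rule_based)
```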
Initializes the SentenceSplitter.
Parameters:

- verbose (bool, default: False) – If True, enables verbose logging for debugging and informational messages.
Methods:

- detected_top_language – Detects the top language of the given text using py3langid.
- split – Splits a given text into a list of sentences.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
detected_top_language
Detects the top language of the given text using py3langid.
Parameters:

- text (str) – The input text to detect the language for.

Returns:

- tuple[str, float] – A tuple containing the detected language code and its confidence.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split
Splits a given text into a list of sentences.
Parameters:

- text (str) – The input text to be split.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

Returns:

- list[str] – A list of sentences.
Examples:
>>> splitter = SentenceSplitter()
>>> splitter.split("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
pretty_errors
Formats Pydantic validation errors into a human-readable string.
Source code in src/chunklet/common/validation.py
validate_input
A decorator that validates function inputs and outputs.
A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.
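The wrap-and-re-raise pattern can be sketched as follows. Note this is illustrative: chunklet wraps Pydantic's validate_call, while here a plain int() parse stands in for validation, and `friendly_errors`/`parse_count` are hypothetical names.

```python
import functools

class InvalidInputError(Exception):
    """Illustrative stand-in for chunklet's InvalidInputError."""

def friendly_errors(func):
    """Re-raise low-level validation failures as a user-friendly error."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ValueError as e:
            raise InvalidInputError(f"Invalid input: {e}") from e
    return wrapper

@friendly_errors
def parse_count(raw: str) -> int:
    return int(raw)
```

parse_count("3") returns 3 as usual; parse_count("three") raises InvalidInputError (chaining the original error via `from e`) instead of a bare ValueError, so callers see one consistent exception type.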