chunklet.sentence_splitter
Modules:
- languages: This module contains the language sets for the supported sentence splitters.
- registry
- sentence_splitter
Classes:
- BaseSplitter: Base class for sentence splitting.
- CallbackError: Raised when a callback function provided to chunker or splitter fails during execution.
- CustomSplitterRegistry
- FallbackSplitter: Rule-based, language-agnostic sentence boundary detector.
- SentenceSplitter: A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
Functions:
- deprecated_callable: Decorate a function or class with a warning message.
- log_info: Log an info message if verbose is enabled.
- pretty_errors: Formats Pydantic validation errors into a human-readable string.
- read_text_file: Read text file with automatic encoding detection.
- validate_input: A decorator that validates function inputs and outputs.
BaseSplitter
Base class for sentence splitting. Defines the interface that all splitter implementations must adhere to.
Methods:
- split: Split text into sentences.
- split_text: Splits the given text into a list of sentences.
split
Split text into sentences.
Note
Deprecated since 2.2.0. Will be removed in 3.0.0. Use split_text instead.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_text
Splits the given text into a list of sentences.
Parameters:
- text (str): The input text to be split.
- lang (str, default: 'auto'): The language of the text (e.g., 'en', 'fr', 'auto').
Returns:
- list[str]: A list of sentences extracted from the text.
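To illustrate what adhering to such an interface can look like, here is a minimal sketch. SplitterInterface and LineSplitter are hypothetical stand-ins defined only for this example; whether BaseSplitter itself is built on abc is an assumption, not something this page confirms.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SplitterInterface(ABC):
    """Hypothetical stand-in for the BaseSplitter contract (assumption)."""

    @abstractmethod
    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        """Return the sentences found in text."""


class LineSplitter(SplitterInterface):
    """Toy implementation: treats every non-empty line as one sentence."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]


sentences = LineSplitter().split_text("First line.\n\nSecond line.\n")
print(sentences)  # ['First line.', 'Second line.']
```

Because split_text is abstract in the sketch, a subclass that forgets to implement it cannot be instantiated, which surfaces the mistake early.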
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
CallbackError
Bases: ChunkletError
Raised when a callback function provided to chunker or splitter fails during execution.
CustomSplitterRegistry
Methods:
- clear: Clears all registered splitters from the registry.
- is_registered: Check if a splitter is registered for the given language.
- register: Register a splitter callback for one or more languages.
- split: Processes a text using a splitter registered for the given language.
- unregister: Remove splitter(s) from the registry.
Attributes:
- splitters: Returns a shallow copy of the dictionary of registered splitters.
splitters
property
Returns a shallow copy of the dictionary of registered splitters.
This prevents external modification of the internal registry state.
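The effect of returning a shallow copy can be sketched as follows. MiniRegistry and its _splitters attribute are hypothetical names used only for illustration; the real registry's internal layout is not shown on this page.

```python
class MiniRegistry:
    """Hypothetical stand-in, not chunklet's CustomSplitterRegistry."""

    def __init__(self):
        self._splitters = {}

    @property
    def splitters(self):
        # Return a new dict so callers cannot add or remove entries
        # in the internal mapping.
        return dict(self._splitters)


registry = MiniRegistry()
registry._splitters["en"] = ("my_splitter", str.split)

snapshot = registry.splitters
snapshot["fr"] = ("other", str.split)  # changes only the copy
snapshot.clear()                       # still only the copy

print(sorted(registry.splitters))  # ['en']
```

A shallow copy protects the mapping itself. The values stored in it are still shared objects, which is fine here because callables are not mutated.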
clear
is_registered
register
Register a splitter callback for one or more languages.
This method can be used in two ways:
- As a decorator: @registry.register("en", "fr", name="my_splitter") placed directly above def my_splitter(text): ...
- As a direct function call: registry.register(my_splitter, "en", "fr", name="my_splitter")
Parameters:
- *args (Any, default: ()): The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.
- name (str, default: None): The name of the splitter. If None, attempts to use the callback's name.
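One way a single method can accept both call styles is to check whether the first positional argument is callable. The sketch below is an assumption about how this could be done, not the library's actual code; MiniRegistry and _store are hypothetical names.

```python
from __future__ import annotations


class MiniRegistry:
    """Hypothetical stand-in that mimics the two call styles."""

    def __init__(self) -> None:
        self._splitters = {}

    def register(self, *args, name: str | None = None):
        def _store(callback, langs):
            if not langs:
                raise ValueError("at least one language code is required")
            # Fall back to the callback's own name when none is given.
            label = name or getattr(callback, "__name__", "anonymous")
            for lang in langs:
                self._splitters[lang] = (label, callback)
            return callback

        # Direct call: the first positional argument is the callback.
        if args and callable(args[0]):
            return _store(args[0], args[1:])

        # Decorator: every positional argument is a language code.
        return lambda callback: _store(callback, args)


registry = MiniRegistry()


@registry.register("en", "fr", name="my_splitter")
def my_splitter(text):
    return text.split(". ")


def other_splitter(text):
    return text.splitlines()


registry.register(other_splitter, "de")
print(sorted(registry._splitters))  # ['de', 'en', 'fr']
```

Returning the callback unchanged from the decorator path keeps the decorated function usable under its original name.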
Source code in src/chunklet/sentence_splitter/registry.py
split
Processes a text using a splitter registered for the given language.
Parameters:
- text (str): The text to split.
- lang (str): The language of the text.
Returns:
- tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.
Raises:
- CallbackError: If the splitter callback fails.
- TypeError: If the splitter returns the wrong type.
Examples:
>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
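The two failure modes listed under Raises can be sketched like this. run_splitter and the local CallbackError class are hypothetical stand-ins, and the exact checks and messages in the library may differ.

```python
class CallbackError(Exception):
    """Local stand-in for chunklet's CallbackError (assumption)."""


def run_splitter(splitters, text, lang):
    """Call the splitter registered for lang and validate its result."""
    name, callback = splitters[lang]
    try:
        result = callback(text)
    except Exception as exc:
        # Any failure inside the user callback is reported uniformly.
        raise CallbackError(f"Splitter '{name}' failed: {exc}") from exc
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise TypeError(f"Splitter '{name}' must return list[str]")
    return result, name


splitters = {
    "xx": ("custom_splitter", lambda text: text.split(" ")),
    "bad": ("broken", lambda text: 1 / 0),      # raises inside the callback
    "odd": ("wrong_type", lambda text: text),   # returns str, not list[str]
}

result = run_splitter(splitters, "Hello World", "xx")
print(result)  # (['Hello', 'World'], 'custom_splitter')
```

Chaining with `from exc` preserves the original traceback, so the underlying cause of a callback failure stays visible.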
Source code in src/chunklet/sentence_splitter/registry.py
unregister
Remove splitter(s) from the registry.
Parameters:
- *langs (str, default: ()): Language codes to remove.
FallbackSplitter
Rule-based, language-agnostic sentence boundary detector.
A rule-based sentence boundary detection tool that doesn't rely on hardcoded lists of abbreviations or sentence terminators, making it adaptable to various text formats and domains.
FallbackSplitter uses regex patterns to split text into sentences, handling:
- Common sentence-ending punctuation (., !, ?)
- Abbreviations and acronyms (e.g., Dr., Ph.D., U.S.)
- Numbered lists and headings
- Multi-punctuation sequences (e.g., ! ! !, ?!)
- Line breaks and whitespace normalization
- Decimal numbers and inline numbers
Sentences are conservatively segmented, prioritizing context over aggressive splitting, which reduces false splits inside abbreviations, multi-punctuation sequences, or numeric constructs.
Initializes regex patterns for sentence splitting.
Methods:
- split: Splits text into sentences using rule-based regex patterns.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
split
Splits text into sentences using rule-based regex patterns.
Parameters:
- text (str): The input text to be segmented into sentences.
Returns:
- list[str]: A list of sentences after segmentation.
Notes
- Normalizes numbered lists during splitting and restores them afterward.
- Handles punctuation, newlines, and common edge cases.
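As a rough illustration of conservative, rule-based splitting, the sketch below uses two regexes that are far simpler than whatever FallbackSplitter actually does. The patterns and the naive_split name are assumptions made for this example, and numbered lists are not handled.

```python
import re

# Candidate boundary: whitespace after ., ! or ? when an uppercase letter follows.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# Heuristic: a chunk ending in a short capitalized token or a single letter
# plus a period ("Dr.", "U.S.") is treated as an abbreviation, not a sentence end.
_ABBREV_TAIL = re.compile(r"\b(?:[A-Z][a-z]{0,2}|[a-z])\.$")


def naive_split(text):
    """Split text into sentences, preferring a missed split to a false one."""
    text = re.sub(r"\s+", " ", text).strip()  # normalize line breaks and spaces
    sentences = []
    for piece in _BOUNDARY.split(text):
        if not piece:
            continue
        if sentences and _ABBREV_TAIL.search(sentences[-1]):
            # The previous chunk ended in an abbreviation: glue the pieces back.
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


print(naive_split("Dr. Smith paid 3.5 dollars. He left! Really?! Yes."))
# ['Dr. Smith paid 3.5 dollars.', 'He left!', 'Really?!', 'Yes.']
```

The decimal in "3.5" survives because no whitespace follows its period. The heuristic is deliberately conservative: a sentence that really ends in a short capitalized word such as "Bob." would be merged with the next one.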
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
SentenceSplitter
Bases: BaseSplitter
A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
Key Features:
- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Custom Splitters: Uses centralized registry for custom splitting logic.
- Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
- Robust Error Handling: Provides clear error reporting for issues with custom splitters.
- Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.
Initializes the SentenceSplitter.
Parameters:
- verbose (bool, default: False): If True, enables verbose logging for debugging and informational messages.
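The post-processing step named among the key features (filtering empty strings and rejoining stray punctuation) could look roughly like the sketch below. clean_sentences is a hypothetical helper written for this example; the library's own rules are not shown on this page.

```python
import re

# A fragment made up only of punctuation, e.g. "." or "?!".
_ONLY_PUNCT = re.compile(r"^[^\w\s]+$")


def clean_sentences(raw):
    """Drop empty items and glue punctuation-only items to the previous sentence."""
    cleaned = []
    for item in raw:
        item = item.strip()
        if not item:
            continue  # filter empty or whitespace-only strings
        if cleaned and _ONLY_PUNCT.match(item):
            cleaned[-1] += item  # rejoin stray punctuation
        else:
            cleaned.append(item)
    return cleaned


print(clean_sentences(["Hello world", ".", "", "  ", "How are you", "?!", "Fine."]))
# ['Hello world.', 'How are you?!', 'Fine.']
```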
Methods:
- detected_top_language: Detects the top language of the given text using py3langid.
- split_file: Read and split a file into sentences.
- split_text: Splits a given text into a list of sentences.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
detected_top_language
Detects the top language of the given text using py3langid.
Parameters:
- text (str): The input text to detect the language for.
Returns:
- tuple[str, float]: A tuple containing the detected language code and its confidence.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_file
Read and split a file into sentences.
Parameters:
- path (str | Path): Path to the file to read.
- lang (str, default: 'auto'): The language of the text (e.g., 'en', 'fr', 'auto').
Returns:
- list[str]: A list of sentences extracted from the file.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_text
Splits a given text into a list of sentences.
Parameters:
- text (str): The input text to be split.
- lang (str, default: 'auto'): The language of the text (e.g., 'en', 'fr', 'auto').
Returns:
- list[str]: A list of sentences.
Examples:
>>> splitter = SentenceSplitter()
>>> splitter.split_text("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split_text("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
deprecated_callable
Decorate a function or class with a warning message.
This decorator marks a function or class as deprecated.
Parameters:
- use_instead (str): Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
- deprecated_in (str): Version when the function was deprecated (e.g., "2.2.0").
- removed_in (str): Version when the function will be removed (e.g., "3.0.0").
Returns:
- Callable: Decorator function that wraps the source function/class.
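A decorator with these three parameters can be sketched with the standard warnings module. deprecated_sketch is a hypothetical stand-in that handles plain functions only; the warning category and message wording used by deprecated_callable are assumptions.

```python
import functools
import warnings


def deprecated_sketch(use_instead, deprecated_in, removed_in):
    """Hypothetical stand-in for deprecated_callable (functions only)."""

    def decorator(func):
        message = (
            f"{func.__name__} is deprecated since {deprecated_in} and will be "
            f"removed in {removed_in}. Use {use_instead} instead."
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not this wrapper.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_sketch("split_text", "2.2.0", "3.0.0")
def split(text):
    return text.split(". ")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = split("One. Two")

print(result)                  # ['One', 'Two']
print(str(caught[0].message))
```

The wrapped function still runs and returns its normal result; the warning is only a side effect of the call.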
Source code in src/chunklet/common/deprecation.py
log_info
Log an info message if verbose is enabled.
This is a convenience function that only logs when verbose mode is enabled, avoiding unnecessary log output in production.
Parameters:
- verbose (bool): If True, logs the message; if False, does nothing.
- *args: Positional arguments passed to logger.info().
- **kwargs: Keyword arguments passed to logger.info().
Example
>>> log_info(True, "Processing file: {}", filepath)
Processing file: /path/to/file
>>> log_info(False, "This will not be logged")
(no output)
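The gating behaviour can be sketched with the standard logging module. This is an assumption-laden stand-in: the {} placeholder in the example above suggests the library uses a logger with brace-style formatting, whereas the standard library logger used here interpolates %s. log_info_sketch and ListHandler are hypothetical names.

```python
import logging

logger = logging.getLogger("chunklet_sketch")


def log_info_sketch(verbose, *args, **kwargs):
    """Forward to logger.info only when verbose is truthy."""
    if verbose:
        logger.info(*args, **kwargs)


class ListHandler(logging.Handler):
    """Collects formatted messages so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger

log_info_sketch(True, "Processing file: %s", "/path/to/file")
log_info_sketch(False, "This will not be logged")

print(handler.messages)  # ['Processing file: /path/to/file']
```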
Source code in src/chunklet/common/logging_utils.py
pretty_errors
Formats Pydantic validation errors into a human-readable string.
Source code in src/chunklet/common/validation.py
read_text_file
Read text file with automatic encoding detection.
Parameters:
- path (str | Path): File path to read.
Returns:
- str: File content.
Raises:
- FileProcessingError: If the file cannot be read.
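This page does not say how the encoding is detected. One simple technique, swapped in here purely for illustration, is to try a short list of candidate encodings in order. read_text_sketch, its encodings parameter, and the local FileProcessingError class are hypothetical.

```python
import tempfile
from pathlib import Path


class FileProcessingError(Exception):
    """Local stand-in for chunklet's FileProcessingError (assumption)."""


def read_text_sketch(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Read a file and decode it with the first candidate encoding that works."""
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        raise FileProcessingError(f"Cannot read {path}: {exc}") from exc
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise FileProcessingError(f"Cannot decode {path} with any of {encodings}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("café".encode("cp1252"))  # not valid UTF-8
    content = read_text_sketch(sample)
    missing = Path(tmp) / "missing.txt"

print(content)  # café
```

Order matters in this approach: strict encodings go first, and latin-1 goes last because it accepts any byte sequence. A statistical detector would be more reliable for encodings outside the candidate list.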
Source code in src/chunklet/common/path_utils.py
validate_input
A decorator that validates function inputs and outputs.
A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.