chunklet.sentence_splitter
Modules:
-
languages – This module contains the language sets for the supported sentence splitters.
-
registry
-
sentence_splitter
Classes:
-
BaseSplitter – Base class for sentence splitting.
-
CallbackError – Raised when a callback function provided to chunker or splitter fails during execution.
-
CustomSplitterRegistry
-
SentenceSplitter – A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
-
UniversalSplitter – Language-agnostic sentence boundary detector using regex patterns.
Functions:
-
deprecated_callable – Decorate a function or class with a warning message.
-
log_info – Log an info message if verbose is enabled.
-
pretty_errors – Formats Pydantic validation errors into a human-readable string.
-
read_text_file – Read text file with automatic encoding detection.
-
validate_input – A decorator that validates function inputs and outputs.
BaseSplitter
Base class for sentence splitting. Defines the interface that all splitter implementations must adhere to.
Methods:
-
split – Split text into sentences.
-
split_text – Splits the given text into a list of sentences.
split
Split text into sentences.
Note
Deprecated since 2.2.0. Will be removed in 3.0.0. Use split_text instead.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_text
Splits the given text into a list of sentences.
Parameters:
-
text (str) – The input text to be split.
-
lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
Returns:
-
list[str] – A list of sentences extracted from the text.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
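The interface described above can be pictured with a small sketch. This assumes BaseSplitter behaves like an abstract base class whose subclasses implement split_text; that is an assumption, since the page does not show the class body. The names BaseSplitterSketch and LineSplitter are hypothetical and are not part of chunklet.

```python
from abc import ABC, abstractmethod


class BaseSplitterSketch(ABC):
    """Hypothetical stand-in for the documented interface (not chunklet code)."""

    @abstractmethod
    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        """Return the sentences found in `text`."""


class LineSplitter(BaseSplitterSketch):
    """Toy implementation that treats every non-empty line as a sentence."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]


print(LineSplitter().split_text("first line\n\n  second line  \n"))
# ['first line', 'second line']
```

Keeping the signature identical across implementations is what lets callers swap one splitter for another without changing call sites.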
CallbackError
Bases: ChunkletError
Raised when a callback function provided to chunker or splitter fails during execution.
CustomSplitterRegistry
Methods:
-
clear – Clears all registered splitters from the registry.
-
is_registered – Check if a splitter is registered for the given language.
-
register – Register a splitter callback for one or more languages.
-
split – Processes a text using a splitter registered for the given language.
-
unregister – Remove splitter(s) from the registry.
Attributes:
-
splitters – Returns a shallow copy of the dictionary of registered splitters.
splitters
property
Returns a shallow copy of the dictionary of registered splitters.
This prevents external modification of the internal registry state.
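A minimal sketch of the copy-on-read pattern described above. CopyOnReadRegistry is a hypothetical stand-in, not chunklet's class; it only illustrates why handing out a shallow copy keeps callers from changing the internal mapping.

```python
from typing import Callable


class CopyOnReadRegistry:
    """Hypothetical stand-in used only to illustrate the pattern."""

    def __init__(self) -> None:
        self._splitters: dict[str, Callable[[str], list[str]]] = {}

    @property
    def splitters(self) -> dict[str, Callable[[str], list[str]]]:
        # dict.copy() is shallow: a new dict that holds the same callables.
        return self._splitters.copy()


registry = CopyOnReadRegistry()
registry._splitters["xx"] = str.split

snapshot = registry.splitters
snapshot.clear()  # empties the copy only

print("xx" in registry.splitters)  # True
```

Because the copy is shallow, the callables themselves are shared; only the mapping is protected.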
clear
is_registered
register
Register a splitter callback for one or more languages.
This method can be used in two ways:
-
As a decorator:
@registry.register("en", "fr", name="my_splitter")
def my_splitter(text): ...
-
As a direct function call:
registry.register(my_splitter, "en", "fr", name="my_splitter")
Parameters:
-
*args (Any, default: ()) – The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.
-
name (str | None, default: None) – The name of the splitter. If None, attempts to use the callback's name.
Source code in src/chunklet/sentence_splitter/registry.py
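One way a single method can support both call styles is to check whether the first positional argument is callable. The sketch below shows that idea under stated assumptions: DualUseRegistry and its private _store helper are hypothetical, and chunklet's actual implementation may differ.

```python
from typing import Any, Callable, Optional

Splitter = Callable[[str], list[str]]


class DualUseRegistry:
    """Hypothetical stand-in; not chunklet's CustomSplitterRegistry."""

    def __init__(self) -> None:
        # language code -> (splitter name, callback)
        self._splitters: dict[str, tuple[str, Splitter]] = {}

    def _store(self, callback: Splitter, langs: tuple, name: Optional[str]) -> Splitter:
        # Fall back to the callback's own name when none is given.
        label = name or getattr(callback, "__name__", "<unnamed>")
        for lang in langs:
            self._splitters[lang] = (label, callback)
        return callback

    def register(self, *args: Any, name: Optional[str] = None):
        if args and callable(args[0]):
            # Direct call: register(callback, "en", "fr")
            callback, *langs = args
            return self._store(callback, tuple(langs), name)

        # Decorator form: register("en", "fr") returns the real decorator.
        def decorator(callback: Splitter) -> Splitter:
            return self._store(callback, args, name)

        return decorator


registry = DualUseRegistry()


@registry.register("en", "fr", name="by_period")
def by_period(text: str) -> list[str]:
    return [part.strip() for part in text.split(".") if part.strip()]


registry.register(str.splitlines, "xx")  # name falls back to "splitlines"

print(sorted(registry._splitters))   # ['en', 'fr', 'xx']
print(registry._splitters["xx"][0])  # splitlines
```

Language codes are strings and strings are not callable, so the check on the first argument is enough to tell the two forms apart.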
split
Processes a text using a splitter registered for the given language.
Parameters:
-
text (str) – The text to split.
-
lang (str) – The language of the text.
Returns:
-
tuple[list[str], str] – A tuple containing a list of sentences and the name of the splitter used.
Raises:
-
CallbackError – If the splitter callback fails.
-
TypeError – If the splitter returns the wrong type.
Examples:
>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
Source code in src/chunklet/sentence_splitter/registry.py
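The two failure modes listed under Raises can be sketched as a small wrapper: any exception from the callback is wrapped in a single error type, and the return value is checked before it is handed back. CallbackErrorSketch and run_splitter are hypothetical names for illustration; chunklet's real CallbackError derives from ChunkletError, and its internal checks are not shown on this page.

```python
from typing import Callable


class CallbackErrorSketch(Exception):
    """Stand-in for chunklet's CallbackError (assumed shape)."""


def run_splitter(callback: Callable[[str], list], text: str, name: str) -> tuple:
    try:
        result = callback(text)
    except Exception as exc:
        # Wrap any failure so callers see one predictable exception type.
        raise CallbackErrorSketch(f"Splitter '{name}' failed: {exc}") from exc

    # Validate the shape of the result outside the try block so this
    # TypeError is not wrapped as a callback failure.
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise TypeError(
            f"Splitter '{name}' must return list[str], got {type(result).__name__}"
        )

    return result, name


print(run_splitter(str.split, "Hello World", "whitespace"))
# (['Hello', 'World'], 'whitespace')
```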
unregister
Remove splitter(s) from the registry.
Parameters:
-
*langs (str, default: ()) – Language codes to remove.
SentenceSplitter
Bases: BaseSplitter
A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
Key Features:
- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Custom Splitters: Uses a centralized registry for custom splitting logic.
- Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
- Robust Error Handling: Provides clear error reporting for issues with custom splitters.
- Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.
Initializes the SentenceSplitter.
Parameters:
-
verbose (bool, default: False) – If True, enables verbose logging for debugging and informational messages.
Methods:
-
detected_top_language – Detects the top language of the given text using py3langid.
-
split_file – Read and split a file into sentences.
-
split_text – Splits a given text into a list of sentences.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
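The post-processing step named in the feature list (filtering empty strings and rejoining stray punctuation) could look roughly like the sketch below. The postprocess function and its regex are assumptions made for illustration; the class's actual clean-up rules are not shown on this page.

```python
import re

# A fragment made only of punctuation or symbols (no letters, digits, or spaces).
_PUNCT_ONLY = re.compile(r"^[^\w\s]+$")


def postprocess(fragments: list[str]) -> list[str]:
    cleaned: list[str] = []
    for fragment in fragments:
        fragment = fragment.strip()
        if not fragment:
            continue  # drop empty or whitespace-only strings
        if cleaned and _PUNCT_ONLY.match(fragment):
            cleaned[-1] += fragment  # reattach stray punctuation
        else:
            cleaned.append(fragment)
    return cleaned


print(postprocess(["Hello world", ".", "", "  ", "How are you", "?!"]))
# ['Hello world.', 'How are you?!']
```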
detected_top_language
Detects the top language of the given text using py3langid.
Parameters:
-
text (str) – The input text to detect the language for.
Returns:
-
tuple[str, float] – A tuple containing the detected language code and its confidence.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_file
Read and split a file into sentences.
Parameters:
-
path (str | Path) – Path to the file to read.
-
lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.
Returns:
-
list[str] – A list of sentences extracted from the file.
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
split_text
Splits a given text into a list of sentences.
Parameters:
-
text (str) – The input text to be split.
-
lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.
Returns:
-
list[str] – A list of sentences.
Examples:
>>> from chunklet.sentence_splitter import SentenceSplitter
>>> splitter = SentenceSplitter()
>>> splitter.split_text("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split_text("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
UniversalSplitter
Language-agnostic sentence boundary detector using regex patterns.
It relies on Unicode-aware regex patterns rather than language-specific rules, so it can be applied to text in any language.
Handles
- Unicode sentence terminators
- Numbered lists and headings
- Quoted sentences
- Line breaks and whitespace
Use cases
- Primary splitter for languages without dedicated support
- Fallback when language-specific splitters are unavailable
Methods:
-
split – Splits text into sentences using rule-based regex patterns.
Source code in src/chunklet/sentence_splitter/_universal_splitter.py
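To give a feel for rule-based boundary detection, here is a deliberately small sketch. It covers only sentence terminators, not the numbered lists, headings, or quotes listed above. The pattern and the split_sketch name are assumptions for illustration and are not taken from chunklet's _universal_splitter.py.

```python
import re

# Split after ASCII terminators followed by whitespace, or directly after
# full-width CJK terminators (which usually have no following space).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def split_sketch(text: str) -> list[str]:
    parts = _BOUNDARY.split(text.strip())
    # Splitting on a zero-width match can leave an empty trailing piece.
    return [part for part in parts if part]


print(split_sketch("Hello world. How are you? 你好。再见！"))
# ['Hello world.', 'How are you?', '你好。', '再见！']
```

Requiring whitespace after an ASCII period keeps decimals such as 3.14 intact, but abbreviations like "Dr. Smith" would still be split; a production splitter needs extra rules for those cases.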
split
Splits text into sentences using rule-based regex patterns.
Parameters:
-
text (str) – The input text to be segmented into sentences.
Returns:
-
list[str] – A list of sentences after segmentation.
Source code in src/chunklet/sentence_splitter/_universal_splitter.py
deprecated_callable
Decorate a function or class with a warning message.
This decorator marks a function or class as deprecated.
Parameters:
-
use_instead (str) – Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
-
deprecated_in (str) – Version when the function was deprecated (e.g., "2.2.0").
-
removed_in (str) – Version when the function will be removed (e.g., "3.0.0").
Returns:
-
Callable – Decorator function that wraps the source function/class.
Source code in src/chunklet/common/deprecation.py
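A decorator with these three parameters can be sketched with the standard warnings module. The version below handles plain functions only, and its message wording is an assumption; deprecated_sketch is a hypothetical name, not chunklet's deprecated_callable.

```python
import functools
import warnings
from typing import Callable


def deprecated_sketch(use_instead: str, deprecated_in: str, removed_in: str) -> Callable:
    """Hypothetical stand-in for illustration; functions only."""

    def decorator(func: Callable) -> Callable:
        message = (
            f"{func.__name__} is deprecated since {deprecated_in} and will be "
            f"removed in {removed_in}. Use {use_instead} instead."
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated_sketch("split_text", "2.2.0", "3.0.0")
def split(text: str) -> list[str]:
    return text.split(". ")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(split("A. B"))          # ['A', 'B']
    print(caught[0].message)
```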
log_info
Log an info message if verbose is enabled.
This is a convenience function that only logs when verbose mode is enabled, avoiding unnecessary log output in production.
Parameters:
-
verbose (bool) – If True, logs the message; if False, does nothing.
-
*args – Positional arguments passed to logger.info().
-
**kwargs – Keyword arguments passed to logger.info().
Example
>>> log_info(True, "Processing file: {}", filepath)
Processing file: /path/to/file
>>> log_info(False, "This will not be logged")
(no output)
Source code in src/chunklet/common/logging_utils.py
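The gating behaviour is simple enough to sketch in a few lines. Note the assumption: this stand-in uses the standard library's logging module with %s placeholders, whereas the example above uses {} placeholders, which suggests the real logger is a different library. log_info_sketch is a hypothetical name.

```python
import logging

logger = logging.getLogger("chunklet_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)


def log_info_sketch(verbose: bool, *args, **kwargs) -> None:
    # Forward to logger.info only when verbose mode is on.
    if verbose:
        logger.info(*args, **kwargs)


# Collect messages in a list so the effect is visible without configuring output.
captured: list[str] = []


class _ListHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record.getMessage())


logger.addHandler(_ListHandler())

log_info_sketch(True, "Processing file: %s", "/path/to/file")
log_info_sketch(False, "This will not be logged")

print(captured)  # ['Processing file: /path/to/file']
```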
pretty_errors
Formats Pydantic validation errors into a human-readable string.
Source code in src/chunklet/common/validation.py
read_text_file
Read text file with automatic encoding detection.
Parameters:
-
path (str | Path) – File path to read.
Returns:
-
str – File content.
Raises:
-
FileProcessingError – If file cannot be read.
Source code in src/chunklet/common/path_utils.py
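The page does not say how the encoding is detected. As a rough illustration only, the sketch below tries a short list of encodings in order and wraps read failures in a single error type. The fallback list, read_text_with_fallback, and FileProcessingErrorSketch are all assumptions; the real function may rely on a dedicated detection library instead.

```python
import tempfile
from pathlib import Path
from typing import Union


class FileProcessingErrorSketch(Exception):
    """Stand-in for chunklet's FileProcessingError (assumed shape)."""


def read_text_with_fallback(
    path: Union[str, Path],
    encodings: tuple = ("utf-8-sig", "cp1252", "latin-1"),
) -> str:
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        raise FileProcessingErrorSketch(f"Cannot read {path}: {exc}") from exc

    for encoding in encodings:
        try:
            # "utf-8-sig" also strips a leading byte-order mark if present.
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue

    raise FileProcessingErrorSketch(f"Cannot decode {path} with {encodings}")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes("café".encode("cp1252"))  # not valid UTF-8
    print(read_text_with_fallback(sample))        # café
```

Trying encodings in order is a heuristic: latin-1 accepts any byte sequence, so it acts as a last resort and can produce wrong characters for files in other legacy encodings.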
validate_input
A decorator that validates function inputs and outputs.
A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.