chunklet.sentence_splitter

Modules:

languages –

This module contains the language sets for the supported sentence splitters.
registry –
sentence_splitter –

Classes:

BaseSplitter –

Base class for sentence splitting.
CallbackError –

Raised when a callback function provided to a chunker or splitter fails during execution.
CustomSplitterRegistry –
SentenceSplitter –

A robust and versatile utility dedicated to precisely segmenting text into individual sentences.
UniversalSplitter –

Language-agnostic sentence boundary detector using regex patterns.

Functions:

deprecated_callable –

Decorate a function or class with a deprecation warning message.
log_info –

Log an info message if verbose is enabled.
pretty_errors –

Formats Pydantic validation errors into a human-readable string.
read_text_file –

Read text file with automatic encoding detection.
validate_input –

A decorator that validates function inputs and outputs.

BaseSplitter

Base class for sentence splitting. Defines the interface that all splitter implementations must adhere to.

Methods:

split –

Split text into sentences.
split_text –

Splits the given text into a list of sentences.

split

split(text: str, lang: str = 'auto') -> list[str]

Split text into sentences.

Note

Deprecated since 2.2.0. Will be removed in 3.0.0. Use split_text instead.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

@deprecated_callable(
    use_instead="split_text", deprecated_in="2.2.0", removed_in="3.0.0"
)
def split(self, text: str, lang: str = "auto") -> list[str]:  # pragma: no cover
    """
    Split text into sentences.

    Note:
        Deprecated since 2.2.0. Will be removed in 3.0.0. Use `split_text` instead.
    """
    return self.split_text(text, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits the given text into a list of sentences.

Parameters:

text
(str) –

The input text to be split.
lang
(str, default: 'auto' ) –

The language of the text (e.g., 'en', 'fr', 'auto').

Returns:

list[str] –

A list of sentences extracted from the text.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """Splits the given text into a list of sentences.

    Args:
        text: The input text to be split.
        lang: The language of the text (e.g., 'en', 'fr', 'auto').

    Returns:
        A list of sentences extracted from the text.
    """
    raise NotImplementedError("Subclasses must implement 'split_text'.")
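
As a rough illustration of the interface contract, the sketch below subclasses a stand-in base class and implements split_text. Both StandInBaseSplitter and NewlineSplitter are hypothetical names written for this example so that it runs without chunklet installed; with the library available, one would presumably subclass BaseSplitter itself.

```python
class StandInBaseSplitter:
    """Hypothetical stand-in mirroring the BaseSplitter interface shown above."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        raise NotImplementedError("Subclasses must implement 'split_text'.")


class NewlineSplitter(StandInBaseSplitter):
    """Hypothetical subclass: treats every non-empty line as one sentence."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        # `lang` is accepted for interface compatibility but ignored here.
        return [line.strip() for line in text.splitlines() if line.strip()]


sentences = NewlineSplitter().split_text("First line.\n\nSecond line.")
print(sentences)  # ['First line.', 'Second line.']
```

Calling split_text on the stand-in base class directly raises NotImplementedError, which is how the base class signals that the method must be overridden.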

CallbackError

Bases: ChunkletError

Raised when a callback function provided to chunker or splitter fails during execution.

CustomSplitterRegistry

Methods:

clear –

Clears all registered splitters from the registry.
is_registered –

Check if a splitter is registered for the given language.
register –

Register a splitter callback for one or more languages.
split –

Processes a text using a splitter registered for the given language.
unregister –

Remove splitter(s) from the registry.

Attributes:

splitters –

Returns a shallow copy of the dictionary of registered splitters.

splitters `property`

splitters

Returns a shallow copy of the dictionary of registered splitters.

This prevents external modification of the internal registry state.
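
A minimal sketch of why returning a copy protects the registry, assuming the internal state is a plain dict. TinyRegistry is a hypothetical stand-in, not the library's class.

```python
class TinyRegistry:
    """Hypothetical stand-in for a registry exposing a copy of its state."""

    def __init__(self) -> None:
        self._splitters = {}

    @property
    def splitters(self) -> dict:
        # dict.copy() is shallow: a new dict holding the same value objects.
        return self._splitters.copy()


registry = TinyRegistry()
registry._splitters["en"] = ("demo", str.split)

snapshot = registry.splitters
snapshot["fr"] = ("rogue", str.split)  # mutates only the returned copy

print("fr" in registry.splitters)  # False
```

Because the copy is shallow, adding or removing keys on the returned dict leaves the registry untouched, but the stored values themselves are still shared objects.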

clear

clear() -> None

Clears all registered splitters from the registry.

Source code in src/chunklet/sentence_splitter/registry.py

def clear(self) -> None:
    """
    Clears all registered splitters from the registry.
    """
    self._splitters.clear()

is_registered

is_registered(lang: str) -> bool

Check if a splitter is registered for the given language.

Source code in src/chunklet/sentence_splitter/registry.py

@validate_input
def is_registered(self, lang: str) -> bool:
    """
    Check if a splitter is registered for the given language.
    """
    return lang in self._splitters

register

register(*args: Any, name: str | None = None)

Register a splitter callback for one or more languages.

This method can be used in two ways:

As a decorator: @registry.register("en", "fr", name="my_splitter") def my_splitter(text): ...
As a direct function call: registry.register(my_splitter, "en", "fr", name="my_splitter")

Parameters:

*args
(Any, default: () ) –

The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.
name
(str | None, default: None ) –

The name of the splitter. If None, attempts to use the callback's name.

Source code in src/chunklet/sentence_splitter/registry.py

def register(self, *args: Any, name: str | None = None):
    """
    Register a splitter callback for one or more languages.

    This method can be used in two ways:

    1. As a decorator:
        @registry.register("en", "fr", name="my_splitter")
        def my_splitter(text):
            ...

    2. As a direct function call:
        registry.register(my_splitter, "en", "fr", name="my_splitter")

    Args:
        *args: The arguments, which can be either (lang1, lang2, ...) for a decorator
               or (callback, lang1, lang2, ...) for a direct call.
        name: The name of the splitter. If None, attempts to use the callback's name.
    """
    if not args:
        raise ValueError("At least one language or a callback must be provided.")

    if callable(args[0]):
        # Direct call: register(callback, lang1, lang2, ...)
        callback = args[0]
        langs = args[1:]
        if not langs:
            raise ValueError(
                "At least one language must be provided for the callback."
            )
        self._register_logic(langs, callback, name)
        return callback
    else:
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator
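
The two registration styles can be tried out with a simplified, self-contained stand-in. MiniRegistry and its _store helper are hypothetical; the real _register_logic is not shown on this page, so the name fallback below is an assumption based on the docstring.

```python
from __future__ import annotations

from typing import Any, Callable


class MiniRegistry:
    """Hypothetical, simplified registry illustrating the two call styles."""

    def __init__(self) -> None:
        self._splitters: dict[str, tuple[str, Callable]] = {}

    def register(self, *args: Any, name: str | None = None):
        if not args:
            raise ValueError("At least one language or a callback must be provided.")
        if callable(args[0]):
            # Direct call: register(callback, lang1, lang2, ...)
            callback, langs = args[0], args[1:]
            if not langs:
                raise ValueError("At least one language must be provided.")
            self._store(langs, callback, name)
            return callback

        # Decorator: @register(lang1, lang2, ...)
        def decorator(cb: Callable) -> Callable:
            self._store(args, cb, name)
            return cb

        return decorator

    def _store(self, langs, callback, name):
        # Assumption: fall back to the function's own name when none is given.
        for lang in langs:
            self._splitters[lang] = (name or callback.__name__, callback)


registry = MiniRegistry()


@registry.register("en", "fr", name="by_period")
def by_period(text: str) -> list[str]:
    return [part.strip() for part in text.split(".") if part.strip()]


def by_line(text: str) -> list[str]:
    return text.splitlines()


registry.register(by_line, "xx")  # direct call, name inferred from the function

print(sorted(registry._splitters))   # ['en', 'fr', 'xx']
print(registry._splitters["xx"][0])  # by_line
```

The `callable(args[0])` check is what distinguishes the two modes, so language codes must be passed as strings and the callback, when given, must come first.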

split

split(text: str, lang: str) -> tuple[list[str], str]

Processes a text using a splitter registered for the given language.

Parameters:

text
(str) –

The text to split.
lang
(str) –

The language of the text.

Returns:

tuple[list[str], str] –

A tuple containing a list of sentences and the name of the splitter used.

Raises:

CallbackError –

If no splitter is registered for the language, the splitter callback fails, or the callback returns anything other than a list of strings.

Examples:

>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')

Source code in src/chunklet/sentence_splitter/registry.py

@validate_input
def split(self, text: str, lang: str) -> tuple[list[str], str]:
    """
    Processes a text using a splitter registered for the given language.

    Args:
        text: The text to split.
        lang: The language of the text.

    Returns:
        A tuple containing a list of sentences and the name of the splitter used.

    Raises:
        CallbackError: If no splitter is registered for the language, the callback
            fails, or the callback returns anything other than a list of strings.

    Examples:
        >>> from chunklet.sentence_splitter import CustomSplitterRegistry
        >>> registry = CustomSplitterRegistry()
        >>> @registry.register("xx", name="custom_splitter")
        ... def custom_splitter(text: str) -> list[str]:
        ...     return text.split(" ")
        >>> registry.split("Hello World", "xx")
        (['Hello', 'World'], 'custom_splitter')
    """
    splitter_info = self._splitters.get(lang)
    if not splitter_info:
        raise CallbackError(
            f"No splitter registered for language '{lang}'.\n"
            f"💡Hint: Use `.register(your_function, '{lang}')` first."
        )

    name, callback = splitter_info

    try:
        # Validate the return type
        result = callback(text)
        validator = TypeAdapter(list[str])
        validator.validate_python(result)
    except ValidationError as e:
        e.subtitle = f"{name} result"
        e.hint = "💡Hint: Make sure your splitter returns a list of strings."
        raise CallbackError(f"{pretty_errors(e)}.\n") from None
    except Exception as e:
        raise CallbackError(
            f"Splitter '{name}' for lang '{lang}' raised an exception.\nDetails: {e}"
        ) from None

    return result, name
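
The error-wrapping behaviour above can be approximated without Pydantic. In this sketch, CallbackError is a local stand-in class and run_splitter is a hypothetical helper; plain isinstance checks replace the TypeAdapter validation used in the source.

```python
class CallbackError(Exception):
    """Local stand-in for chunklet's CallbackError (for this sketch only)."""


def run_splitter(name, callback, text):
    """Call `callback`; wrap any failure or bad return type in CallbackError."""
    try:
        result = callback(text)
    except Exception as exc:
        raise CallbackError(
            f"Splitter '{name}' raised an exception. Details: {exc}"
        ) from None

    # isinstance checks stand in for validating against list[str].
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise CallbackError(f"Splitter '{name}' must return a list of strings.")
    return result, name


def bad_return(text):
    return text  # a str, not a list of str


def crashes(text):
    raise RuntimeError("boom")


messages = []
for name, fn in [("bad_return", bad_return), ("crashes", crashes)]:
    try:
        run_splitter(name, fn, "Hello World")
    except CallbackError as err:
        messages.append(str(err))

print(messages)
```

Both failure modes surface as the same exception type, so callers need only one except clause around a custom splitter.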

unregister

unregister(*langs: str) -> None

Remove splitter(s) from the registry.

Parameters:

*langs
(str, default: () ) –

Language codes to remove.

Source code in src/chunklet/sentence_splitter/registry.py

@validate_input
def unregister(self, *langs: str) -> None:
    """
    Remove splitter(s) from the registry.

    Args:
        *langs: Language codes to remove
    """
    for lang in langs:
        self._splitters.pop(lang, None)

SentenceSplitter

SentenceSplitter(verbose: bool = False)

Bases: BaseSplitter

A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

Key Features:

- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Custom Splitters: Uses a centralized registry for custom splitting logic.
- Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
- Robust Error Handling: Provides clear error reporting for issues with custom splitters.
- Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.

Initializes the SentenceSplitter.

Parameters:

verbose
(bool, default: False ) –

If True, enables verbose logging for debugging and informational messages.

Methods:

detected_top_language –

Detects the top language of the given text using py3langid.
split_file –

Read and split a file into sentences.
split_text –

Splits a given text into a list of sentences.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

@validate_input
def __init__(self, verbose: bool = False):
    """
    Initializes the SentenceSplitter.

    Args:
        verbose: If True, enables verbose logging for debugging and informational messages.
    """
    self.verbose = verbose
    self.fallback_splitter = UniversalSplitter()

    # Create a normalized identifier for language detection
    self._identifier = LanguageIdentifier.from_pickled_model(
        MODEL_FILE, norm_probs=True
    )

    # Tracked to reduce log spamming about language detection
    self._last_lang_used = None

detected_top_language

detected_top_language(text: str) -> tuple[str, float]

Detects the top language of the given text using py3langid.

Parameters:

text
(str) –

The input text to detect the language for.

Returns:

tuple[str, float] –

A tuple containing the detected language code and its confidence.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

@validate_input
def detected_top_language(self, text: str) -> tuple[str, float]:
    """
    Detects the top language of the given text using py3langid.

    Args:
        text: The input text to detect the language for.

    Returns:
        A tuple containing the detected language code and its confidence.
    """
    lang_detected, confidence = self._identifier.classify(text)
    log_info(
        self.verbose,
        "Language detection: '{}' with confidence {}.",
        lang_detected,
        # Scale first, then round, so 0.73 is reported as 7/10 rather than 10/10.
        f"{round(confidence * 10)}/10",
    )
    return lang_detected, confidence

split_file

split_file(
    path: str | Path, lang: str = "auto"
) -> list[str]

Read and split a file into sentences.

Parameters:

path
(str | Path) –

Path to the file to read.
lang
(str, default: 'auto' ) –

The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

Returns:

list[str] –

A list of sentences extracted from the file.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

def split_file(self, path: str | Path, lang: str = "auto") -> list[str]:
    """
    Read and split a file into sentences.

    Args:
        path: Path to the file to read.
        lang: The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

    Returns:
        A list of sentences extracted from the file.
    """
    content = read_text_file(path)
    return self.split_text(content, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits a given text into a list of sentences.

Parameters:

text
(str) –

The input text to be split.
lang
(str, default: 'auto' ) –

The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

Returns:

list[str] –

A list of sentences.

Examples:

>>> splitter = SentenceSplitter()
>>> splitter.split_text("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split_text("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']

Source code in src/chunklet/sentence_splitter/sentence_splitter.py

@validate_input
def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """
    Splits a given text into a list of sentences.

    Args:
        text: The input text to be split.
        lang: The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'

    Returns:
        A list of sentences.

    Examples:
        >>> splitter = SentenceSplitter()
        >>> splitter.split_text("Hello world. How are you?", "en")
        ['Hello world.', 'How are you?']
        >>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
        ['Bonjour le monde.', 'Comment allez-vous?']
        >>> splitter.split_text("Hello world. How are you?", "auto")
        ['Hello world.', 'How are you?']
    """
    if not text:
        log_info(self.verbose, "Input text is empty. Returning empty list.")
        return []

    if lang == "auto":
        if self._last_lang_used is None:
            logger.warning(
                "The language is set to `auto`. Consider setting the `lang` parameter "
                "to a specific language to improve reliability."
            )
        lang_detected, confidence = self.detected_top_language(text)
        lang = lang_detected if confidence >= 0.7 else "fallback"

    self._last_lang_used = lang

    sentences = None
    if lang != "fallback":
        # Prioritize custom splitters from registry
        if custom_splitter_registry.is_registered(lang):
            sentences, splitter_name = custom_splitter_registry.split(text, lang)
            log_info(self.verbose, "Using registered splitter: {}", splitter_name)
        elif (handler := self._get_special_lang_handler(lang, self.verbose)) is not None:
            sentences = handler(text)

    # If no handler found, use fallback
    if sentences is None:
        logger.warning(
            "Using a universal rule-based splitter.\n"
            "Reason: Language not supported or detected with low confidence."
        )
        sentences = self.fallback_splitter.split(text)

    cleaned_sentences = self._clean_sentences(sentences)
    log_info(
        self.verbose,
        "Text split into sentences. Total sentences detected: {}",
        len(cleaned_sentences),
    )
    return cleaned_sentences
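
The selection order in split_text (custom registry first, then a built-in language handler, then the universal fallback) can be summarized in a small stand-alone sketch. The functions resolve_lang and pick_splitter and the two language sets are hypothetical; only the 0.7 threshold and the ordering are taken from the source above.

```python
CONFIDENCE_THRESHOLD = 0.7  # value taken from the source above


def resolve_lang(lang, detected, confidence):
    """Mirror the 'auto' branch: trust detection only at or above the threshold."""
    if lang != "auto":
        return lang
    return detected if confidence >= CONFIDENCE_THRESHOLD else "fallback"


def pick_splitter(lang, custom, builtin):
    """Return which tier would handle `lang`: custom, builtin, or fallback."""
    if lang != "fallback":
        if lang in custom:
            return "custom"
        if lang in builtin:
            return "builtin"
    return "fallback"


custom = {"xx"}         # hypothetical languages with a registered custom splitter
builtin = {"en", "fr"}  # hypothetical languages with a dedicated handler

print(pick_splitter(resolve_lang("auto", "en", 0.93), custom, builtin))  # builtin
print(pick_splitter(resolve_lang("auto", "en", 0.41), custom, builtin))  # fallback
print(pick_splitter(resolve_lang("xx", "en", 0.99), custom, builtin))    # custom
print(pick_splitter(resolve_lang("zz", "en", 0.99), custom, builtin))    # fallback
```

Passing an explicit language skips detection entirely, which is why the warning in the source recommends setting `lang` when the language is known.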

UniversalSplitter

UniversalSplitter()

Language-agnostic sentence boundary detector using regex patterns.

A universal splitter using Unicode-aware regex patterns for any language.

Handles

Unicode sentence terminators
Numbered lists and headings
Quoted sentences
Line breaks and whitespace

Use cases

Primary splitter for languages without dedicated support
Fallback when language-specific splitters are unavailable

Methods:

split –

Splits text into sentences using rule-based regex patterns.

Source code in src/chunklet/sentence_splitter/_universal_splitter.py

def __init__(self):
    self.sentence_terminators = "".join(GLOBAL_SENTENCE_TERMINATORS)
    self.flattened_numbered_list_pattern = re.compile(
        rf"(?<=[{self.sentence_terminators}:])\s+(\p{{N}}\.)+"
    )

    self.quote_or_paren_pattern = re.compile(
        r"(\p{Pi}|['\"]).+?(\p{Pf}|\1)|"
        r"\p{Ps}.+?\p{Pe}",
        re.DOTALL,
    )

    self.hashed_pattern = re.compile(r"##-?\d+##")
    self.numbered_list_pattern = re.compile(r"[\n:]\s*\p{N}\.")

    # Core sentence split regex
    self.sentence_end_pattern = re.compile(
        rf"""
        (?<!\b(\p{{Lu}}\p{{Ll}}{{1, 4}}\.)*)   # Latin-only abbreviation
        (?<=[{self.sentence_terminators}])       # sentence-ending punctuation
        (?=\s+[\p{{Lu}}\p{{Lo}}\p{{Lt}}]|\s*\n|\s*$)  # followed by letter (upper or catch-all) or end
        """,
        re.VERBOSE,
    )

split

split(text: str) -> list[str]

Splits text into sentences using rule-based regex patterns.

Parameters:

text
(str) –

The input text to be segmented into sentences.

Returns:

list[str] –

A list of sentences after segmentation.

Source code in src/chunklet/sentence_splitter/_universal_splitter.py

def split(self, text: str) -> list[str]:
    """
    Splits text into sentences using rule-based regex patterns.

    Args:
        text: The input text to be segmented into sentences.

    Returns:
        A list of sentences after segmentation.
    """
    def mask(match: re.Match, norm_map: dict):
        # Hash the matched span and convert the integer to a string,
        # because the replacement passed to re.sub must be a string.
        # Fence it with ## so placeholders are easy to find later.
        hashed_str = f"##{hash(match.group())}##"

        # Store the mapping for later reconstruction
        norm_map[hashed_str] = match.group()
        return hashed_str

    def unmask(match: re.Match, norm_map: dict):
        return norm_map.get(match.group(), match.group())

    text = self.flattened_numbered_list_pattern.sub(r"\n \1", text.strip())

    # Mask quoted/parenthesized spans and list markers so they are not split
    norm_map = {}
    text = self.quote_or_paren_pattern.sub(
        lambda m: mask(m, norm_map), text
    )
    text = self.numbered_list_pattern.sub(
        lambda m: mask(m, norm_map), text
    )

    # First split on sentence-ending punctuation,
    # then split each piece further on newlines
    final_sentences = []
    sentences = self.sentence_end_pattern.split(text.strip())
    for sent in sentences:
        if sent:
            final_sentences.extend(sent.strip().splitlines())

    # Restore the normalization
    return [
        self.hashed_pattern.sub(lambda m: unmask(m, norm_map), sent)
        for sent in final_sentences if sent.strip()
    ]
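
The mask, split, unmask sequence used by split can be shown in isolation with the standard-library re module. This is a simplified sketch, not the library's patterns: it protects only straight double quotes, uses ASCII-only sentence boundaries, and generates placeholders from a counter instead of hash(). The function name split_protecting_quotes is hypothetical.

```python
import re

QUOTED = re.compile(r'"[^"]*"')  # simplified: straight double quotes only
PLACEHOLDER = re.compile(r"##\d+##")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")  # ASCII-only simplification


def split_protecting_quotes(text: str) -> list[str]:
    masked: dict[str, str] = {}

    def mask(match: re.Match) -> str:
        key = f"##{len(masked)}##"  # counter-based key instead of hash()
        masked[key] = match.group()
        return key

    # 1. Hide quoted spans so punctuation inside them cannot trigger a split.
    protected = QUOTED.sub(mask, text)
    # 2. Split on sentence-ending punctuation followed by a capital letter.
    parts = SENTENCE_END.split(protected)
    # 3. Put the original quoted text back.
    return [PLACEHOLDER.sub(lambda m: masked[m.group()], p) for p in parts if p]


result = split_protecting_quotes('She said "Stop. Now!" and left. He stayed.')
print(result)  # ['She said "Stop. Now!" and left.', 'He stayed.']
```

Without the masking step, the period inside the quotation would be treated as a sentence boundary and the first sentence would be cut in two.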

deprecated_callable

deprecated_callable(
    use_instead: str, deprecated_in: str, removed_in: str
) -> Callable

Decorate a function or class with a deprecation warning message.

This decorator marks a function or class as deprecated.

Parameters:

use_instead
(str) –

Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
deprecated_in
(str) –

Version when the function was deprecated (e.g., "2.2.0").
removed_in
(str) –

Version when the function will be removed (e.g., "3.0.0").

Returns:

Callable –

Decorator function that wraps the source function/class.

Source code in src/chunklet/common/deprecation.py

def deprecated_callable(
    use_instead: str,
    deprecated_in: str,
    removed_in: str,
) -> Callable:
    """Decorate a function or class with warning message.

    This decorator marks a function or class as deprecated.

    Args:
        use_instead: Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
        deprecated_in: Version when the function was deprecated (e.g., "2.2.0").
        removed_in: Version when the function will be removed (e.g., "3.0.0").

    Returns:
        Decorator function that wraps the source function/class.
    """

    def decorator(func_or_cls: Callable) -> Callable:
        warn_message = (
            f"`{func_or_cls.__qualname__}` was deprecated since v{deprecated_in} "
            f"in favor of `{use_instead}`. It will be removed in v{removed_in}."
        )
        remove_message = (
            f"`{func_or_cls.__qualname__}` was removed in v{removed_in}. "
            f"Use `{use_instead}` instead."
        )

        @functools.wraps(func_or_cls)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if Version(CURRENT_VERSION) >= Version(removed_in):
                raise AttributeError(remove_message)
            warnings.warn(warn_message, FutureWarning, stacklevel=2)
            return func_or_cls(*args, **kwargs)

        return wrapper

    return decorator
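
A reduced version of this decorator shows how the warning reaches callers. The sketch below is an assumption-laden simplification: the `deprecated` function is a hypothetical stand-in that always warns and omits the version check against CURRENT_VERSION.

```python
import functools
import warnings


def deprecated(use_instead: str, deprecated_in: str, removed_in: str):
    """Simplified stand-in: always warns, never checks the installed version."""

    def decorator(func):
        message = (
            f"`{func.__qualname__}` was deprecated since v{deprecated_in} "
            f"in favor of `{use_instead}`. It will be removed in v{removed_in}."
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn(message, FutureWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated(use_instead="split_text", deprecated_in="2.2.0", removed_in="3.0.0")
def split(text: str) -> list[str]:
    return text.split(". ")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    output = split("One. Two")

print(output)                       # ['One', 'Two']
print(caught[0].category.__name__)  # FutureWarning
```

FutureWarning is shown by default to end users, unlike DeprecationWarning, which Python hides outside of `__main__` unless filters are changed.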

log_info

log_info(verbose: bool, *args, **kwargs) -> None

Log an info message if verbose is enabled.

This is a convenience function that only logs when verbose mode is enabled, avoiding unnecessary log output in production.

Parameters:

verbose
(bool) –

If True, logs the message; if False, does nothing.
*args
–

Positional arguments passed to logger.info().
**kwargs
–

Keyword arguments passed to logger.info().

Example

>>> log_info(True, "Processing file: {}", filepath)
Processing file: /path/to/file
>>> log_info(False, "This will not be logged")
(no output)

Source code in src/chunklet/common/logging_utils.py

def log_info(verbose: bool, *args, **kwargs) -> None:
    """Log an info message if verbose is enabled.

    This is a convenience function that only logs when verbose mode is enabled,
    avoiding unnecessary log output in production.

    Args:
        verbose: If True, logs the message; if False, does nothing.
        *args: Positional arguments passed to logger.info().
        **kwargs: Keyword arguments passed to logger.info().

    Example:
        >>> log_info(True, "Processing file: {}", filepath)
        Processing file: /path/to/file
        >>> log_info(False, "This will not be logged")
        (no output)
    """
    if verbose:
        logger.info(*args, **kwargs)

pretty_errors

pretty_errors(error: ValidationError) -> str

Formats Pydantic validation errors into a human-readable string.

Source code in src/chunklet/common/validation.py

def pretty_errors(error: ValidationError) -> str:
    """Formats Pydantic validation errors into a human-readable string."""
    lines = [
        f"{error.error_count()} validation error for {getattr(error, 'subtitle', '') or error.title}."
    ]
    for ind, err in enumerate(error.errors(), start=1):
        msg = err["msg"]

        loc = err.get("loc", [])
        formatted_loc = ""
        if len(loc) >= 1:
            formatted_loc = str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
            formatted_loc = f"({formatted_loc})" if formatted_loc else ""

        input_value = err["input"]
        input_type = type(input_value).__name__

        # Use reprlib for auto-truncation on non-strings (faster for lists/dicts/nested)
        if not isinstance(input_value, str):
            input_value = reprlib.repr(input_value)
        else:
            input_value = input_value if len(input_value) < 500 else input_value[:500] + "..."

        lines.append(
            (
                f"{ind}) {formatted_loc} {msg}.\n"
                f"  Found: (input={input_value!r}, type={input_type})"
            )
        )

    lines.append("  " + getattr(error, "hint", ""))
    return "\n".join(lines)

read_text_file

read_text_file(path: str | Path) -> str

Read text file with automatic encoding detection.

Parameters:

path
(str | Path) –

File path to read.

Returns:

str –

File content.

Raises:

FileProcessingError –

If file cannot be read.

Source code in src/chunklet/common/path_utils.py

@validate_input
def read_text_file(path: str | Path) -> str:
    """Read text file with automatic encoding detection.

    Args:
        path: File path to read.

    Returns:
        File content.

    Raises:
        FileProcessingError: If file cannot be read.
    """
    from charset_normalizer import from_path

    path = Path(path)

    if not path.exists():
        raise FileProcessingError(f"File does not exist: {path}")

    if _is_binary_file(path):
        raise FileProcessingError(f"Binary file not supported: {path}")

    match = from_path(str(path)).best()
    return str(match) if match else ""

validate_input

validate_input(fn)

A decorator that validates function inputs and outputs.

A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.

Source code in src/chunklet/common/validation.py

def validate_input(fn):
    """
    A decorator that validates function inputs and outputs.

    A wrapper around Pydantic's `validate_call` that catches `ValidationError` and re-raises it as a more user-friendly `InvalidInputError`.
    """
    validated_fn = validate_call(fn, config=ConfigDict(arbitrary_types_allowed=True))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return validated_fn(*args, **kwargs)
        except ValidationError as e:
            raise InvalidInputError(pretty_errors(e)) from None

    return wrapper
