
chunklet.sentence_splitter

Functions:

  • deprecated_callable

    Decorate a function or class with a deprecation warning.

  • log_info

    Log an info message if verbose is enabled.

  • pretty_errors

    Formats Pydantic validation errors into a human-readable string.

  • read_text_file

    Read text file with automatic encoding detection.

  • validate_input

    A decorator that validates function inputs and outputs.

BaseSplitter

Base class for sentence splitting. Defines the interface that all splitter implementations must adhere to.

Methods:

  • split

    Split text into sentences.

  • split_text

    Splits the given text into a list of sentences.

split

split(text: str, lang: str = 'auto') -> list[str]

Split text into sentences.

Note

Deprecated since 2.2.0. Will be removed in 3.0.0. Use split_text instead.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@deprecated_callable(
    use_instead="split_text", deprecated_in="2.2.0", removed_in="3.0.0"
)
def split(self, text: str, lang: str = "auto") -> list[str]:  # pragma: no cover
    """
    Split text into sentences.

    Note:
        Deprecated since 2.2.0. Will be removed in 3.0.0. Use `split_text` instead.
    """
    return self.split_text(text, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits the given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto').

Returns:

  • list[str]

    list[str]: A list of sentences extracted from the text.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """Splits the given text into a list of sentences.

    Args:
        text (str): The input text to be split.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto').

    Returns:
        list[str]: A list of sentences extracted from the text.
    """
    raise NotImplementedError("Subclasses must implement 'split_text'.")
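
As a hedged illustration, a subclass only needs to implement `split_text`. The sketch below mirrors the interface without importing chunklet; the class name and the regex rule are illustrative, not part of the library:

```python
import re

class NaiveSplitter:
    """Hypothetical subclass implementing the BaseSplitter interface
    (self-contained stand-in; does not import chunklet)."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        # Naive rule: split after ., ! or ? when followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]

print(NaiveSplitter().split_text("Hello world. How are you?"))
# ['Hello world.', 'How are you?']
```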

CallbackError

Bases: ChunkletError

Raised when a callback function provided to chunker or splitter fails during execution.

CustomSplitterRegistry

Methods:

  • clear

    Clears all registered splitters from the registry.

  • is_registered

    Check if a splitter is registered for the given language.

  • register

    Register a splitter callback for one or more languages.

  • split

    Processes a text using a splitter registered for the given language.

  • unregister

    Remove splitter(s) from the registry.

Attributes:

  • splitters

    Returns a shallow copy of the dictionary of registered splitters.

splitters property

splitters

Returns a shallow copy of the dictionary of registered splitters.

This prevents external modification of the internal registry state.

clear

clear() -> None

Clears all registered splitters from the registry.

Source code in src/chunklet/sentence_splitter/registry.py
def clear(self) -> None:
    """
    Clears all registered splitters from the registry.
    """
    self._splitters.clear()

is_registered

is_registered(lang: str) -> bool

Check if a splitter is registered for the given language.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def is_registered(self, lang: str) -> bool:
    """
    Check if a splitter is registered for the given language.
    """
    return lang in self._splitters

register

register(*args: Any, name: str | None = None)

Register a splitter callback for one or more languages.

This method can be used in two ways:

  1. As a decorator:

         @registry.register("en", "fr", name="my_splitter")
         def my_splitter(text): ...

  2. As a direct function call:

         registry.register(my_splitter, "en", "fr", name="my_splitter")

Parameters:

  • *args

    (Any, default: () ) –

    The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.

  • name

    (str, default: None ) –

    The name of the splitter. If None, attempts to use the callback's name.

Source code in src/chunklet/sentence_splitter/registry.py
def register(self, *args: Any, name: str | None = None):
    """
    Register a splitter callback for one or more languages.

    This method can be used in two ways:

    1. As a decorator:
        @registry.register("en", "fr", name="my_splitter")
        def my_splitter(text):
            ...

    2. As a direct function call:
        registry.register(my_splitter, "en", "fr", name="my_splitter")

    Args:
        *args: The arguments, which can be either (lang1, lang2, ...) for a decorator
               or (callback, lang1, lang2, ...) for a direct call.
        name (str, optional): The name of the splitter. If None, attempts to use the callback's name.
    """
    if not args:
        raise ValueError("At least one language or a callback must be provided.")

    if callable(args[0]):
        # Direct call: register(callback, lang1, lang2, ...)
        callback = args[0]
        langs = args[1:]
        if not langs:
            raise ValueError(
                "At least one language must be provided for the callback."
            )
        self._register_logic(langs, callback, name)
        return callback
    else:
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator

split

split(text: str, lang: str) -> tuple[list[str], str]

Processes a text using a splitter registered for the given language.

Parameters:

  • text

    (str) –

    The text to split.

  • lang

    (str) –

    The language of the text.

Returns:

  • tuple[list[str], str]

    tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.

Raises:

  • CallbackError

    If the splitter callback fails.

  • TypeError

    If the splitter returns the wrong type.

Examples:

>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def split(self, text: str, lang: str) -> tuple[list[str], str]:
    """
    Processes a text using a splitter registered for the given language.

    Args:
        text (str): The text to split.
        lang (str): The language of the text.

    Returns:
        tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.

    Raises:
        CallbackError: If the splitter callback fails.
        TypeError: If the splitter returns the wrong type.

    Examples:
        >>> from chunklet.sentence_splitter import CustomSplitterRegistry
        >>> registry = CustomSplitterRegistry()
        >>> @registry.register("xx", name="custom_splitter")
        ... def custom_splitter(text: str) -> list[str]:
        ...     return text.split(" ")
        >>> registry.split("Hello World", "xx")
        (['Hello', 'World'], 'custom_splitter')
    """
    splitter_info = self._splitters.get(lang)
    if not splitter_info:
        raise CallbackError(
            f"No splitter registered for language '{lang}'.\n"
            f"💡Hint: Use `registry.register(your_function, '{lang}')` first."
        )

    name, callback = splitter_info

    try:
        # Validate the return type
        result = callback(text)
        validator = TypeAdapter(list[str])
        validator.validate_python(result)
    except ValidationError as e:
        e.subtitle = f"{name} result"
        e.hint = "💡Hint: Make sure your splitter returns a list of strings."
        raise CallbackError(f"{pretty_errors(e)}.\n") from None
    except Exception as e:
        raise CallbackError(
            f"Splitter '{name}' for lang '{lang}' raised an exception.\nDetails: {e}"
        ) from None

    return result, name

unregister

unregister(*langs: str) -> None

Remove splitter(s) from the registry.

Parameters:

  • *langs

    (str, default: () ) –

    Language codes to remove.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def unregister(self, *langs: str) -> None:
    """
    Remove splitter(s) from the registry.

    Args:
        *langs: Language codes to remove
    """
    for lang in langs:
        self._splitters.pop(lang, None)

FallbackSplitter

FallbackSplitter()

Rule-based, language-agnostic sentence boundary detector.

A rule-based sentence boundary detection tool that doesn't rely on hardcoded lists of abbreviations or sentence terminators, making it adaptable to various text formats and domains.

FallbackSplitter uses regex patterns to split text into sentences, handling:

  • Common sentence-ending punctuation (., !, ?)
  • Abbreviations and acronyms (e.g., Dr., Ph.D., U.S.)
  • Numbered lists and headings
  • Multi-punctuation sequences (e.g., !!!, ?!)
  • Line breaks and whitespace normalization
  • Decimal numbers and inline numbers

Sentences are conservatively segmented, prioritizing context over aggressive splitting, which reduces false splits inside abbreviations, multi-punctuation sequences, or numeric constructs.
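
The conservative approach can be illustrated with a minimal, hedged sketch using plain `re` (far simpler than the Unicode-aware patterns in the real class): requiring an uppercase letter after the terminator avoids false splits at lowercase-following abbreviations and decimal numbers.

```python
import re

# Hedged, minimal illustration of conservative splitting (not the real
# FallbackSplitter patterns): only split after ., ! or ? when the next
# non-space character is uppercase, so "p.m. yesterday" and "2.3"
# survive intact.
pattern = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = "The meeting ran until 5 p.m. yesterday. It went well! See section 2.3 for details."
print(pattern.split(text))
# ['The meeting ran until 5 p.m. yesterday.', 'It went well!',
#  'See section 2.3 for details.']
```

Note this naive rule would still split after an abbreviation followed by a capitalized name (e.g., "Dr. Smith"); the real splitter's abbreviation lookbehind exists to handle those cases.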

Initializes regex patterns for sentence splitting.

Methods:

  • split

    Splits text into sentences using rule-based regex patterns.

Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
def __init__(self):
    """Initializes regex patterns for sentence splitting."""
    self.sentence_terminators = "".join(GLOBAL_SENTENCE_TERMINATORS)

    # Patterns for handling numbered lists
    self.flattened_numbered_list_pattern = re.compile(
        rf"(?<=[{self.sentence_terminators}:])\s+(\p{{N}}\.)+"
    )

    self.numbered_list_pattern = re.compile(r"([\n:]\s*)(\p{N})\.")
    self.norm_numbered_list_pattern = re.compile(r"(\s*)(\p{N})<DOT>")

    # Core sentence split regex
    self.sentence_end_pattern = re.compile(
        rf"""
        (?<!\b(\p{{Lu}}\p{{Ll}}{{1,5}}\.)*)     # negative lookbehind for abbreviations
        (?<=[{self.sentence_terminators}]        # sentence-ending punctuation
        [\"'》」\p{{pf}}\p{{pe}}]*)                  # optional quotes or closing chars
        (?=\s+\p{{Lu}}|\s*\n|\s*$)               # followed by uppercase or end of text
        """,
        re.VERBOSE | re.UNICODE,
    )

split

split(text: str) -> list[str]

Splits text into sentences using rule-based regex patterns.

Parameters:

  • text

    (str) –

    The input text to be segmented into sentences.

Returns:

  • list[str]

    list[str]: A list of sentences after segmentation.

Notes
  • Normalizes numbered lists during splitting and restores them afterward.
  • Handles punctuation, newlines, and common edge cases.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
def split(self, text: str) -> list[str]:
    """
    Splits text into sentences using rule-based regex patterns.

    Args:
        text (str): The input text to be segmented into sentences.

    Returns:
        list[str]: A list of sentences after segmentation.

    Notes:
        - Normalizes numbered lists during splitting and restores them afterward.
        - Handles punctuation, newlines, and common edge cases.
    """
    # Stage 1: handle flattened numbered lists
    text = self.flattened_numbered_list_pattern.sub(r"\n \1", text.strip())

    # Stage 2: normalize numbered lists
    text = self.numbered_list_pattern.sub(r"\1\2<DOT>", text.strip())

    # Stage 3: first pass - punctuation-based split
    sentences = self.sentence_end_pattern.split(text.strip())

    # Stage 4: remove empty strings and strip whitespace
    fixed_sentences = [s.strip() for s in sentences if s and s.strip()]

    # Stage 5: second pass - split further on newline (if not at start)
    final_sentences = []
    for sent in fixed_sentences:
        final_sentences.extend(sent.splitlines())

    # Stage 6: restore dots in numbered list numbers
    return [
        self.norm_numbered_list_pattern.sub(r"\1\2.", sent).rstrip()
        for sent in final_sentences
        if sent.strip()
    ]

SentenceSplitter

SentenceSplitter(verbose: bool = False)

Bases: BaseSplitter

A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

Key Features:

  • Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
  • Custom Splitters: Uses a centralized registry for custom splitting logic.
  • Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
  • Robust Error Handling: Provides clear error reporting for issues with custom splitters.
  • Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.

Initializes the SentenceSplitter.

Parameters:

  • verbose

    (bool, default: False ) –

    If True, enables verbose logging for debugging and informational messages.

Methods:

  • detected_top_language

    Detects the top language of the given text using py3langid.

  • split_file

    Read and split a file into sentences.

  • split_text

    Splits a given text into a list of sentences.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def __init__(self, verbose: bool = False):
    """
    Initializes the SentenceSplitter.

    Args:
        verbose (bool, optional): If True, enables verbose logging for debugging and informational messages.
    """
    self.verbose = verbose
    self.fallback_splitter = FallbackSplitter()

    # Create a normalized identifier for language detection
    self.identifier = LanguageIdentifier.from_pickled_model(
        MODEL_FILE, norm_probs=True
    )

detected_top_language

detected_top_language(text: str) -> tuple[str, float]

Detects the top language of the given text using py3langid.

Parameters:

  • text

    (str) –

    The input text to detect the language for.

Returns:

  • tuple[str, float]

    tuple[str, float]: A tuple containing the detected language code and its confidence.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def detected_top_language(self, text: str) -> tuple[str, float]:
    """
    Detects the top language of the given text using py3langid.

    Args:
        text (str): The input text to detect the language for.

    Returns:
        tuple[str, float]: A tuple containing the detected language code and its confidence.
    """
    lang_detected, confidence = self.identifier.classify(text)
    log_info(
        self.verbose,
        "Language detection: '{}' with confidence {}.",
        lang_detected,
        f"{round(confidence * 10)}/10",
    )
    return lang_detected, confidence

split_file

split_file(
    path: str | Path, lang: str = "auto"
) -> list[str]

Read and split a file into sentences.

Parameters:

  • path

    (str | Path) –

    Path to the file to read.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

Returns:

  • list[str]

    list[str]: A list of sentences extracted from the file.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
def split_file(self, path: str | Path, lang: str = "auto") -> list[str]:
    """
    Read and split a file into sentences.

    Args:
        path: Path to the file to read.
        lang: The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

    Returns:
        list[str]: A list of sentences extracted from the file.
    """
    content = read_text_file(path)
    return self.split_text(content, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits a given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

Returns:

  • list[str]

    list[str]: A list of sentences.

Examples:

>>> splitter = SentenceSplitter()
>>> splitter.split_text("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split_text("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """
    Splits a given text into a list of sentences.

    Args:
        text (str): The input text to be split.
        lang (str, optional): The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

    Returns:
        list[str]: A list of sentences.

    Examples:
        >>> splitter = SentenceSplitter()
        >>> splitter.split_text("Hello world. How are you?", "en")
        ['Hello world.', 'How are you?']
        >>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
        ['Bonjour le monde.', 'Comment allez-vous?']
        >>> splitter.split_text("Hello world. How are you?", "auto")
        ['Hello world.', 'How are you?']
    """
    if not text:
        log_info(self.verbose, "Input text is empty. Returning empty list.")
        return []

    if lang == "auto":
        logger.warning(
            "The language is set to `auto`. Consider setting the `lang` parameter to a specific language to improve reliability."
        )
        lang_detected, confidence = self.detected_top_language(text)
        lang = lang_detected if confidence >= 0.7 else lang

    # Prioritize custom splitters from registry
    if custom_splitter_registry.is_registered(lang):
        sentences, splitter_name = custom_splitter_registry.split(text, lang)
        log_info(self.verbose, "Using registered splitter: {}", splitter_name)
    else:
        sentences = None
        for lang_set, handler in self.LANGUAGE_HANDLERS.items():
            if lang in lang_set:
                sentences = handler(lang, text)
                break

        # If no handler found, use fallback
        if sentences is None:
            logger.warning(
                "Using a universal rule-based splitter.\n"
                "Reason: Language not supported or detected with low confidence."
            )
            sentences = self.fallback_splitter.split(text)

    processed_sentences = self._filter_sentences(sentences)

    log_info(
        self.verbose,
        "Text split into sentences. Total sentences detected: {}",
        len(processed_sentences),
    )

    return processed_sentences
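
The dispatch order above (custom registry first, then language handlers, then the universal fallback) can be sketched as a small chain; the names here are illustrative stand-ins, not chunklet's internals:

```python
from typing import Callable

def dispatch(text: str, lang: str,
             registry: dict[str, Callable],
             handlers: dict[frozenset, Callable],
             fallback: Callable) -> list[str]:
    if lang in registry:                        # 1. custom splitter wins
        return registry[lang](text)
    for lang_set, handler in handlers.items():  # 2. then a language handler
        if lang in lang_set:
            return handler(text)
    return fallback(text)                       # 3. else the rule-based fallback

naive = lambda t: t.split(". ")
print(dispatch("A. B", "xx", {}, {}, naive))  # ['A', 'B']
```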

deprecated_callable

deprecated_callable(
    use_instead: str, deprecated_in: str, removed_in: str
) -> Callable

Decorate a function or class with a deprecation warning.

This decorator marks a function or class as deprecated.

Parameters:

  • use_instead

    (str) –

    Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").

  • deprecated_in

    (str) –

    Version when the function was deprecated (e.g., "2.2.0").

  • removed_in

    (str) –

    Version when the function will be removed (e.g., "3.0.0").

Returns:

  • Callable ( Callable ) –

    Decorator function that wraps the source function/class.

Source code in src/chunklet/common/deprecation.py
def deprecated_callable(
    use_instead: str,
    deprecated_in: str,
    removed_in: str,
) -> Callable:
    """Decorate a function or class with a deprecation warning.

    This decorator marks a function or class as deprecated.

    Args:
        use_instead (str): Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
        deprecated_in (str): Version when the function was deprecated (e.g., "2.2.0").
        removed_in (str): Version when the function will be removed (e.g., "3.0.0").

    Returns:
        Callable: Decorator function that wraps the source function/class.
    """

    def decorator(func_or_cls: Callable) -> Callable:
        warn_message = (
            f"`{func_or_cls.__qualname__}` was deprecated since v{deprecated_in} "
            f"in favor of `{use_instead}`. It will be removed in v{removed_in}."
        )
        remove_message = (
            f"`{func_or_cls.__qualname__}` was removed in v{removed_in}. "
            f"Use `{use_instead}` instead."
        )

        @functools.wraps(func_or_cls)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if Version(CURRENT_VERSION) >= Version(removed_in):
                raise AttributeError(remove_message)
            warnings.warn(warn_message, FutureWarning, stacklevel=2)
            return func_or_cls(*args, **kwargs)

        return wrapper

    return decorator
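
A minimal stdlib sketch of this version-gated pattern, assuming tuple comparison in place of `packaging.version.Version` and a hypothetical `CURRENT_VERSION` constant:

```python
import functools
import warnings

# Simplified stand-in for the library's version constant.
CURRENT_VERSION = (2, 2, 0)

def deprecated_callable(use_instead: str, removed_in: tuple):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if CURRENT_VERSION >= removed_in:
                # Past the removal version: fail hard.
                raise AttributeError(f"Removed; use `{use_instead}`.")
            warnings.warn(
                f"`{func.__qualname__}` is deprecated; use `{use_instead}`.",
                FutureWarning, stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated_callable("split_text", removed_in=(3, 0, 0))
def split(text):
    return [text]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = split("hi")

print(result)                               # ['hi']
print(caught[0].category is FutureWarning)  # True
```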

log_info

log_info(verbose: bool, *args, **kwargs) -> None

Log an info message if verbose is enabled.

This is a convenience function that only logs when verbose mode is enabled, avoiding unnecessary log output in production.

Parameters:

  • verbose

    (bool) –

    If True, logs the message; if False, does nothing.

  • *args

    Positional arguments passed to logger.info().

  • **kwargs

    Keyword arguments passed to logger.info().

Example

>>> log_info(True, "Processing file: {}", filepath)
Processing file: /path/to/file
>>> log_info(False, "This will not be logged")
(no output)

Source code in src/chunklet/common/logging_utils.py
def log_info(verbose: bool, *args, **kwargs) -> None:
    """Log an info message if verbose is enabled.

    This is a convenience function that only logs when verbose mode is enabled,
    avoiding unnecessary log output in production.

    Args:
        verbose: If True, logs the message; if False, does nothing.
        *args: Positional arguments passed to logger.info().
        **kwargs: Keyword arguments passed to logger.info().

    Example:
        >>> log_info(True, "Processing file: {}", filepath)
        Processing file: /path/to/file
        >>> log_info(False, "This will not be logged")
        (no output)
    """
    if verbose:
        logger.info(*args, **kwargs)

pretty_errors

pretty_errors(error: ValidationError) -> str

Formats Pydantic validation errors into a human-readable string.

Source code in src/chunklet/common/validation.py
def pretty_errors(error: ValidationError) -> str:
    """Formats Pydantic validation errors into a human-readable string."""
    lines = [
        f"{error.error_count()} validation error for {getattr(error, 'subtitle', '') or error.title}."
    ]
    for ind, err in enumerate(error.errors(), start=1):
        msg = err["msg"]

        loc = err.get("loc", [])
        formatted_loc = ""
        if len(loc) >= 1:
            formatted_loc = str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
            formatted_loc = f"({formatted_loc})" if formatted_loc else ""

        input_value = err["input"]
        input_type = type(input_value).__name__

        # Sliced to avoid overflowing screen
        input_value = (
            input_value
            if len(str(input_value)) < 500
            else str(input_value)[:500] + "..."
        )

        lines.append(
            (
                f"{ind}) {formatted_loc} {msg}.\n"
                f"  Found: (input={input_value!r}, type={input_type})"
            )
        )

    lines.append("  " + getattr(error, "hint", ""))
    return "\n".join(lines)

read_text_file

read_text_file(path: str | Path) -> str

Read text file with automatic encoding detection.

Parameters:

  • path

    (str | Path) –

    File path to read.

Returns:

  • str ( str ) –

    File content.

Raises:

  • FileProcessingError

    If the file cannot be read (missing or binary).

Source code in src/chunklet/common/path_utils.py
@validate_input
def read_text_file(path: str | Path) -> str:
    """Read text file with automatic encoding detection.

    Args:
        path: File path to read.

    Returns:
        str: File content.

    Raises:
        FileProcessingError: If file cannot be read.
    """
    from charset_normalizer import from_path

    path = Path(path)

    if not path.exists():
        raise FileProcessingError(f"File does not exist: {path}")

    if _is_binary_file(path):
        raise FileProcessingError(f"Binary file not supported: {path}")

    match = from_path(str(path)).best()
    return str(match) if match else ""
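
Since the real implementation depends on charset-normalizer, here is a hedged stdlib-only approximation of the same idea, trying encodings in order (`read_text_guessing` is an illustrative name, not part of chunklet):

```python
import os
import tempfile
from pathlib import Path

def read_text_guessing(path, encodings=("utf-8", "latin-1")):
    """Stdlib-only approximation of encoding detection: try a fixed
    list of encodings in order (latin-1 accepts any byte string, so it
    acts as a last resort). The real read_text_file relies on
    charset-normalizer's statistical detection instead."""
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")

# Usage: a UTF-8 file round-trips cleanly.
fd, tmp = tempfile.mkstemp()
os.close(fd)
Path(tmp).write_text("café", encoding="utf-8")
print(read_text_guessing(tmp))  # café
os.remove(tmp)
```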

validate_input

validate_input(fn)

A decorator that validates function inputs and outputs.

A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.

Source code in src/chunklet/common/validation.py
def validate_input(fn):
    """
    A decorator that validates function inputs and outputs.

    A wrapper around Pydantic's `validate_call` that catches `ValidationError` and re-raises it as a more user-friendly `InvalidInputError`.
    """
    validated_fn = validate_call(fn, config=ConfigDict(arbitrary_types_allowed=True))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return validated_fn(*args, **kwargs)
        except ValidationError as e:
            raise InvalidInputError(pretty_errors(e)) from None

    return wrapper
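
The wrap-and-rethrow pattern can be sketched without Pydantic: validate by hand and surface failures as a single friendly error type. Names here are illustrative; the real decorator delegates the validation itself to `validate_call`:

```python
import functools

class InvalidInputError(ValueError):
    """Illustrative stand-in for chunklet's InvalidInputError."""

def validate_positive(fn):
    """Sketch of the wrap-and-rethrow pattern: check inputs, and
    re-raise failures as one user-friendly error type."""
    @functools.wraps(fn)
    def wrapper(n, *args, **kwargs):
        if not isinstance(n, int) or n <= 0:
            raise InvalidInputError(f"n must be a positive int, got {n!r}")
        return fn(n, *args, **kwargs)
    return wrapper

@validate_positive
def first_n_ids(n):
    return list(range(n))

print(first_n_ids(3))  # [0, 1, 2]
```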