chunklet.sentence_splitter

Functions:

  • pretty_errors

    Formats Pydantic validation errors into a human-readable string.

  • validate_input

    A decorator that validates function inputs and outputs.

BaseSplitter

Bases: ABC

Abstract base class for sentence splitting. Defines the interface that all sentence splitter implementations must adhere to.

Methods:

  • split

    Splits the given text into a list of sentences.

split abstractmethod

split(text: str, lang: str) -> list[str]

Splits the given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str) –

    The language of the text (e.g., 'en', 'fr', 'auto').

Returns:

  • list[str]

    list[str]: A list of sentences extracted from the text.

Examples:

>>> class MySplitter(BaseSplitter):
...     def split(self, text: str, lang: str) -> list[str]:
...         return text.split(".")
>>> splitter = MySplitter()
>>> splitter.split("Hello. World.", "en")
['Hello', ' World']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@abstractmethod
def split(self, text: str, lang: str) -> list[str]:
    """
    Splits the given text into a list of sentences.

    Args:
        text (str): The input text to be split.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto').

    Returns:
        list[str]: A list of sentences extracted from the text.

    Examples:
        >>> class MySplitter(BaseSplitter):
        ...     def split(self, text: str, lang: str) -> list[str]:
        ...         return text.split(".")
        >>> splitter = MySplitter()
        >>> splitter.split("Hello. World.", "en")
        ['Hello', ' World']
    """
    pass

CallbackError

Bases: ChunkletError

Raised when a callback function provided to chunker or splitter fails during execution.
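The exception exists so that any failure inside user-supplied code surfaces as one predictable type. A minimal stdlib-only sketch of that wrap-and-re-raise pattern (the class names mirror chunklet's, but these are stand-ins defined locally, not imports):

```python
class ChunkletError(Exception):
    """Stand-in for chunklet's base exception."""

class CallbackError(ChunkletError):
    """Raised when a user-provided callback fails during execution."""

def run_callback(callback, text: str):
    try:
        return callback(text)
    except Exception as e:
        # Preserve the original failure message for easier debugging.
        raise CallbackError(f"Callback {callback.__name__!r} failed: {e}") from None
```

Callers can then catch `CallbackError` alone instead of guessing which exception a third-party splitter might raise.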

CustomSplitterRegistry

Methods:

  • clear

    Clears all registered splitters from the registry.

  • is_registered

    Check if a splitter is registered for the given language.

  • register

    Register a splitter callback for one or more languages.

  • split

    Processes a text using a splitter registered for the given language.

  • unregister

    Remove splitter(s) from the registry.

Attributes:

  • splitters

    Returns a shallow copy of the dictionary of registered splitters.

splitters property

splitters

Returns a shallow copy of the dictionary of registered splitters.

This prevents external modification of the internal registry state.

clear

clear() -> None

Clears all registered splitters from the registry.

Source code in src/chunklet/sentence_splitter/registry.py
def clear(self) -> None:
    """
    Clears all registered splitters from the registry.
    """
    self._splitters.clear()

is_registered

is_registered(lang: str) -> bool

Check if a splitter is registered for the given language.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def is_registered(self, lang: str) -> bool:
    """
    Check if a splitter is registered for the given language.
    """
    return lang in self._splitters

register

register(*args: Any, name: str | None = None)

Register a splitter callback for one or more languages.

This method can be used in two ways:

  1. As a decorator:

     @registry.register("en", "fr", name="my_splitter")
     def my_splitter(text): ...

  2. As a direct function call:

     registry.register(my_splitter, "en", "fr", name="my_splitter")

Parameters:

  • *args

    (Any, default: () ) –

    The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.

  • name

    (str | None, default: None) –

    The name of the splitter. If None, attempts to use the callback's name.

Source code in src/chunklet/sentence_splitter/registry.py
def register(self, *args: Any, name: str | None = None):
    """
    Register a splitter callback for one or more languages.

    This method can be used in two ways:
    1. As a decorator:
        @registry.register("en", "fr", name="my_splitter")
        def my_splitter(text):
            ...

    2. As a direct function call:
        registry.register(my_splitter, "en", "fr", name="my_splitter")

    Args:
        *args: The arguments, which can be either (lang1, lang2, ...) for a decorator
               or (callback, lang1, lang2, ...) for a direct call.
        name (str, optional): The name of the splitter. If None, attempts to use the callback's name.
    """
    if not args:
        raise ValueError("At least one language or a callback must be provided.")

    if callable(args[0]):
        # Direct call: register(callback, lang1, lang2, ...)
        callback = args[0]
        langs = args[1:]
        if not langs:
            raise ValueError(
                "At least one language must be provided for the callback."
            )
        self._register_logic(langs, callback, name)
        return callback
    else:
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator

split

split(text: str, lang: str) -> tuple[list[str], str]

Processes a text using a splitter registered for the given language.

Parameters:

  • text

    (str) –

    The text to split.

  • lang

    (str) –

    The language of the text.

Returns:

  • tuple[list[str], str]

    tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.

Raises:

  • CallbackError

    If the splitter callback fails.

  • TypeError

    If the splitter returns the wrong type.

Examples:

>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def split(self, text: str, lang: str) -> tuple[list[str], str]:
    """
    Processes a text using a splitter registered for the given language.

    Args:
        text (str): The text to split.
        lang (str): The language of the text.

    Returns:
        tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.

    Raises:
        CallbackError: If the splitter callback fails.
        TypeError: If the splitter returns the wrong type.

    Examples:
        >>> from chunklet.sentence_splitter import CustomSplitterRegistry
        >>> registry = CustomSplitterRegistry()
        >>> @registry.register("xx", name="custom_splitter")
        ... def custom_splitter(text: str) -> list[str]:
        ...     return text.split(" ")
        >>> registry.split("Hello World", "xx")
        (['Hello', 'World'], 'custom_splitter')
    """
    splitter_info = self._splitters.get(lang)
    if not splitter_info:
        raise CallbackError(
            f"No splitter registered for language '{lang}'.\n"
            f"💡Hint: Register one with `.register(your_function, '{lang}')` first."
        )

    name, callback = splitter_info

    try:
        # Validate the return type
        result = callback(text)
        validator = TypeAdapter(list[str])
        validator.validate_python(result)
    except ValidationError as e:
        e.subtitle = f"{name} result"
        e.hint = "💡Hint: Make sure your splitter returns a list of strings."
        raise CallbackError(f"{pretty_errors(e)}.\n") from None
    except Exception as e:
        raise CallbackError(
            f"Splitter '{name}' for lang '{lang}' raised an exception.\nDetails: {e}"
        ) from None

    return result, name
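The return-type check above relies on Pydantic's `TypeAdapter(list[str])`. A stdlib-only stand-in for the same guard (hypothetical helper name, shown only to make the contract concrete):

```python
def validate_sentences(result) -> list[str]:
    # Reject anything that is not a list of strings before it propagates.
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise TypeError(
            f"Splitter must return a list of strings, got {type(result).__name__}."
        )
    return result
```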

unregister

unregister(*langs: str) -> None

Remove splitter(s) from the registry.

Parameters:

  • *langs

    (str, default: () ) –

    Language codes to remove

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def unregister(self, *langs: str) -> None:
    """
    Remove splitter(s) from the registry.

    Args:
        *langs: Language codes to remove
    """
    for lang in langs:
        self._splitters.pop(lang, None)

FallbackSplitter

FallbackSplitter()

Rule-based, language-agnostic sentence boundary detector.

A rule-based sentence boundary detection tool that doesn't rely on hardcoded lists of abbreviations or sentence terminators, making it adaptable to various text formats and domains.

FallbackSplitter uses regex patterns to split text into sentences, handling:

  • Common sentence-ending punctuation (., !, ?)
  • Abbreviations and acronyms (e.g., Dr., Ph.D., U.S.)
  • Numbered lists and headings
  • Multi-punctuation sequences (e.g., ! ! !, ?!)
  • Line breaks and whitespace normalization
  • Decimal numbers and inline numbers

Sentences are conservatively segmented, prioritizing context over aggressive splitting, which reduces false splits inside abbreviations, multi-punctuation sequences, or numeric constructs.

Initializes regex patterns for sentence splitting.

Methods:

  • split

    Splits text into sentences using rule-based regex patterns.

Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
def __init__(self):
    """Initializes regex patterns for sentence splitting."""
    self.sentence_terminators = "".join(GLOBAL_SENTENCE_TERMINATORS)

    # Patterns for handling numbered lists
    self.flattened_numbered_list_pattern = re.compile(
        rf"(?<=[{self.sentence_terminators}:])\s+(\p{{N}}\.)+"
    )

    self.numbered_list_pattern = re.compile(r"([\n:]\s*)(\p{N})\.")
    self.norm_numbered_list_pattern = re.compile(r"(\s*)(\p{N})<DOT>")

    # Core sentence split regex
    self.sentence_end_pattern = re.compile(
        rf"""
        (?<!\b(\p{{Lu}}\p{{Ll}}{{1, 5}}\.)*)   # negative lookbehind for abbreviations
        (?<=[{self.sentence_terminators}]        # sentence-ending punctuation
        [\"'》」\p{{pf}}\p{{pe}}]*)                  # optional quotes or closing chars
        (?=\s+\p{{Lu}}|\s*\n|\s*$)               # followed by uppercase or end of text
        """,
        re.VERBOSE | re.UNICODE,
    )

split

split(text: str) -> List[str]

Splits text into sentences using rule-based regex patterns.

Parameters:

  • text

    (str) –

    The input text to be segmented into sentences.

Returns:

  • List[str]

    List[str]: A list of sentences after segmentation.

Notes
  • Normalizes numbered lists during splitting and restores them afterward.
  • Handles punctuation, newlines, and common edge cases.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
def split(self, text: str) -> List[str]:
    """
    Splits text into sentences using rule-based regex patterns.

    Args:
        text (str): The input text to be segmented into sentences.

    Returns:
        List[str]: A list of sentences after segmentation.

    Notes:
        - Normalizes numbered lists during splitting and restores them afterward.
        - Handles punctuation, newlines, and common edge cases.
    """
    # Stage 1: handle flattened numbered lists
    text = self.flattened_numbered_list_pattern.sub(r"\n \1", text.strip())

    # Stage 2: normalize numbered lists
    text = self.numbered_list_pattern.sub(r"\1\2<DOT>", text.strip())

    # Stage 3: first pass - punctuation-based split
    sentences = self.sentence_end_pattern.split(text.strip())

    # Stage 4: remove empty strings and strip whitespace
    fixed_sentences = [s.strip() for s in sentences if s and s.strip()]

    # Stage 5: second pass - split further on newline (if not at start)
    final_sentences = []
    for sent in fixed_sentences:
        final_sentences.extend(sent.splitlines())

    # Stage 6: restore the dots in numbered-list markers (undo the <DOT> sentinel)
    return [
        self.norm_numbered_list_pattern.sub(r"\1\2.", sent).rstrip()
        for sent in final_sentences
        if sent.strip()
    ]
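The core idea of the pattern in `__init__` can be illustrated with a heavily simplified, ASCII-only sketch (the real class uses the `regex` module with Unicode property classes, and handles far more cases): split after terminal punctuation when the next word is capitalized, but not after short abbreviations like "Dr.".

```python
import re

SENTENCE_BOUNDARY = re.compile(
    r"(?<!\b[A-Z][a-z]\.)"   # skip two-letter abbreviations such as "Dr." or "Mr."
    r"(?<=[.!?])"            # a sentence terminator just before the gap
    r"\s+(?=[A-Z])"          # whitespace followed by an uppercase letter
)

def simple_split(text: str) -> list[str]:
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
```

This toy version only guards two-letter abbreviations and Latin capitals; the production pattern above generalizes both with `\p{Lu}`/`\p{Ll}` and a wider lookbehind.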

SentenceSplitter

SentenceSplitter(verbose: bool = False)

Bases: BaseSplitter

A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

Key Features:

  • Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
  • Custom Splitters: Uses a centralized registry for custom splitting logic.
  • Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
  • Robust Error Handling: Provides clear error reporting for issues with custom splitters.
  • Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.

Initializes the SentenceSplitter.

Parameters:

  • verbose

    (bool, default: False ) –

    If True, enables verbose logging for debugging and informational messages.

Methods:

  • detected_top_language

    Detects the top language of the given text using py3langid.

  • split

    Splits a given text into a list of sentences.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def __init__(self, verbose: bool = False):
    """
    Initializes the SentenceSplitter.

    Args:
        verbose (bool, optional): If True, enables verbose logging for debugging and informational messages.
    """
    self.verbose = verbose
    self.custom_splitter_registry = CustomSplitterRegistry()
    self.fallback_splitter = FallbackSplitter()

    # Create a normalized identifier for langid
    self.identifier = LanguageIdentifier.from_pickled_model(
        MODEL_FILE, norm_probs=True
    )

detected_top_language

detected_top_language(text: str) -> tuple[str, float]

Detects the top language of the given text using py3langid.

Parameters:

  • text

    (str) –

    The input text to detect the language for.

Returns:

  • tuple[str, float]

    tuple[str, float]: A tuple containing the detected language code and its confidence.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def detected_top_language(self, text: str) -> tuple[str, float]:
    """
    Detects the top language of the given text using py3langid.

    Args:
        text (str): The input text to detect the language for.

    Returns:
        tuple[str, float]: A tuple containing the detected language code and its confidence.
    """
    lang_detected, confidence = self.identifier.classify(text)
    if self.verbose:
        logger.info(
            "Language detection: '{}' with confidence {}.",
            lang_detected,
            f"{round(confidence * 10)}/10",
        )
    return lang_detected, confidence

split

split(text: str, lang: str = 'auto') -> list[str]

Splits a given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

Returns:

  • list[str]

    list[str]: A list of sentences.

Examples:

>>> splitter = SentenceSplitter()
>>> splitter.split("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def split(self, text: str, lang: str = "auto") -> list[str]:
    """
    Splits a given text into a list of sentences.

    Args:
        text (str): The input text to be split.
        lang (str, optional): The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

    Returns:
        list[str]: A list of sentences.

    Examples:
        >>> splitter = SentenceSplitter()
        >>> splitter.split("Hello world. How are you?", "en")
        ['Hello world.', 'How are you?']
        >>> splitter.split("Bonjour le monde. Comment allez-vous?", "fr")
        ['Bonjour le monde.', 'Comment allez-vous?']
        >>> splitter.split("Hello world. How are you?", "auto")
        ['Hello world.', 'How are you?']
    """
    if not text:
        if self.verbose:
            logger.info("Input text is empty. Returning empty list.")
        return []
    sentences = []

    if lang == "auto":
        if self.verbose:
            logger.warning(
                "The language is set to `auto`. Consider setting the `lang` parameter to a specific language to improve reliability."
            )
        lang_detected, confidence = self.detected_top_language(text)
        lang = lang_detected if confidence >= 0.7 else lang

    # Prioritize custom splitters from registry
    if self.custom_splitter_registry.is_registered(lang):
        sentences, splitter_name = self.custom_splitter_registry.split(text, lang)
        if self.verbose:
            logger.info("Using registered splitter: {}", splitter_name)
    elif lang in PYSBD_SUPPORTED_LANGUAGES:
        sentences = Segmenter(language=lang).segment(text)
    elif lang in SENTSPLIT_UNIQUE_LANGUAGES:
        sentences = SentSplit(lang).segment(text)
    elif lang in INDIC_NLP_UNIQUE_LANGUAGES:
        sentences = sentence_tokenize.sentence_split(text, lang)
    elif lang in SENTENCEX_UNIQUE_LANGUAGES:
        sentences = segment(lang, text)
    else:
        if self.verbose:
            logger.warning(
                "Using a universal rule-based splitter.\n"
                "Reason: Language not supported or detected with low confidence."
            )
        sentences = self.fallback_splitter.split(text)

    # Apply post-processing filter
    processed_sentences = self._filter_sentences(sentences)

    if self.verbose:
        logger.info(
            "Text split into sentences. Total sentences detected: {}",
            len(processed_sentences),
        )

    return processed_sentences
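The auto-detection gate near the top of `split` can be summarized in a tiny sketch (`resolve_lang` is a hypothetical helper, not part of the API): the detected language is trusted only when its confidence clears the 0.7 threshold used in the source; otherwise the request stays `"auto"` and falls through to the rule-based splitter.

```python
def resolve_lang(requested: str, detected: str, confidence: float,
                 threshold: float = 0.7) -> str:
    # An explicit language always wins over detection.
    if requested != "auto":
        return requested
    return detected if confidence >= threshold else "auto"
```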

pretty_errors

pretty_errors(error: ValidationError) -> str

Formats Pydantic validation errors into a human-readable string.

Source code in src/chunklet/common/validation.py
def pretty_errors(error: ValidationError) -> str:
    """Formats Pydantic validation errors into a human-readable string."""
    lines = [
        f"{error.error_count()} validation error for {getattr(error, 'subtitle', '') or error.title}."
    ]
    for ind, err in enumerate(error.errors(), start=1):
        msg = err["msg"]

        loc = err.get("loc", [])
        formatted_loc = ""
        if len(loc) >= 1:
            formatted_loc = str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
            formatted_loc = f"({formatted_loc})" if formatted_loc else ""

        input_value = err["input"]
        input_type = type(input_value).__name__

        # Sliced to avoid overflowing screen
        input_value = (
            input_value
            if len(str(input_value)) < 500
            else str(input_value)[:500] + "..."
        )

        lines.append(
            f"{ind}) {formatted_loc} {msg}.\n"
            f"  Found: (input={input_value!r}, type={input_type})"
        )

    lines.append("  " + getattr(error, "hint", ""))
    return "\n".join(lines)
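The location-formatting step above is worth isolating: a Pydantic error `loc` tuple such as `("items", 0, "name")` is rendered as the access path `items[0]['name']`, which reads like real subscripting on the offending input. A self-contained sketch of just that step:

```python
def format_loc(loc: tuple) -> str:
    if not loc:
        return ""
    # First element is the bare field name; the rest become repr'd subscripts.
    return str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
```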

validate_input

validate_input(fn)

A decorator that validates function inputs and outputs.

A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.

Source code in src/chunklet/common/validation.py
def validate_input(fn):
    """
    A decorator that validates function inputs and outputs.

    A wrapper around Pydantic's `validate_call` that catches `ValidationError` and re-raises it as a more user-friendly `InvalidInputError`.
    """
    validated_fn = validate_call(fn, config=ConfigDict(arbitrary_types_allowed=True))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return validated_fn(*args, **kwargs)
        except ValidationError as e:
            raise InvalidInputError(pretty_errors(e)) from None

    return wrapper