chunklet.sentence_splitter

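Classes:

  • BaseSplitter

    Base class for sentence splitting.

  • CallbackError

    Raised when a callback function provided to chunker or splitter fails during execution.

  • CustomSplitterRegistry

  • SentenceSplitter

    A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

  • UniversalSplitter

    Language-agnostic sentence boundary detector using regex patterns.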

Functions:

  • deprecated_callable

    Decorate a function or class with a deprecation warning message.

  • log_info

    Log an info message if verbose is enabled.

  • pretty_errors

    Formats Pydantic validation errors into a human-readable string.

  • read_text_file

    Read text file with automatic encoding detection.

  • validate_input

    A decorator that validates function inputs and outputs.

BaseSplitter

Base class for sentence splitting. Defines the interface that all splitter implementations must adhere to.

Methods:

  • split

    Split text into sentences.

  • split_text

    Splits the given text into a list of sentences.

split

split(text: str, lang: str = 'auto') -> list[str]

Split text into sentences.

Note

Deprecated since 2.2.0. Will be removed in 3.0.0. Use split_text instead.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@deprecated_callable(
    use_instead="split_text", deprecated_in="2.2.0", removed_in="3.0.0"
)
def split(self, text: str, lang: str = "auto") -> list[str]:  # pragma: no cover
    """
    Split text into sentences.

    Note:
        Deprecated since 2.2.0. Will be removed in 3.0.0. Use `split_text` instead.
    """
    return self.split_text(text, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits the given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto').

Returns:

  • list[str]

    A list of sentences extracted from the text.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """Splits the given text into a list of sentences.

    Args:
        text: The input text to be split.
        lang: The language of the text (e.g., 'en', 'fr', 'auto').

    Returns:
        A list of sentences extracted from the text.
    """
    raise NotImplementedError("Subclasses must implement 'split_text'.")
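
A minimal sketch of how a subclass might satisfy this interface. The `BaseSplitter` below is a stand-in that mirrors the method shown above so the snippet runs without chunklet installed, and `LineSplitter` is a hypothetical name used only for illustration.

```python
class BaseSplitter:
    """Stand-in mirroring the interface documented above (not the chunklet class)."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        raise NotImplementedError("Subclasses must implement 'split_text'.")


class LineSplitter(BaseSplitter):
    """Hypothetical subclass: treats every non-empty line as one sentence."""

    def split_text(self, text: str, lang: str = "auto") -> list[str]:
        # `lang` is accepted for interface compatibility but ignored here.
        return [line.strip() for line in text.splitlines() if line.strip()]


sentences = LineSplitter().split_text("First line.\n\nSecond line.  \n")
print(sentences)  # ['First line.', 'Second line.']
```

Calling `split_text` on the base class itself raises `NotImplementedError`, which is what signals a missing override.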

CallbackError

Bases: ChunkletError

Raised when a callback function provided to chunker or splitter fails during execution.

CustomSplitterRegistry

Methods:

  • clear

    Clears all registered splitters from the registry.

  • is_registered

    Check if a splitter is registered for the given language.

  • register

    Register a splitter callback for one or more languages.

  • split

    Processes a text using a splitter registered for the given language.

  • unregister

    Remove splitter(s) from the registry.

Attributes:

  • splitters

    Returns a shallow copy of the dictionary of registered splitters.

splitters property

splitters

Returns a shallow copy of the dictionary of registered splitters.

This prevents external modification of the internal registry state.

clear

clear() -> None

Clears all registered splitters from the registry.

Source code in src/chunklet/sentence_splitter/registry.py
def clear(self) -> None:
    """
    Clears all registered splitters from the registry.
    """
    self._splitters.clear()

is_registered

is_registered(lang: str) -> bool

Check if a splitter is registered for the given language.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def is_registered(self, lang: str) -> bool:
    """
    Check if a splitter is registered for the given language.
    """
    return lang in self._splitters

register

register(*args: Any, name: str | None = None)

Register a splitter callback for one or more languages.

This method can be used in two ways:

  1. As a decorator:

     @registry.register("en", "fr", name="my_splitter")
     def my_splitter(text): ...

  2. As a direct function call:

     registry.register(my_splitter, "en", "fr", name="my_splitter")

Parameters:

  • *args

    (Any, default: () ) –

    The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.

  • name

    (str | None, default: None ) –

    The name of the splitter. If None, attempts to use the callback's name.

Source code in src/chunklet/sentence_splitter/registry.py
def register(self, *args: Any, name: str | None = None):
    """
    Register a splitter callback for one or more languages.

    This method can be used in two ways:

    1. As a decorator:
        @registry.register("en", "fr", name="my_splitter")
        def my_splitter(text):
            ...

    2. As a direct function call:
        registry.register(my_splitter, "en", "fr", name="my_splitter")

    Args:
        *args: The arguments, which can be either (lang1, lang2, ...) for a decorator
               or (callback, lang1, lang2, ...) for a direct call.
        name: The name of the splitter. If None, attempts to use the callback's name.
    """
    if not args:
        raise ValueError("At least one language or a callback must be provided.")

    if callable(args[0]):
        # Direct call: register(callback, lang1, lang2, ...)
        callback = args[0]
        langs = args[1:]
        if not langs:
            raise ValueError(
                "At least one language must be provided for the callback."
            )
        self._register_logic(langs, callback, name)
        return callback
    else:
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator
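
The dual-mode behaviour can be tried in isolation with a small stand-in. This is only a sketch: `MiniRegistry` and `_store` are hypothetical names, and `_store` is an assumed simplification of the `_register_logic` helper, whose body is not shown on this page.

```python
from typing import Any, Callable, Optional


class MiniRegistry:
    """Hypothetical stand-in illustrating the dual-mode `register` pattern."""

    def __init__(self) -> None:
        self._splitters = {}  # lang -> (name, callback)

    def _store(self, langs, callback, name) -> None:
        # Assumed behaviour: fall back to the callback's own name when none is given.
        resolved = name or getattr(callback, "__name__", "anonymous")
        for lang in langs:
            self._splitters[lang] = (resolved, callback)

    def register(self, *args: Any, name: Optional[str] = None):
        if not args:
            raise ValueError("At least one language or a callback must be provided.")
        if callable(args[0]):
            # Direct call: register(callback, lang1, lang2, ...)
            callback, langs = args[0], args[1:]
            if not langs:
                raise ValueError("At least one language must be provided for the callback.")
            self._store(langs, callback, name)
            return callback

        # Decorator: @register(lang1, lang2, ...)
        def decorator(cb: Callable):
            self._store(args, cb, name)
            return cb

        return decorator


registry = MiniRegistry()


@registry.register("en", "fr", name="by_period")
def by_period(text: str) -> list[str]:
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def by_line(text: str) -> list[str]:
    return text.splitlines()


registry.register(by_line, "xx")  # direct call; name falls back to "by_line"
print(sorted(registry._splitters))  # ['en', 'fr', 'xx']
```

Both forms return the original callback unchanged, so a decorated function stays directly callable.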

split

split(text: str, lang: str) -> tuple[list[str], str]

Processes a text using a splitter registered for the given language.

Parameters:

  • text

    (str) –

    The text to split.

  • lang

    (str) –

    The language of the text.

Returns:

  • tuple[list[str], str]

    A tuple containing a list of sentences and the name of the splitter used.

Raises:

  • CallbackError

    If no splitter is registered for the language, if the callback raises, or if it returns anything other than a list of strings.

Examples:

>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')
Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def split(self, text: str, lang: str) -> tuple[list[str], str]:
    """
    Processes a text using a splitter registered for the given language.

    Args:
        text: The text to split.
        lang: The language of the text.

    Returns:
        A tuple containing a list of sentences and the name of the splitter used.

    Raises:
        CallbackError: If no splitter is registered for the language, if the
            callback raises, or if it returns anything other than a list of strings.

    Examples:
        >>> from chunklet.sentence_splitter import CustomSplitterRegistry
        >>> registry = CustomSplitterRegistry()
        >>> @registry.register("xx", name="custom_splitter")
        ... def custom_splitter(text: str) -> list[str]:
        ...     return text.split(" ")
        >>> registry.split("Hello World", "xx")
        (['Hello', 'World'], 'custom_splitter')
    """
    splitter_info = self._splitters.get(lang)
    if not splitter_info:
        raise CallbackError(
            f"No splitter registered for language '{lang}'.\n"
            f"💡Hint: Use `.register(your_function, '{lang}')` first."
        )

    name, callback = splitter_info

    try:
        # Run the callback, then validate that it returned a list of strings
        result = callback(text)
        validator = TypeAdapter(list[str])
        validator.validate_python(result)
    except ValidationError as e:
        e.subtitle = f"{name} result"
        e.hint = "💡Hint: Make sure your splitter returns a list of strings."
        raise CallbackError(f"{pretty_errors(e)}.\n") from None
    except Exception as e:
        raise CallbackError(
            f"Splitter '{name}' for lang '{lang}' raised an exception.\nDetails: {e}"
        ) from None

    return result, name
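
The error-wrapping idea above can be sketched without Pydantic. The snippet below swaps the `TypeAdapter` validation for plain `isinstance` checks and uses a stand-in `CallbackError`; `run_splitter` is a hypothetical helper name, not part of chunklet.

```python
class CallbackError(Exception):
    """Stand-in for chunklet's CallbackError."""


def run_splitter(name, callback, text):
    """Call `callback` and make sure it returns a list of strings."""
    try:
        result = callback(text)
    except Exception as exc:
        # Hide the original traceback and surface one uniform error type.
        raise CallbackError(
            f"Splitter '{name}' raised an exception.\nDetails: {exc}"
        ) from None

    # Simplified type check (the source validates with Pydantic instead).
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise CallbackError(f"Splitter '{name}' must return a list of strings.")
    return result, name


print(run_splitter("whitespace", str.split, "Hello World"))
# (['Hello', 'World'], 'whitespace')

try:
    run_splitter("broken", lambda text: text.upper(), "Hello World")
except CallbackError as err:
    print(err)  # Splitter 'broken' must return a list of strings.
```

Callers then only need to handle one exception type, whether the callback crashed or returned the wrong shape.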

unregister

unregister(*langs: str) -> None

Remove splitter(s) from the registry.

Parameters:

  • *langs

    (str, default: () ) –

    Language codes to remove

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def unregister(self, *langs: str) -> None:
    """
    Remove splitter(s) from the registry.

    Args:
        *langs: Language codes to remove
    """
    for lang in langs:
        self._splitters.pop(lang, None)

SentenceSplitter

SentenceSplitter(verbose: bool = False)

Bases: BaseSplitter

A robust and versatile utility dedicated to precisely segmenting text into individual sentences.

Key Features:

  • Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
  • Custom Splitters: Uses centralized registry for custom splitting logic.
  • Fallback Mechanism: Employs a universal rule-based splitter for unsupported languages.
  • Robust Error Handling: Provides clear error reporting for issues with custom splitters.
  • Intelligent Post-processing: Cleans up split sentences by filtering empty strings and rejoining stray punctuation.

Initializes the SentenceSplitter.

Parameters:

  • verbose

    (bool, default: False ) –

    If True, enables verbose logging for debugging and informational messages.

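Methods:

  • detected_top_language

    Detects the top language of the given text using py3langid.

  • split_file

    Read and split a file into sentences.

  • split_text

    Splits a given text into a list of sentences.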

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def __init__(self, verbose: bool = False):
    """
    Initializes the SentenceSplitter.

    Args:
        verbose: If True, enables verbose logging for debugging and informational messages.
    """
    self.verbose = verbose
    self.fallback_splitter = UniversalSplitter()

    # Create a normalized identifier for language detection
    self._identifier = LanguageIdentifier.from_pickled_model(
        MODEL_FILE, norm_probs=True
    )

    # Tracked to reduce log spamming about language detection
    self._last_lang_used = None

detected_top_language

detected_top_language(text: str) -> tuple[str, float]

Detects the top language of the given text using py3langid.

Parameters:

  • text

    (str) –

    The input text to detect the language for.

Returns:

  • tuple[str, float]

    A tuple containing the detected language code and its confidence.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def detected_top_language(self, text: str) -> tuple[str, float]:
    """
    Detects the top language of the given text using py3langid.

    Args:
        text: The input text to detect the language for.

    Returns:
        A tuple containing the detected language code and its confidence.
    """
    lang_detected, confidence = self._identifier.classify(text)
    log_info(
        self.verbose,
        "Language detection: '{}' with confidence {}.",
        lang_detected,
        f"{round(confidence * 10)}/10",
    )
    return lang_detected, confidence

split_file

split_file(
    path: str | Path, lang: str = "auto"
) -> list[str]

Read and split a file into sentences.

Parameters:

  • path

    (str | Path) –

    Path to the file to read.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

Returns:

  • list[str]

    A list of sentences extracted from the file.

Source code in src/chunklet/sentence_splitter/sentence_splitter.py
def split_file(self, path: str | Path, lang: str = "auto") -> list[str]:
    """
    Read and split a file into sentences.

    Args:
        path: Path to the file to read.
        lang: The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to 'auto'.

    Returns:
        A list of sentences extracted from the file.
    """
    content = read_text_file(path)
    return self.split_text(content, lang)

split_text

split_text(text: str, lang: str = 'auto') -> list[str]

Splits a given text into a list of sentences.

Parameters:

  • text

    (str) –

    The input text to be split.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

Returns:

  • list[str]

    A list of sentences.

Examples:

>>> splitter = SentenceSplitter()
>>> splitter.split_text("Hello world. How are you?", "en")
['Hello world.', 'How are you?']
>>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
['Bonjour le monde.', 'Comment allez-vous?']
>>> splitter.split_text("Hello world. How are you?", "auto")
['Hello world.', 'How are you?']
Source code in src/chunklet/sentence_splitter/sentence_splitter.py
@validate_input
def split_text(self, text: str, lang: str = "auto") -> list[str]:
    """
    Splits a given text into a list of sentences.

    Args:
        text: The input text to be split.
        lang: The language of the text (e.g., 'en', 'fr'). Defaults to 'auto'.

    Returns:
        A list of sentences.

    Examples:
        >>> splitter = SentenceSplitter()
        >>> splitter.split_text("Hello world. How are you?", "en")
        ['Hello world.', 'How are you?']
        >>> splitter.split_text("Bonjour le monde. Comment allez-vous?", "fr")
        ['Bonjour le monde.', 'Comment allez-vous?']
        >>> splitter.split_text("Hello world. How are you?", "auto")
        ['Hello world.', 'How are you?']
    """
    if not text:
        log_info(self.verbose, "Input text is empty. Returning empty list.")
        return []

    if lang == "auto":
        if self._last_lang_used is None:
            logger.warning(
                "The language is set to `auto`. Consider setting the `lang` parameter "
                "to a specific language to improve reliability."
            )
        lang_detected, confidence = self.detected_top_language(text)
        lang = lang_detected if confidence >= 0.7 else "fallback"

    self._last_lang_used = lang

    sentences = None
    if lang != "fallback":
        # Prioritize custom splitters from registry
        if custom_splitter_registry.is_registered(lang):
            sentences, splitter_name = custom_splitter_registry.split(text, lang)
            log_info(self.verbose, "Using registered splitter: {}", splitter_name)
        elif (handler := self._get_special_lang_handler(lang, self.verbose)) is not None:
            sentences = handler(text)

    # If no handler found, use fallback
    if sentences is None:
        logger.warning(
            "Using a universal rule-based splitter.\n"
            "Reason: Language not supported or detected with low confidence."
        )
        sentences = self.fallback_splitter.split(text)

    cleaned_sentences = self._clean_sentences(sentences)
    log_info(
        self.verbose,
        "Text split into sentences. Total sentences detected: {}",
        len(cleaned_sentences),
    )
    return cleaned_sentences
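
The selection order in `split_text` (resolve `auto`, then custom registry, then built-in handler, then universal fallback) can be summarised as a small pure function. This is a sketch under assumptions: `choose_strategy` is a hypothetical name, and the `custom_langs` and `builtin_langs` sets stand in for the registry lookup and for `_get_special_lang_handler`, which is not shown on this page.

```python
def choose_strategy(lang, detected, confidence, custom_langs, builtin_langs):
    """Return (resolved_lang, strategy) following the order used in split_text."""
    if lang == "auto":
        # Trust the detector only at or above the 0.7 confidence threshold.
        lang = detected if confidence >= 0.7 else "fallback"

    if lang != "fallback":
        if lang in custom_langs:       # custom splitters take priority
            return lang, "custom"
        if lang in builtin_langs:      # then language-specific handlers
            return lang, "builtin"

    return lang, "universal"           # rule-based fallback


print(choose_strategy("auto", "en", 0.95, set(), {"en"}))   # ('en', 'builtin')
print(choose_strategy("auto", "en", 0.40, set(), {"en"}))   # ('fallback', 'universal')
print(choose_strategy("xx", "en", 0.95, {"xx"}, {"en"}))    # ('xx', 'custom')
print(choose_strategy("zz", "en", 0.95, set(), {"en"}))     # ('zz', 'universal')
```

An explicit `lang` bypasses detection entirely, which is why the source logs a warning when `auto` is used.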

UniversalSplitter

UniversalSplitter()

Language-agnostic sentence boundary detector using regex patterns.

A universal splitter using Unicode-aware regex patterns for any language.

Handles:

  • Unicode sentence terminators
  • Numbered lists and headings
  • Quoted sentences
  • Line breaks and whitespace

Use cases:

  • Primary splitter for languages without dedicated support
  • Fallback when language-specific splitters are unavailable

Methods:

  • split

    Splits text into sentences using rule-based regex patterns.

Source code in src/chunklet/sentence_splitter/_universal_splitter.py
def __init__(self):
    self.sentence_terminators = "".join(GLOBAL_SENTENCE_TERMINATORS)
    self.flattened_numbered_list_pattern = re.compile(
        rf"(?<=[{self.sentence_terminators}:])\s+(\p{{N}}\.)+"
    )

    self.quote_or_paren_pattern = re.compile(
        r"(\p{Pi}|['\"]).+?(\p{Pf}|\1)|"
        r"\p{Ps}.+?\p{Pe}",
        re.DOTALL,
    )

    self.hashed_pattern = re.compile(r"##-?\d+##")
    self.numbered_list_pattern = re.compile(r"[\n:]\s*\p{N}\.")

    # Core sentence split regex
    self.sentence_end_pattern = re.compile(
        rf"""
        (?<!\b(\p{{Lu}}\p{{Ll}}{{1, 4}}\.)*)   # Latin-only abbreviation
        (?<=[{self.sentence_terminators}])       # sentence-ending punctuation
        (?=\s+[\p{{Lu}}\p{{Lo}}\p{{Lt}}]|\s*\n|\s*$)  # followed by letter (upper or catch-all) or end
        """,
        re.VERBOSE,
    )

split

split(text: str) -> list[str]

Splits text into sentences using rule-based regex patterns.

Parameters:

  • text

    (str) –

    The input text to be segmented into sentences.

Returns:

  • list[str]

    A list of sentences after segmentation.

Source code in src/chunklet/sentence_splitter/_universal_splitter.py
def split(self, text: str) -> list[str]:
    """
    Splits text into sentences using rule-based regex patterns.

    Args:
        text: The input text to be segmented into sentences.

    Returns:
        A list of sentences after segmentation.
    """
    def mask(match: re.Match, norm_map: dict):
        # Build a placeholder from the match's integer hash. It is formatted
        # as a string because re.sub requires a string replacement, and it is
        # fenced with ## so it can be found again later.
        hashed_str = f"##{hash(match.group())}##"

        # Store the mapping for later reconstruction
        norm_map[hashed_str] = match.group()
        return hashed_str

    def unmask(match: re.Match, norm_map: dict):
        return norm_map.get(match.group(), match.group())

    text = self.flattened_numbered_list_pattern.sub(r"\n \1", text.strip())

    # Mask quoted/bracketed spans and numbered-list markers so they are not split
    norm_map = {}
    text = self.quote_or_paren_pattern.sub(
        lambda m: mask(m, norm_map), text
    )
    text = self.numbered_list_pattern.sub(
        lambda m: mask(m, norm_map), text
    )

    # First split on sentence-ending punctuation,
    # then split further on newlines
    final_sentences = []
    sentences = self.sentence_end_pattern.split(text.strip())
    for sent in sentences:
        if sent:
            final_sentences.extend(sent.strip().splitlines())

    # Restore the masked spans
    return [
        self.hashed_pattern.sub(lambda m: unmask(m, norm_map), sent)
        for sent in final_sentences if sent.strip()
    ]
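
The mask, split, unmask sequence above can be illustrated with the standard-library `re` module. This is a simplified sketch: the three patterns below are assumptions chosen for the example (ASCII double quotes and `.!?` only), not the Unicode-aware patterns the class compiles, and `split_protecting_quotes` is a hypothetical name.

```python
import re

QUOTED = re.compile(r'"[^"]*"')             # spans to protect from splitting
PLACEHOLDER = re.compile(r"##-?\d+##")      # fenced hash, as in the source
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # simplified sentence boundary


def split_protecting_quotes(text: str) -> list[str]:
    masked = {}  # placeholder -> original span

    def mask(match: re.Match) -> str:
        key = f"##{hash(match.group())}##"
        masked[key] = match.group()
        return key

    # 1. Replace protected spans with placeholders.
    protected = QUOTED.sub(mask, text)
    # 2. Split on sentence boundaries; placeholders contain no terminators.
    parts = SENTENCE_END.split(protected)
    # 3. Put the original spans back.
    return [
        PLACEHOLDER.sub(lambda m: masked.get(m.group(), m.group()), part)
        for part in parts
        if part.strip()
    ]


print(split_protecting_quotes('She said "Wait. Stop!" and left. He stayed.'))
# ['She said "Wait. Stop!" and left.', 'He stayed.']
```

Without the masking step, the period inside the quotation would produce an extra split after `"Wait.`.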

deprecated_callable

deprecated_callable(
    use_instead: str, deprecated_in: str, removed_in: str
) -> Callable

Decorate a function or class with a deprecation warning message.

This decorator marks a function or class as deprecated.

Parameters:

  • use_instead

    (str) –

    Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").

  • deprecated_in

    (str) –

    Version when the function was deprecated (e.g., "2.2.0").

  • removed_in

    (str) –

    Version when the function will be removed (e.g., "3.0.0").

Returns:

  • Callable

    Decorator function that wraps the source function/class.

Source code in src/chunklet/common/deprecation.py
def deprecated_callable(
    use_instead: str,
    deprecated_in: str,
    removed_in: str,
) -> Callable:
    """Decorate a function or class with a deprecation warning message.

    This decorator marks a function or class as deprecated.

    Args:
        use_instead: Replacement name (e.g., "split_text", "DocumentChunker", or "chunk_text or chunk_file").
        deprecated_in: Version when the function was deprecated (e.g., "2.2.0").
        removed_in: Version when the function will be removed (e.g., "3.0.0").

    Returns:
        Decorator function that wraps the source function/class.
    """

    def decorator(func_or_cls: Callable) -> Callable:
        warn_message = (
            f"`{func_or_cls.__qualname__}` has been deprecated since v{deprecated_in} "
            f"in favor of `{use_instead}`. It will be removed in v{removed_in}."
        )
        remove_message = (
            f"`{func_or_cls.__qualname__}` was removed in v{removed_in}. "
            f"Use `{use_instead}` instead."
        )

        @functools.wraps(func_or_cls)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if Version(CURRENT_VERSION) >= Version(removed_in):
                raise AttributeError(remove_message)
            warnings.warn(warn_message, FutureWarning, stacklevel=2)
            return func_or_cls(*args, **kwargs)

        return wrapper

    return decorator
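
A sketch of what a caller observes when using a decorated function. To stay self-contained, the `deprecated` decorator below is a simplified stand-in that only emits the warning and omits the `CURRENT_VERSION` check; the decorated `split` function is a hypothetical example.

```python
import functools
import warnings


def deprecated(use_instead: str, deprecated_in: str, removed_in: str):
    """Simplified stand-in: warns only, without the removal-version check."""

    def decorator(func):
        message = (
            f"`{func.__qualname__}` has been deprecated since v{deprecated_in} "
            f"in favor of `{use_instead}`. It will be removed in v{removed_in}."
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller, not the wrapper.
            warnings.warn(message, FutureWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated(use_instead="split_text", deprecated_in="2.2.0", removed_in="3.0.0")
def split(text: str) -> list[str]:
    return text.split(". ")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = split("One. Two")

print(result)                        # ['One', 'Two']
print(caught[0].category.__name__)   # FutureWarning
```

Because the wrapper uses `functools.wraps`, the decorated function keeps its original name and docstring.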

log_info

log_info(verbose: bool, *args, **kwargs) -> None

Log an info message if verbose is enabled.

This is a convenience function that only logs when verbose mode is enabled, avoiding unnecessary log output in production.

Parameters:

  • verbose

    (bool) –

    If True, logs the message; if False, does nothing.

  • *args

    Positional arguments passed to logger.info().

  • **kwargs

    Keyword arguments passed to logger.info().

Example

>>> log_info(True, "Processing file: {}", filepath)
Processing file: /path/to/file
>>> log_info(False, "This will not be logged")
(no output)

Source code in src/chunklet/common/logging_utils.py
def log_info(verbose: bool, *args, **kwargs) -> None:
    """Log an info message if verbose is enabled.

    This is a convenience function that only logs when verbose mode is enabled,
    avoiding unnecessary log output in production.

    Args:
        verbose: If True, logs the message; if False, does nothing.
        *args: Positional arguments passed to logger.info().
        **kwargs: Keyword arguments passed to logger.info().

    Example:
        >>> log_info(True, "Processing file: {}", filepath)
        Processing file: /path/to/file
        >>> log_info(False, "This will not be logged")
        (no output)
    """
    if verbose:
        logger.info(*args, **kwargs)
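
The same gating pattern, sketched with the standard-library `logging` module so it runs anywhere. Note the assumption: the source examples use `{}` placeholders, which suggests a brace-style logger, whereas the stdlib logger below uses `%s`. `ListHandler` and the logger name are hypothetical and exist only to capture output for inspection.

```python
import logging

logger = logging.getLogger("chunklet_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)


def log_info(verbose: bool, *args, **kwargs) -> None:
    # Forward to the logger only when verbose mode is on.
    if verbose:
        logger.info(*args, **kwargs)


log_info(True, "Processing file: %s", "/path/to/file")
log_info(False, "This will not be logged")
print(handler.messages)  # ['Processing file: /path/to/file']
```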

pretty_errors

pretty_errors(error: ValidationError) -> str

Formats Pydantic validation errors into a human-readable string.

Source code in src/chunklet/common/validation.py
def pretty_errors(error: ValidationError) -> str:
    """Formats Pydantic validation errors into a human-readable string."""
    lines = [
        f"{error.error_count()} validation error for {getattr(error, 'subtitle', '') or error.title}."
    ]
    for ind, err in enumerate(error.errors(), start=1):
        msg = err["msg"]

        loc = err.get("loc", [])
        formatted_loc = ""
        if len(loc) >= 1:
            formatted_loc = str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
            formatted_loc = f"({formatted_loc})" if formatted_loc else ""

        input_value = err["input"]
        input_type = type(input_value).__name__

        # Use reprlib for auto-truncation on non-strings (faster for lists/dicts/nested)
        if not isinstance(input_value, str):
            input_value = reprlib.repr(input_value)
        else:
            input_value = input_value if len(input_value) < 500 else input_value[:500] + "..."

        lines.append(
            (
                f"{ind}) {formatted_loc} {msg}.\n"
                f"  Found: (input={input_value!r}, type={input_type})"
            )
        )

    lines.append("  " + getattr(error, "hint", ""))
    return "\n".join(lines)
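
Two pieces of this formatter are easy to check on their own: how an error location tuple is rendered, and how long inputs are shortened. The helper names `format_loc` and `preview` are hypothetical; the logic is lifted from the listing above.

```python
import reprlib


def format_loc(loc: tuple) -> str:
    """Render an error location such as ('items', 0, 'name')."""
    if not loc:
        return ""
    path = str(loc[0]) + "".join(f"[{step!r}]" for step in loc[1:])
    return f"({path})" if path else ""


def preview(value) -> str:
    """Shorten long inputs: reprlib for non-strings, slicing for strings."""
    if not isinstance(value, str):
        return reprlib.repr(value)
    return value if len(value) < 500 else value[:500] + "..."


print(format_loc(("items", 0, "name")))  # (items[0]['name'])
print(format_loc(("text",)))             # (text)
print(preview(list(range(100))))         # abbreviated list ending in '...]'
print(len(preview("x" * 600)))           # 503
```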

read_text_file

read_text_file(path: str | Path) -> str

Read text file with automatic encoding detection.

Parameters:

  • path

    (str | Path) –

    File path to read.

Returns:

  • str

    File content.

Raises:

  • FileProcessingError

    If file cannot be read.

Source code in src/chunklet/common/path_utils.py
@validate_input
def read_text_file(path: str | Path) -> str:
    """Read text file with automatic encoding detection.

    Args:
        path: File path to read.

    Returns:
        File content.

    Raises:
        FileProcessingError: If file cannot be read.
    """
    from charset_normalizer import from_path

    path = Path(path)

    if not path.exists():
        raise FileProcessingError(f"File does not exist: {path}")

    if _is_binary_file(path):
        raise FileProcessingError(f"Binary file not supported: {path}")

    match = from_path(str(path)).best()
    return str(match) if match else ""

validate_input

validate_input(fn)

A decorator that validates function inputs and outputs.

A wrapper around Pydantic's validate_call that catches ValidationError and re-raises it as a more user-friendly InvalidInputError.

Source code in src/chunklet/common/validation.py
def validate_input(fn):
    """
    A decorator that validates function inputs and outputs.

    A wrapper around Pydantic's `validate_call` that catches `ValidationError` and re-raises it as a more user-friendly `InvalidInputError`.
    """
    validated_fn = validate_call(fn, config=ConfigDict(arbitrary_types_allowed=True))

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return validated_fn(*args, **kwargs)
        except ValidationError as e:
            raise InvalidInputError(pretty_errors(e)) from None

    return wrapper