chunklet.sentence_splitter.registry

Classes:

CustomSplitterRegistry

Methods:

  • clear

    Clears all registered splitters from the registry.

  • is_registered

    Check if a splitter is registered for the given language.

  • register

    Register a splitter callback for one or more languages.

  • split

    Processes a text using a splitter registered for the given language.

  • unregister

    Remove splitter(s) from the registry.

Attributes:

  • splitters

    Returns a shallow copy of the dictionary of registered splitters.

splitters property

splitters

Returns a shallow copy of the dictionary of registered splitters.

This prevents external modification of the internal registry state.
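
A minimal sketch of why the copy matters. `MiniRegistry` is a hypothetical stand-in, not the library class; it assumes the property is implemented with `dict.copy()` and that each entry is a `(name, callback)` tuple, as the unpacking in `split` suggests.

```python
from typing import Callable


class MiniRegistry:
    """Hypothetical stand-in for CustomSplitterRegistry (illustration only)."""

    def __init__(self) -> None:
        # Assumed layout: language code -> (splitter name, callback).
        self._splitters: dict[str, tuple[str, Callable]] = {}

    @property
    def splitters(self) -> dict[str, tuple[str, Callable]]:
        # Assumption: a shallow copy, so every access yields a new dict object.
        return self._splitters.copy()


registry = MiniRegistry()
registry._splitters["en"] = ("by_period", lambda text: text.split(". "))

snapshot = registry.splitters
snapshot.pop("en")                              # mutates only the copy
snapshot["fr"] = ("noop", lambda text: [text])  # never reaches the registry

print(sorted(registry.splitters))  # ['en']
```

Because the copy is shallow, the callbacks inside the returned dictionary are the same objects the registry holds; only the mapping itself is duplicated.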

clear

clear() -> None

Clears all registered splitters from the registry.

Source code in src/chunklet/sentence_splitter/registry.py
def clear(self) -> None:
    """
    Clears all registered splitters from the registry.
    """
    self._splitters.clear()

is_registered

is_registered(lang: str) -> bool

Check if a splitter is registered for the given language.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def is_registered(self, lang: str) -> bool:
    """
    Check if a splitter is registered for the given language.
    """
    return lang in self._splitters

register

register(*args: Any, name: str | None = None)

Register a splitter callback for one or more languages.

This method can be used in two ways:

  1. As a decorator: @registry.register("en", "fr", name="my_splitter") placed directly above def my_splitter(text): ...

  2. As a direct function call: registry.register(my_splitter, "en", "fr", name="my_splitter")

Parameters:

  • *args

    (Any, default: () ) –

    The arguments, which can be either (lang1, lang2, ...) for a decorator or (callback, lang1, lang2, ...) for a direct call.

  • name

    (str | None, default: None ) –

    The name of the splitter. If None, the registry attempts to use the callback's own name.

Source code in src/chunklet/sentence_splitter/registry.py
def register(self, *args: Any, name: str | None = None):
    """
    Register a splitter callback for one or more languages.

    This method can be used in two ways:
    1. As a decorator:
        @registry.register("en", "fr", name="my_splitter")
        def my_splitter(text):
            ...

    2. As a direct function call:
        registry.register(my_splitter, "en", "fr", name="my_splitter")

    Args:
        *args: The arguments, which can be either (lang1, lang2, ...) for a decorator
               or (callback, lang1, lang2, ...) for a direct call.
        name (str, optional): The name of the splitter. If None, attempts to use the callback's name.
    """
    if not args:
        raise ValueError("At least one language or a callback must be provided.")

    if callable(args[0]):
        # Direct call: register(callback, lang1, lang2, ...)
        callback = args[0]
        langs = args[1:]
        if not langs:
            raise ValueError(
                "At least one language must be provided for the callback."
            )
        self._register_logic(langs, callback, name)
        return callback
    else:
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator
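
The two calling styles can be tried out with a small stand-in. In the sketch below, `MiniRegistry` and its `_register_logic` are hypothetical: the real `_register_logic` is not shown on this page, so the version here simply assumes it stores a `(name, callback)` pair per language and falls back to the function's `__name__`.

```python
from typing import Any, Callable, Optional


class MiniRegistry:
    """Hypothetical stand-in that mirrors the dispatch in register()."""

    def __init__(self) -> None:
        self._splitters: dict[str, tuple[str, Callable]] = {}

    def _register_logic(self, langs, callback, name) -> None:
        # Assumed behaviour: default the name to the function's __name__.
        resolved = name or getattr(callback, "__name__", "unnamed")
        for lang in langs:
            self._splitters[lang] = (resolved, callback)

    def register(self, *args: Any, name: Optional[str] = None):
        if not args:
            raise ValueError("At least one language or a callback must be provided.")
        if callable(args[0]):
            # Direct call: register(callback, lang1, lang2, ...)
            callback, langs = args[0], args[1:]
            if not langs:
                raise ValueError("At least one language must be provided for the callback.")
            self._register_logic(langs, callback, name)
            return callback
        # Decorator: @register(lang1, lang2, ...)
        langs = args

        def decorator(cb: Callable):
            self._register_logic(langs, cb, name)
            return cb

        return decorator


registry = MiniRegistry()


@registry.register("en", "fr", name="whitespace")
def by_space(text: str) -> list[str]:
    return text.split()


def by_line(text: str) -> list[str]:
    return text.splitlines()


registry.register(by_line, "de")

print(registry._splitters["fr"][0])  # whitespace
print(registry._splitters["de"][0])  # by_line
```

Both forms return the original function unchanged, so a decorated splitter can still be called directly. Using the decorator without parentheses (`@registry.register`) passes the function as the only argument, which the source above rejects with a ValueError because no language was given.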

split

split(text: str, lang: str) -> tuple[list[str], str]

Processes a text using a splitter registered for the given language.

Parameters:

  • text

    (str) –

    The text to split.

  • lang

    (str) –

    The language of the text.

Returns:

  • tuple[list[str], str]

    A tuple containing the list of sentences and the name of the splitter that produced them.

Raises:

  • CallbackError

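    If no splitter is registered for lang, if the splitter callback raises an exception, or if it returns something other than a list of strings.

  • TypeError

    Listed in the docstring for a wrong return type. In the source shown for this method, that case is caught as a ValidationError and re-raised as CallbackError.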

Examples:

>>> from chunklet.sentence_splitter import CustomSplitterRegistry
>>> registry = CustomSplitterRegistry()
>>> @registry.register("xx", name="custom_splitter")
... def custom_splitter(text: str) -> list[str]:
...     return text.split(" ")
>>> registry.split("Hello World", "xx")
(['Hello', 'World'], 'custom_splitter')

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def split(self, text: str, lang: str) -> tuple[list[str], str]:
    """
    Processes a text using a splitter registered for the given language.

    Args:
        text (str): The text to split.
        lang (str): The language of the text.

    Returns:
        tuple[list[str], str]: A tuple containing a list of sentences and the name of the splitter used.

    Raises:
        CallbackError: If the splitter callback fails.
        TypeError: If the splitter returns the wrong type.

    Examples:
        >>> from chunklet.sentence_splitter import CustomSplitterRegistry
        >>> registry = CustomSplitterRegistry()
        >>> @registry.register("xx", name="custom_splitter")
        ... def custom_splitter(text: str) -> list[str]:
        ...     return text.split(" ")
        >>> registry.split("Hello World", "xx")
        (['Hello', 'World'], 'custom_splitter')
    """
    splitter_info = self._splitters.get(lang)
    if not splitter_info:
        raise CallbackError(
            f"No splitter registered for language '{lang}'.\n"
            f"💡Hint: Use `.register(your_function, '{lang}')` first."
        )

    name, callback = splitter_info

    try:
        # Run the callback, then validate that it returned a list of strings
        result = callback(text)
        validator = TypeAdapter(list[str])
        validator.validate_python(result)
    except ValidationError as e:
        e.subtitle = f"{name} result"
        e.hint = "💡Hint: Make sure your splitter returns a list of strings."
        raise CallbackError(f"{pretty_errors(e)}.\n") from None
    except Exception as e:
        raise CallbackError(
            f"Splitter '{name}' for lang '{lang}' raised an exception.\nDetails: {e}"
        ) from None

    return result, name
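
The error handling above means a caller only has to catch one exception type. The sketch below imitates that contract without chunklet or pydantic installed: `CallbackError` and `run_splitter` are hypothetical stand-ins, and a plain `isinstance` check replaces the `TypeAdapter(list[str])` validation, so it is stricter and simpler than the real check.

```python
from typing import Callable


class CallbackError(Exception):
    """Hypothetical stand-in for chunklet's CallbackError."""


def run_splitter(name: str, callback: Callable, text: str) -> tuple[list[str], str]:
    """Call a splitter and wrap any failure, loosely following split()."""
    try:
        result = callback(text)
    except Exception as exc:
        # Any exception from the callback is re-raised as CallbackError.
        raise CallbackError(
            f"Splitter '{name}' raised an exception. Details: {exc}"
        ) from None
    # Simplified stand-in for the pydantic validation in the real method.
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise CallbackError(f"Splitter '{name}' must return a list of strings.")
    return result, name


def good(text: str) -> list[str]:
    return text.split(" ")


def wrong_type(text: str) -> str:
    return text  # a bare string, not a list of strings


def crashes(text: str) -> list[str]:
    raise RuntimeError("boom")


print(run_splitter("good", good, "Hello World"))  # (['Hello', 'World'], 'good')

for name, fn in [("wrong_type", wrong_type), ("crashes", crashes)]:
    try:
        run_splitter(name, fn, "Hello World")
    except CallbackError as err:
        print(f"{name}: {err}")
```

Both failing splitters surface as `CallbackError`, which matches the two `except` branches in the source: one for a failed validation and one for any other exception raised by the callback.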

unregister

unregister(*langs: str) -> None

Remove splitter(s) from the registry.

Parameters:

  • *langs

    (str, default: () ) –

    Language codes to remove. Codes with no registered splitter are ignored.

Source code in src/chunklet/sentence_splitter/registry.py
@validate_input
def unregister(self, *langs: str) -> None:
    """
    Remove splitter(s) from the registry.

    Args:
        *langs: Language codes to remove
    """
    for lang in langs:
        self._splitters.pop(lang, None)
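
Because the loop uses `dict.pop(lang, None)`, removing an unknown or already removed language does not raise. A minimal sketch of that behaviour, with a plain module-level dictionary and a free function standing in for the registry (both hypothetical):

```python
# Stand-in for the registry's internal mapping: lang -> (name, callback).
splitters = {
    "en": ("whitespace", str.split),
    "fr": ("whitespace", str.split),
}


def unregister(*langs: str) -> None:
    for lang in langs:
        # The None default suppresses the KeyError for missing keys.
        splitters.pop(lang, None)


unregister("en", "zz")  # "zz" was never registered; no error is raised
unregister("en")        # repeating a removal is also a no-op

print(sorted(splitters))  # ['fr']
```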