chunklet.plain_text_chunker

Classes:

  • PlainTextChunker

    A powerful text chunking utility offering flexible strategies for optimal text segmentation.

PlainTextChunker

PlainTextChunker(
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
)

A powerful text chunking utility offering flexible strategies for optimal text segmentation.

Key Features:

  • Flexible Constraint-Based Chunking: Segment text by specifying limits on sentence count, token count, and section breaks, or any combination of them.
  • Clause-Level Overlap: Ensures semantic continuity between chunks by overlapping at natural clause boundaries, with a customizable continuation marker.
  • Multilingual Support: Leverages language-specific algorithms and language detection for broad coverage.
  • Pluggable Token Counters: Integrate custom token counting functions (e.g., for specific LLM tokenizers).
  • Parallel Processing: Efficiently handles batch chunking of multiple texts using multiprocessing.
  • Memory-Friendly Batching: Yields chunks one at a time, reducing memory usage, especially for very large documents.

Initialize the PlainTextChunker.

Parameters:

  • sentence_splitter

    (BaseSplitter, default: None ) –

    An optional BaseSplitter instance. If None, a default SentenceSplitter will be initialized.

  • verbose

    (bool, default: False ) –

    Enable verbose logging.

  • continuation_marker

    (str, default: '...' ) –

    The marker prepended to clauses that are carried over into the next chunk. Defaults to '...'.

  • token_counter

    (Callable[[str], int], default: None ) –

    Function that counts tokens in text. If None, a token counter must be provided when calling chunk() or batch_chunk() with max_tokens set.

Raises:

  • InvalidInputError

    If any of the input arguments are invalid or if the provided sentence_splitter is not an instance of BaseSplitter.

Methods:

  • batch_chunk

    Processes a batch of texts in parallel, splitting each into chunks.

  • chunk

    Chunks a single text into smaller pieces based on specified parameters.

Attributes:

  • verbose (bool) –

    Get the verbosity status.

Source code in src/chunklet/plain_text_chunker.py
@validate_input
def __init__(
    self,
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
):
    """
    Initialize the PlainTextChunker.

    Args:
        sentence_splitter (BaseSplitter, optional): An optional BaseSplitter instance.
            If None, a default SentenceSplitter will be initialized.
        verbose (bool): Enable verbose logging.
        continuation_marker (str): The marker prepended to clauses that are carried
            over into the next chunk. Defaults to '...'.
        token_counter (Callable[[str], int], optional): Function that counts tokens in text.
            If None, a token counter must be provided when calling chunk() or
            batch_chunk() with max_tokens set.

    Raises:
        InvalidInputError: If any of the input arguments are invalid or if the provided `sentence_splitter` is not an instance of `BaseSplitter`.
    """
    self._verbose = verbose
    self.token_counter = token_counter
    self.continuation_marker = continuation_marker

    if sentence_splitter is not None and not isinstance(
        sentence_splitter, BaseSplitter
    ):
        raise InvalidInputError(
            f"The provided sentence_splitter must be an instance of BaseSplitter, "
            f"but got {type(sentence_splitter).__name__}."
        )

    # Initialize SentenceSplitter
    self.sentence_splitter = sentence_splitter or SentenceSplitter()
    self.sentence_splitter.verbose = self._verbose

verbose property writable

verbose: bool

Get the verbosity status.

batch_chunk

batch_chunk(
    texts: restricted_iterable(str),
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    base_metadata: dict[str, Any] | None = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Any, None, None]

Processes a batch of texts in parallel, splitting each into chunks. Leverages multiprocessing for efficient batch chunking.

If a task fails, chunklet stops processing and yields the results of the tasks that completed successfully, preventing wasted work (see on_errors for alternative behaviors).

Parameters:

  • texts

    (restricted_iterable[str]) –

    A restricted iterable of input texts to be chunked.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".

  • max_tokens

    (int, default: None ) –

    Maximum number of tokens per chunk. Must be >= 12.

  • max_sentences

    (int, default: None ) –

    Maximum number of sentences per chunk. Must be >= 1.

  • max_section_breaks

    (int, default: None ) –

    Maximum number of section breaks per chunk. Must be >= 1.

  • overlap_percent

    (int | float, default: 20 ) –

    Percentage of overlap between chunks (0-75). Defaults to 20.

  • offset

    (int, default: 0 ) –

    Starting sentence offset for chunking. Defaults to 0.

  • token_counter

    (callable, default: None ) –

    The token counting function. Required if max_tokens is set.

  • separator

    (Any, default: None ) –

    A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.

  • base_metadata

    (dict[str, Any], default: None ) –

    Optional dictionary to be included with each chunk.

  • n_jobs

    (int | None, default: None ) –

    Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.

  • show_progress

    (bool, default: True ) –

    Flag to show or disable the loading bar.

  • on_errors

    (Literal['raise', 'skip', 'break'], default: 'raise' ) –

    How to handle errors during processing. Defaults to 'raise'.

Yields:

  • Any –

    A Box object containing the chunk content and metadata, or the separator object if one was provided.

Raises:

  • InvalidInputError

    If texts is not an iterable of strings, or if n_jobs is less than 1.

  • MissingTokenCounterError

    If max_tokens is provided but no token_counter is provided.

  • CallbackError

    If an error occurs during sentence splitting or token counting within a chunking task.

Source code in src/chunklet/plain_text_chunker.py
@validate_input
def batch_chunk(
    self,
    texts: restricted_iterable(str),
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    base_metadata: dict[str, Any] | None = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Any, None, None]:
    """
    Processes a batch of texts in parallel, splitting each into chunks.
    Leverages multiprocessing for efficient batch chunking.

    If a task fails, `chunklet` stops processing and yields the results of the
    tasks that completed successfully, preventing wasted work (see `on_errors`).

    Args:
        texts (restricted_iterable[str]): A restricted iterable of input texts to be chunked.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens (int, optional): Maximum number of tokens per chunk. Must be >= 12.
        max_sentences (int, optional): Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks (int, optional): Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent (int | float): Percentage of overlap between chunks (0-75).
        offset (int): Starting sentence offset for chunking. Defaults to 0.
        token_counter (callable, optional): The token counting function.
            Required if `max_tokens` is set.
        separator (Any): A value to be yielded after the chunks of each text are processed.
            Note: None cannot be used as a separator.
        base_metadata (dict[str, Any], optional): Optional dictionary to be included with each chunk.
        n_jobs (int | None): Number of parallel workers to use. If None, uses all available CPUs.
            Must be >= 1 if specified.
        show_progress (bool): Flag to show or disable the loading bar.
        on_errors (Literal["raise", "skip", "break"]): How to handle errors during processing.
            Defaults to 'raise'.

    Yields:
        Any: A `Box` object containing the chunk content and metadata, or any separator object.

    Raises:
        InvalidInputError: If `texts` is not an iterable of strings, or if `n_jobs` is less than 1.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If an error occurs during sentence splitting
            or token counting within a chunking task.
    """
    chunk_func = partial(
        self.chunk,
        lang=lang,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        overlap_percent=overlap_percent,
        max_section_breaks=max_section_breaks,
        offset=offset,
        base_metadata=base_metadata,
        token_counter=token_counter or self.token_counter,
    )

    yield from run_in_batch(
        func=chunk_func,
        iterable_of_args=texts,
        iterable_name="texts",
        n_jobs=n_jobs,
        show_progress=show_progress,
        on_errors=on_errors,
        separator=separator,
        verbose=self.verbose,
    )

chunk

chunk(
    text: str,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    base_metadata: dict[str, Any] | None = None
) -> list[Box]

Chunks a single text into smaller pieces based on specified parameters. Supports flexible constraint-based chunking, clause-level overlap, and custom token counters.

Parameters:

  • text

    (str) –

    The input text to chunk.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".

  • max_tokens

    (int, default: None ) –

    Maximum number of tokens per chunk. Must be >= 12.

  • max_sentences

    (int, default: None ) –

    Maximum number of sentences per chunk. Must be >= 1.

  • max_section_breaks

    (int, default: None ) –

    Maximum number of section breaks per chunk. Must be >= 1.

  • overlap_percent

    (int | float, default: 20 ) –

    Percentage of overlap between chunks (0-75). Defaults to 20.

  • offset

    (int, default: 0 ) –

    Starting sentence offset for chunking. Defaults to 0.

  • token_counter

    (callable, default: None ) –

    Optional token counting function. Required for token-based modes only.

  • base_metadata

    (dict[str, Any], default: None ) –

    Optional dictionary to be included with each chunk.

Returns:

  • list[Box]

    list[Box]: A list of Box objects, each containing the chunk content and metadata.

Raises:

  • InvalidInputError

    If any chunking configuration parameter is invalid.

  • MissingTokenCounterError

    If max_tokens is provided but no token_counter is provided.

  • CallbackError

    If an error occurs during sentence splitting or token counting within a chunking task.
Source code in src/chunklet/plain_text_chunker.py
@validate_input
def chunk(
    self,
    text: str,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    base_metadata: dict[str, Any] | None = None,
) -> list[Box]:
    """
    Chunks a single text into smaller pieces based on specified parameters.
    Supports flexible constraint-based chunking, clause-level overlap,
    and custom token counters.

    Args:
        text (str): The input text to chunk.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens (int, optional): Maximum number of tokens per chunk. Must be >= 12.
        max_sentences (int, optional): Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks (int, optional): Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent (int | float): Percentage of overlap between chunks (0-75). Defaults to 20.
        offset (int): Starting sentence offset for chunking. Defaults to 0.
        token_counter (callable, optional): Optional token counting function.
            Required for token-based modes only.
        base_metadata (dict[str, Any], optional): Optional dictionary to be included with each chunk.

    Returns:
        list[Box]: A list of `Box` objects, each containing the chunk content and metadata.

    Raises:
        InvalidInputError: If any chunking configuration parameter is invalid.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If an error occurs during sentence splitting or token counting within a chunking task.
    """
    # Validate that at least one limit is provided
    if not any((max_tokens, max_sentences, max_section_breaks)):
        raise InvalidInputError(
            "At least one of 'max_tokens', 'max_sentences', or 'max_section_break' must be provided."
        )

    # If token_counter is required but not provided
    if max_tokens is not None and not (token_counter or self.token_counter):
        raise MissingTokenCounterError()

    if self.verbose:
        logger.info(
            "Starting chunk processing for text starting with: {}.",
            f"{text[:100]}...",
        )

    # Adjust limits for _group_by_chunk's internal use
    if max_tokens is None:
        max_tokens = sys.maxsize
    if max_sentences is None:
        max_sentences = sys.maxsize
    if max_section_breaks is None:
        max_section_breaks = sys.maxsize

    if not text.strip():
        if self.verbose:
            logger.info("Input text is empty. Returning empty list.")
        return []

    try:
        sentences = self.sentence_splitter.split(
            text,
            lang,
        )
    except Exception as e:
        raise CallbackError(
            f"An error occurred during the sentence splitting process.\nDetails: {e}\n"
            "💡 Hint: This may be due to an issue with the underlying sentence splitting library."
        ) from e

    if not sentences:
        return []

    offset = round(offset)
    if offset >= len(sentences):
        logger.warning(
            "Offset {} >= total sentences {}. Returning empty list.",
            offset,
            len(sentences),
        )
        return []

    chunks = self._group_by_chunk(
        sentences[offset:],
        token_counter=token_counter or self.token_counter,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        max_section_breaks=max_section_breaks,
        overlap_percent=overlap_percent,
    )

    if base_metadata is None:
        base_metadata = {}

    return self._create_chunk_boxes(chunks, base_metadata, text)