chunklet.document_chunker._plain_text_chunker

Classes:

PlainTextChunker –

A powerful text chunking utility offering flexible strategies for optimal text segmentation.

PlainTextChunker

PlainTextChunker(
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
)

A powerful text chunking utility offering flexible strategies for optimal text segmentation.

Key Features

- Flexible Constraint-Based Chunking: Segment text by specifying limits on sentence count, token count, and section breaks, or any combination of them.
- Clause-Level Overlap: Ensures semantic continuity between chunks by overlapping at natural clause boundaries, with a customizable continuation marker.
- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Pluggable Token Counters: Integrate custom token counting functions (e.g., for specific LLM tokenizers).
- Parallel Processing: Efficiently handles batch chunking of multiple texts using multiprocessing.
- Memory-Friendly Batching: Yields chunks one at a time, reducing memory usage, especially for very large documents.

Initialize the PlainTextChunker.

Parameters:

sentence_splitter
(Any | None, default: None ) –

An optional BaseSplitter instance. If None, a default SentenceSplitter will be initialized.
verbose
(bool, default: False ) –

Enable verbose logging.
continuation_marker
(str, default: '...' ) –

The marker to prepend to unfitted clauses. Defaults to '...'.
token_counter
(Callable[[str], int] | None, default: None ) –

Function that counts tokens in text. If None, must be provided when calling chunk() methods.
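
As a rough illustration of what `token_counter` expects, any callable that takes a string and returns an integer fits the `Callable[[str], int]` signature. The two counters below are hypothetical stand-ins and not part of chunklet; a real setup would more likely wrap the tokenizer of the target LLM.

```python
import re
from typing import Callable


def whitespace_token_counter(text: str) -> int:
    """Count tokens as whitespace-separated words (a crude approximation)."""
    return len(text.split())


def word_and_punct_token_counter(text: str) -> int:
    """Count words and standalone punctuation marks as separate tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# Both functions satisfy the Callable[[str], int] shape.
counter: Callable[[str], int] = whitespace_token_counter
```

Either function could then be supplied as `token_counter`, at construction time or per call.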

Raises:

InvalidInputError –

If any of the input arguments are invalid or if the provided sentence_splitter is not an instance of BaseSplitter.

Methods:

batch_chunk –

Processes a batch of texts in parallel, splitting each into chunks.
chunk –

Chunks a single text into smaller pieces based on specified parameters.

Attributes:

verbose (bool) –

Get the verbosity status.

Source code in src/chunklet/document_chunker/_plain_text_chunker.py

@validate_input
def __init__(
    self,
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
):
    """
    Initialize the PlainTextChunker.

    Args:
        sentence_splitter: An optional BaseSplitter instance.
            If None, a default SentenceSplitter will be initialized.
        verbose: Enable verbose logging.
        continuation_marker: The marker to prepend to unfitted clauses. Defaults to '...'.
        token_counter: Function that counts tokens in text.
            If None, must be provided when calling chunk() methods.

    Raises:
        InvalidInputError: If any of the input arguments are invalid or if the provided `sentence_splitter` is not an instance of `BaseSplitter`.
    """
    self._verbose = verbose
    self.token_counter = token_counter
    self.continuation_marker = continuation_marker

    if sentence_splitter is not None and not isinstance(
        sentence_splitter, BaseSplitter
    ):
        raise InvalidInputError(
            f"The provided sentence_splitter must be an instance of BaseSplitter, "
            f"but got {type(sentence_splitter).__name__}."
        )

    self.sentence_splitter = sentence_splitter or SentenceSplitter()
    self.sentence_splitter.verbose = self._verbose

verbose `property` `writable`

verbose: bool

Get the verbosity status.

batch_chunk

batch_chunk(
    texts: IterableOfStr,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    base_metadata: dict[str, Any] | None = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Any, None, None]

Processes a batch of texts in parallel, splitting each into chunks. Leverages multiprocessing for efficient batch chunking.

If a task fails, chunklet will now stop processing and return the results of the tasks that completed successfully, preventing wasted work.

Parameters:

texts
(IterableOfStr) –

A non-string iterable of input texts to be chunked.
lang
(str, default: 'auto' ) –

The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
max_tokens
(Annotated[int | None, Field(ge=12)], default: None ) –

Maximum number of tokens per chunk. Must be >= 12.
max_sentences
(Annotated[int | None, Field(ge=1)], default: None ) –

Maximum number of sentences per chunk. Must be >= 1.
max_section_breaks
(Annotated[int | None, Field(ge=1)], default: None ) –

Maximum number of section breaks per chunk. Must be >= 1.
overlap_percent
(Annotated[int, Field(ge=0, le=75)], default: 20 ) –

Percentage of overlap between chunks (0-75). Defaults to 20.
offset
(Annotated[int, Field(ge=0)], default: 0 ) –

Starting sentence offset for chunking. Defaults to 0.
token_counter
(Callable[[str], int] | None, default: None ) –

The token counting function. Required if max_tokens is set.
separator
(Any, default: None ) –

A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.
base_metadata
(dict[str, Any] | None, default: None ) –

Optional dictionary to be included with each chunk.
n_jobs
(Annotated[int, Field(ge=1)] | None, default: None ) –

Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.
show_progress
(bool, default: True ) –

Flag to show or disable the loading bar.
on_errors
(Literal['raise', 'skip', 'break'], default: 'raise' ) –

How to handle errors during processing. Defaults to 'raise'.

Yields:

Any –

A DotDict object containing the chunk content and metadata, or any separator object.

Raises:

InvalidInputError –

If texts is not an iterable of strings, or if n_jobs is less than 1.
MissingTokenCounterError –

If max_tokens is provided but no token_counter is provided.
CallbackError –

If an error occurs during sentence splitting or token counting within a chunking task.
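
Because the generator interleaves chunks with the separator value, a caller usually has to regroup the stream per input text. The sketch below shows one way to do that. `fake_batch` is a hypothetical stand-in for the stream that `batch_chunk` yields (plain strings instead of `DotDict` objects), `SEPARATOR` is an arbitrary sentinel chosen for illustration, and the identity check assumes the separator object is yielded back unchanged.

```python
from typing import Any, Iterable, Iterator

SEPARATOR = object()  # hypothetical sentinel; None is not allowed as a separator


def group_by_separator(stream: Iterable[Any], separator: Any) -> Iterator[list[Any]]:
    """Collect items into one list per text, splitting on the separator."""
    current: list[Any] = []
    for item in stream:
        if item is separator:
            yield current
            current = []
        else:
            current.append(item)
    if current:  # trailing items with no closing separator
        yield current


def fake_batch() -> Iterator[Any]:
    """Stand-in for a batch_chunk-style stream: chunks, then a separator per text."""
    yield "text 1, chunk 1"
    yield "text 1, chunk 2"
    yield SEPARATOR
    yield "text 2, chunk 1"
    yield SEPARATOR


groups = list(group_by_separator(fake_batch(), SEPARATOR))
```

Grouping lazily like this keeps only one text's chunks in memory at a time, which matches the generator-based design described above.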

Source code in src/chunklet/document_chunker/_plain_text_chunker.py

@validate_input
def batch_chunk(
    self,
    texts: IterableOfStr,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    base_metadata: dict[str, Any] | None = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Any, None, None]:
    """
    Processes a batch of texts in parallel, splitting each into chunks.
    Leverages multiprocessing for efficient batch chunking.

    If a task fails, `chunklet` will now stop processing and return the results
    of the tasks that completed successfully, preventing wasted work.

    Args:
        texts: A non-string iterable of input texts to be chunked.
        lang: The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens: Maximum number of tokens per chunk. Must be >= 12.
        max_sentences: Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks: Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent: Percentage of overlap between chunks (0-75). Defaults to 20.
        offset: Starting sentence offset for chunking. Defaults to 0.
        token_counter: The token counting function.
            Required if `max_tokens` is set.
        separator: A value to be yielded after the chunks of each text are processed.
            Note: None cannot be used as a separator.
        base_metadata: Optional dictionary to be included with each chunk.
        n_jobs: Number of parallel workers to use. If None, uses all available CPUs.
            Must be >= 1 if specified.
        show_progress: Flag to show or disable the loading bar.
        on_errors: How to handle errors during processing.
            Defaults to 'raise'.

    Yields:
        A `DotDict` object containing the chunk content and metadata, or any separator object.

    Raises:
        InvalidInputError: If `texts` is not an iterable of strings, or if `n_jobs` is less than 1.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If an error occurs during sentence splitting
            or token counting within a chunking task.
    """
    chunk_func = partial(
        self.chunk,
        lang=lang,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        overlap_percent=overlap_percent,
        max_section_breaks=max_section_breaks,
        offset=offset,
        base_metadata=base_metadata,
        token_counter=token_counter or self.token_counter,
    )

    yield from run_in_batch(
        func=chunk_func,
        iterable_of_args=texts,
        iterable_name="texts",
        n_jobs=n_jobs,
        show_progress=show_progress,
        on_errors=on_errors,
        separator=separator,
        verbose=self.verbose,
    )

chunk

chunk(
    text: str,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    base_metadata: dict[str, Any] | None = None,
) -> list[DotDict]

Chunks a single text into smaller pieces based on specified parameters. Supports flexible constraint-based chunking, clause-level overlap, and custom token counters.

Parameters:

text
(str) –

The input text to chunk.
lang
(str, default: 'auto' ) –

The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
max_tokens
(Annotated[int | None, Field(ge=12)], default: None ) –

Maximum number of tokens per chunk. Must be >= 12.
max_sentences
(Annotated[int | None, Field(ge=1)], default: None ) –

Maximum number of sentences per chunk. Must be >= 1.
max_section_breaks
(Annotated[int | None, Field(ge=1)], default: None ) –

Maximum number of section breaks per chunk. Must be >= 1.
overlap_percent
(Annotated[int, Field(ge=0, le=75)], default: 20 ) –

Percentage of overlap between chunks (0-75). Defaults to 20.
offset
(Annotated[int, Field(ge=0)], default: 0 ) –

Starting sentence offset for chunking. Defaults to 0.
token_counter
(Callable[[str], int] | None, default: None ) –

Optional token counting function. Required for token-based modes only.
base_metadata
(dict[str, Any] | None, default: None ) –

Optional dictionary to be included with each chunk.

Returns:

list[DotDict] –

A list of DotDict objects, each containing the chunk content and metadata.

Raises:

InvalidInputError –

If any chunking configuration parameter is invalid.
MissingTokenCounterError –

If max_tokens is provided but no token_counter is provided.
CallbackError –

If an error occurs during sentence splitting or token counting within a chunking task.
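
To illustrate how several limits can apply at once, here is a minimal greedy sketch that closes a chunk as soon as the next sentence would exceed either the sentence limit or the token limit. It is not chunklet's implementation: it ignores overlap, section breaks, and clause-level splitting, it does not enforce the `>= 12` bound on `max_tokens`, and `group_sentences` and `count_tokens` are hypothetical names. Treating an unset limit as `sys.maxsize` mirrors the approach in the source listing.

```python
import sys
from typing import Callable, Optional


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words."""
    return len(text.split())


def group_sentences(
    sentences: list[str],
    max_sentences: Optional[int] = None,
    max_tokens: Optional[int] = None,
    token_counter: Callable[[str], int] = count_tokens,
) -> list[list[str]]:
    """Greedily pack sentences into chunks under both limits.

    A single sentence longer than max_tokens still becomes its own chunk.
    """
    # An unset limit becomes effectively unbounded.
    max_sentences = sys.maxsize if max_sentences is None else max_sentences
    max_tokens = sys.maxsize if max_tokens is None else max_tokens

    chunks: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = token_counter(sentence)
        too_many_sentences = len(current) + 1 > max_sentences
        too_many_tokens = current_tokens + n > max_tokens
        # Close the current chunk if this sentence would break a limit.
        if current and (too_many_sentences or too_many_tokens):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
by_count = group_sentences(sentences, max_sentences=2)
by_tokens = group_sentences(sentences, max_tokens=4)
```

Whichever limit is reached first ends the chunk, so combining limits only ever makes chunks smaller than either limit would alone.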

Source code in src/chunklet/document_chunker/_plain_text_chunker.py

@validate_input
def chunk(
    self,
    text: str,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    base_metadata: dict[str, Any] | None = None,
) -> list[DotDict]:
    """
    Chunks a single text into smaller pieces based on specified parameters.
    Supports flexible constraint-based chunking, clause-level overlap,
    and custom token counters.

    Args:
        text: The input text to chunk.
        lang: The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens: Maximum number of tokens per chunk. Must be >= 12.
        max_sentences: Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks: Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent: Percentage of overlap between chunks (0-75). Defaults to 20.
        offset: Starting sentence offset for chunking. Defaults to 0.
        token_counter: Optional token counting function.
            Required for token-based modes only.
        base_metadata: Optional dictionary to be included with each chunk.

    Returns:
        A list of `DotDict` objects, each containing the chunk content and metadata.

    Raises:
        InvalidInputError: If any chunking configuration parameter is invalid.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If an error occurs during sentence splitting or token counting within a chunking task.
    """
    self._validate_constraints(
        max_tokens, max_sentences, max_section_breaks, token_counter
    )

    log_info(
        self.verbose,
        "Starting chunk processing for text starting with: {}.",
        f"{text[:100]}...",
    )

    # Adjust limits for _group_by_chunk's internal use
    if max_tokens is None:
        max_tokens = sys.maxsize
    if max_sentences is None:
        max_sentences = sys.maxsize
    if max_section_breaks is None:
        max_section_breaks = sys.maxsize

    if not text.strip():
        log_info(self.verbose, "Input text is empty. Returning empty list.")
        return []

    try:
        sentences = self.sentence_splitter.split_text(
            text,
            lang,
        )
    except Exception as e:
        raise CallbackError(
            f"An error occurred during the sentence splitting process.\nDetails: {e}\n"
            "💡 Hint: This may be due to an issue with the underlying sentence splitting library."
        ) from e

    if not sentences:
        return []

    offset = round(offset)
    if offset >= len(sentences):
        logger.warning(
            "Offset {} >= total sentences {}. Returning empty list.",
            offset,
            len(sentences),
        )
        return []

    chunks = self._group_by_chunk(
        sentences[offset:],
        token_counter=token_counter or self.token_counter,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        max_section_breaks=max_section_breaks,
        overlap_percent=overlap_percent,
    )

    # Note: We use DeterministicSpanFinder because sentence splitter may modify text
    # (e.g., normalize whitespace, fix encoding), making exact span tracking difficult.
    span_finder = DeterministicSpanFinder(text)
    return self._create_chunks(chunks, base_metadata or {}, span_finder)
