chunklet.document_chunker.document_chunker

Classes:

  • DocumentChunker

    A comprehensive document chunker that handles various file formats.

DocumentChunker

DocumentChunker(
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
)

A comprehensive document chunker that handles various file formats.

This class provides a high-level interface to chunk text from different document types. It automatically detects the file format and uses the appropriate method to extract content before passing it to an underlying PlainTextChunker instance.

Key Features:

  • Multi-Format Support: Chunks text from PDF, TXT, MD, and RST files.

  • Metadata Enrichment: Automatically adds the source file path and other document-level metadata (e.g., PDF page numbers) to each chunk.

  • Bulk Processing: Efficiently chunks multiple documents in a single call.

  • Pluggable Document Processors: Integrate custom processors that define the extraction logic for specific file types.
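
For orientation, a minimal usage sketch; the file name and the word-count token counter are illustrative stand-ins, and the import path simply mirrors the module path of this page:

from pathlib import Path

from chunklet.document_chunker.document_chunker import DocumentChunker

# Illustrative token counter: a plain word count standing in for a real tokenizer.
def count_words(text: str) -> int:
    return len(text.split())

chunker = DocumentChunker(token_counter=count_words, verbose=True)

# Chunk a single Markdown file; each returned Box carries chunk content and metadata.
chunks = chunker.chunk(Path("notes.md"), max_tokens=128, overlap_percent=20)
print(len(chunks))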

Initializes the DocumentChunker.

Parameters:

  • sentence_splitter

    (BaseSplitter | None, default: None ) –

    An optional BaseSplitter instance. If None, a default SentenceSplitter will be initialized.

  • verbose

    (bool, default: False ) –

    Enable verbose logging.

  • continuation_marker

    (str, default: '...' ) –

    The marker to prepend to unfitted clauses. Defaults to '...'.

  • token_counter

    (Callable[[str], int] | None, default: None ) –

    Function that counts tokens in text. If None, a token counter must be provided when calling the chunk() methods (see the sketch below).

Raises:

  • InvalidInputError

    If any of the input arguments are invalid or if the provided sentence_splitter is not an instance of BaseSplitter.
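
Because the token counter is optional at construction time, it can instead be supplied per call, as this sketch shows (file name hypothetical):

chunker = DocumentChunker()  # no token counter configured up front
chunks = chunker.chunk(
    "report.txt",
    max_tokens=64,
    token_counter=lambda text: len(text.split()),  # illustrative counter
)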

Methods:

  • batch_chunk

    Chunks multiple documents from a list of file paths.

  • chunk

    Chunks a single document from a given path.

Attributes:

  • supported_extensions

    Get the supported extensions, including the custom ones.

  • verbose

    Get the verbosity status.

Source code in src/chunklet/document_chunker/document_chunker.py
def __init__(
    self,
    sentence_splitter: Any | None = None,
    verbose: bool = False,
    continuation_marker: str = "...",
    token_counter: Callable[[str], int] | None = None,
):
    """
    Initializes the DocumentChunker.

    Args:
        sentence_splitter (BaseSplitter | None): An optional BaseSplitter instance.
            If None, a default SentenceSplitter will be initialized.
        verbose (bool): Enable verbose logging.
        continuation_marker (str): The marker to prepend to unfitted clauses. Defaults to '...'.
        token_counter (Callable[[str], int] | None): Function that counts tokens in text.
            If None, a token counter must be provided when calling the chunk() methods.

    Raises:
        InvalidInputError: If any of the input arguments are invalid or if the provided `sentence_splitter` is not an instance of `BaseSplitter`.
    """
    self._verbose = verbose
    self.token_counter = token_counter
    self.continuation_marker = continuation_marker

    # Explicit type validation for sentence_splitter
    if sentence_splitter is not None and not isinstance(
        sentence_splitter, BaseSplitter
    ):
        raise InvalidInputError(
            f"The provided sentence_splitter must be an instance of BaseSplitter, "
            f"but got {type(sentence_splitter).__name__}."
        )

    self.plain_text_chunker = PlainTextChunker(
        sentence_splitter=sentence_splitter,
        verbose=self._verbose,
        continuation_marker=self.continuation_marker,
        token_counter=self.token_counter,
    )

    self.processors = {
        ".pdf": pdf_processor.PDFProcessor,
        ".epub": epub_processor.EpubProcessor,
        ".docx": docx_processor.DocxProcessor,
    }
    self.converters = {
        ".html": html_2_md.html_to_md,
        ".hml": html_2_md.html_to_md,
        ".rst": rst_2_md.rst_to_md,
        ".tex": latex_2_md.latex_to_md,
    }
    self.processor_registry = CustomProcessorRegistry()

supported_extensions property

supported_extensions

Get the supported extensions, including the custom ones.

verbose property writable

verbose: bool

Get the verbosity status.

batch_chunk

batch_chunk(
    paths: restricted_iterable(str | Path),
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Box, None, None]

Chunks multiple documents from a list of file paths.

This method is a memory-efficient generator that yields chunks as they are processed, without loading all documents into memory at once. It handles various file types.

Parameters:

  • paths

    (restricted_iterable[str | Path]) –

    A restricted iterable of paths to the document files.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".

  • max_tokens

    (int, default: None ) –

    Maximum number of tokens per chunk. Must be >= 12.

  • max_sentences

    (int, default: None ) –

    Maximum number of sentences per chunk. Must be >= 1.

  • max_section_breaks

    (int, default: None ) –

    Maximum number of section breaks per chunk. Must be >= 1.

  • overlap_percent

    (int | float, default: 20 ) –

    Percentage of overlap between chunks (0-75).

  • offset

    (int, default: 0 ) –

    Starting sentence offset for chunking. Defaults to 0.

  • token_counter

    (callable | None, default: None ) –

    Optional token counting function. Required if max_tokens is provided.

  • separator

    (Any, default: None ) –

    A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.

  • n_jobs

    (int | None, default: None ) –

    Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.

  • show_progress

    (bool, default: True ) –

    Flag to show or disable the loading bar.

  • on_errors

    (Literal['raise', 'skip', 'break'], default: 'raise' ) –

    How to handle errors during processing. Can be 'raise', 'skip', or 'break'.

Yields:

  • Box ( Box ) –

    A Box object representing a chunk with its content and metadata.

Raises:

  • InvalidInputError

    If the input arguments aren't valid.

  • FileNotFoundError

    If a provided file path is not found.

  • UnsupportedFileTypeError

    If the file extension is not supported or is missing.

  • MissingTokenCounterError

    If max_tokens is provided but no token_counter is provided.

  • CallbackError

    If a callback function (e.g., custom processor callbacks) fails during execution.
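
Putting the generator behaviour together, a hedged batch sketch; the file names are hypothetical, and the separator value and on_errors="skip" are deliberate choices rather than defaults:

chunker = DocumentChunker()

END = "---END-OF-DOCUMENT---"
stream = chunker.batch_chunk(
    ["report_q1.pdf", "handbook.docx", "roadmap.md"],
    max_sentences=10,
    overlap_percent=10,
    separator=END,          # yielded once after each document's chunks
    on_errors="skip",       # skip files that fail instead of raising
    show_progress=False,
)

for item in stream:
    if item == END:
        print("document boundary")
    else:
        print(item["metadata"])  # includes section_count and curr_section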

Source code in src/chunklet/document_chunker/document_chunker.py
@validate_input
def batch_chunk(
    self,
    paths: restricted_iterable(str | Path),
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Box, None, None]:
    """
    Chunks multiple documents from a list of file paths.

    This method is a memory-efficient generator that yields chunks as they
    are processed, without loading all documents into memory at once. It
    handles various file types.

    Args:
        paths (restricted_iterable[str | Path]): A restricted iterable of paths to the document files.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens (int, optional): Maximum number of tokens per chunk. Must be >= 12.
        max_sentences (int, optional): Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks (int, optional): Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent (int | float): Percentage of overlap between chunks (0-75).
        offset (int): Starting sentence offset for chunking. Defaults to 0.
        token_counter (callable | None): Optional token counting function.
            Required if `max_tokens` is provided.
        separator (Any): A value to be yielded after the chunks of each text are processed.
            Note: None cannot be used as a separator.

        n_jobs (int | None): Number of parallel workers to use. If None, uses all available CPUs.
               Must be >= 1 if specified.
        show_progress (bool): Flag to show or disable the loading bar.
        on_errors: How to handle errors during processing. Can be 'raise', 'skip', or 'break'.

    Yields:
        Box: A `Box` object representing a chunk with its content and metadata.

    Raises:
        InvalidInputError: If the input arguments aren't valid.
        FileNotFoundError: If a provided file path is not found.
        UnsupportedFileTypeError: If the file extension is not supported or is missing.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If a callback function (e.g., custom processor callbacks) fails during execution.
    """
    sentinel = object()

    # Validate all paths upfront
    success_count = 0
    validated_paths = []
    for i, path in enumerate(paths):
        path = Path(path)
        try:
            ext = self._validate_and_get_extension(path)
            validated_paths.append((path, ext, None))
            success_count += 1
        except Exception as e:
            validated_paths.append((path, None, e))

    gathered_data = self._gather_all_data(validated_paths, on_errors)

    all_chunks_gen = self.plain_text_chunker.batch_chunk(
        texts=gathered_data["all_texts_gen"],
        lang=lang,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        max_section_breaks=max_section_breaks,
        overlap_percent=overlap_percent,
        offset=offset,
        token_counter=token_counter or self.token_counter,
        separator=sentinel,
        n_jobs=n_jobs,
        show_progress=show_progress,
        on_errors=on_errors,
    )

    all_chunk_groups = split_at(all_chunks_gen, lambda x: x is sentinel)
    path_section_counts = gathered_data["path_section_counts"]
    all_metadata = gathered_data["all_metadata"]

    # HACK: Since a sentinel is always yielded at the end of the generator,
    # the last group produced will be an empty one.
    # The only work-around is to append a sentinel entry to paths as well.
    paths = list(path_section_counts.keys()) + [None]

    doc_count = 0
    curr_path = paths[0]
    for chunks in all_chunk_groups:
        if path_section_counts[curr_path] == 0:
            if separator is not None:
                yield separator

            doc_count += 1
            curr_path = paths[doc_count]
            if curr_path is None:
                return

        for i, ch in enumerate(chunks, start=1):
            doc_metadata = all_metadata[doc_count]
            doc_metadata["section_count"] = path_section_counts[curr_path]
            doc_metadata["curr_section"] = i

            ch["metadata"].update(doc_metadata)
            yield ch

        path_section_counts[curr_path] -= 1
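
The grouping above presumably relies on more_itertools.split_at: each sentinel closes one document's group, and the trailing sentinel produces one final empty group, which is exactly what the HACK comment works around. A standalone illustration of that behaviour:

from more_itertools import split_at

SENTINEL = object()
stream = ["doc1-chunk1", "doc1-chunk2", SENTINEL, "doc2-chunk1", SENTINEL]

groups = list(split_at(stream, lambda x: x is SENTINEL))
print(groups)
# [['doc1-chunk1', 'doc1-chunk2'], ['doc2-chunk1'], []]  <- note the trailing empty group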

chunk

chunk(
    path: str | Path,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[
        int | None, Field(ge=1)
    ] = None,
    max_section_breaks: Annotated[
        int | None, Field(ge=1)
    ] = None,
    overlap_percent: Annotated[
        int, Field(ge=0, le=75)
    ] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None
) -> list[Box]

Chunks a single document from a given path.

This method automatically detects the file type and uses the appropriate processor to extract text before chunking. It then adds document-level metadata to each resulting chunk.

Parameters:

  • path

    (str | Path) –

    The path to the document file.

  • lang

    (str, default: 'auto' ) –

    The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".

  • max_tokens

    (int, default: None ) –

    Maximum number of tokens per chunk. Must be >= 12.

  • max_sentences

    (int, default: None ) –

    Maximum number of sentences per chunk. Must be >= 1.

  • max_section_breaks

    (int, default: None ) –

    Maximum number of section breaks per chunk. Must be >= 1.

  • overlap_percent

    (int | float, default: 20 ) –

    Percentage of overlap between chunks (0-75).

  • offset

    (int, default: 0 ) –

    Starting sentence offset for chunking. Defaults to 0.

  • token_counter

    (callable | None, default: None ) –

    Optional token counting function. Required if max_tokens is provided.

Returns:

  • list[Box] –

    A list of Box objects, each representing a chunk with its content and metadata.

Raises:

  • InvalidInputError

    If the input arguments aren't valid.

  • FileNotFoundError

    If the provided file path is not found.

  • UnsupportedFileTypeError

    If the file extension is not supported or is missing.

  • MissingTokenCounterError

    If max_tokens is provided but no token_counter is provided.

  • CallbackError

    If a callback function (e.g., custom processor callbacks) fails during execution.
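
A further sketch using only the sentence and section limits, which require no token counter; the file name is hypothetical, and the metadata access mirrors how the source below merges document-level fields into each chunk:

chunker = DocumentChunker()

chunks = chunker.chunk(
    "guide.rst",
    max_sentences=12,
    max_section_breaks=2,
    overlap_percent=10,
)
for box in chunks:
    # Document-level fields (e.g., the source file path) are merged into each chunk's metadata.
    print(box["metadata"])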

Source code in src/chunklet/document_chunker/document_chunker.py
@validate_input
def chunk(
    self,
    path: str | Path,
    *,
    lang: str = "auto",
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_sentences: Annotated[int | None, Field(ge=1)] = None,
    max_section_breaks: Annotated[int | None, Field(ge=1)] = None,
    overlap_percent: Annotated[int, Field(ge=0, le=75)] = 20,
    offset: Annotated[int, Field(ge=0)] = 0,
    token_counter: Callable[[str], int] | None = None,
) -> list[Box]:
    """
    Chunks a single document from a given path.

    This method automatically detects the file type and uses the appropriate
    processor to extract text before chunking. It then adds document-level
    metadata to each resulting chunk.

    Args:
        path (str | Path): The path to the document file.
        lang (str): The language of the text (e.g., 'en', 'fr', 'auto'). Defaults to "auto".
        max_tokens (int, optional): Maximum number of tokens per chunk. Must be >= 12.
        max_sentences (int, optional): Maximum number of sentences per chunk. Must be >= 1.
        max_section_breaks (int, optional): Maximum number of section breaks per chunk. Must be >= 1.
        overlap_percent (int | float): Percentage of overlap between chunks (0-75).
        offset (int): Starting sentence offset for chunking. Defaults to 0.
        token_counter (callable | None): Optional token counting function.
            Required if `max_tokens` is provided.

    Returns:
        list[Box]: A list of `Box` objects, each representing
        a chunk with its content and metadata.

    Raises:
        InvalidInputError: If the input arguments aren't valid.
        FileNotFoundError: If the provided file path is not found.
        UnsupportedFileTypeError: If the file extension is not supported or is missing.
        MissingTokenCounterError: If `max_tokens` is provided but no `token_counter` is provided.
        CallbackError: If a callback function (e.g., custom processor callbacks) fails during execution.
    """
    path = Path(path)
    ext = self._validate_and_get_extension(path)

    text_content_or_generator, document_metadata = self._extract_data(path, ext)

    if not isinstance(text_content_or_generator, str):
        raise UnsupportedFileTypeError(
            f"File type '{ext}' is not supported by the general chunk method.\n"
            "Reason: The processor for this file returns iterable, "
            "so it must be processed in parallel for efficiency.\n"
            "💡 Hint: use `chunker.batch_chunk()` for this file type."
        )

    if self.verbose:
        logger.info("Starting chunk processing for path: {}.", path)

    text_content = text_content_or_generator

    # Process as a single block of text
    chunk_boxes = self.plain_text_chunker.chunk(
        text=text_content,
        lang=lang,
        max_tokens=max_tokens,
        max_sentences=max_sentences,
        max_section_breaks=max_section_breaks,
        overlap_percent=overlap_percent,
        offset=offset,
        token_counter=token_counter or self.token_counter,
        base_metadata=document_metadata,
    )

    if self.verbose:
        logger.info("Generated {} chunks for {}.", len(chunk_boxes), path)

    return chunk_boxes