chunklet.document_chunker.processors.docx_processor

Classes:

  • DOCXProcessor

    Processor class for extracting text and metadata from DOCX files.

DOCXProcessor

DOCXProcessor(file_path: str)

Bases: BaseProcessor

Processor class for extracting text and metadata from DOCX files.

The document is first converted to HTML, each image is replaced with a numbered placeholder, and the result is then converted to Markdown.

Metadata is read from the document's core properties, which mix Open Packaging Conventions (OPC) properties with elements that align with the Dublin Core standard.

For more details on the DOCX core properties processed, refer to the python-docx documentation: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/coreprops.html
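To make the mix of OPC and Dublin Core elements concrete, here is a minimal standard-library sketch that reads a few core properties straight from the DOCX package. It is not how DOCXProcessor works (the class uses python-docx); it assumes the core properties part sits at the conventional path docProps/core.xml, and the CORE_TAGS selection and the read_core_properties name are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside the core properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical selection of properties, chosen for illustration only.
CORE_TAGS = {
    "title": "dc:title",                      # Dublin Core element
    "author": "dc:creator",                   # Dublin Core element
    "last_modified_by": "cp:lastModifiedBy",  # OPC property
    "created": "dcterms:created",             # Dublin Core term
    "modified": "dcterms:modified",           # Dublin Core term
}


def read_core_properties(docx_file) -> dict[str, str]:
    """Return the non-empty core properties found in a DOCX package."""
    # Assumption: the part lives at the conventional path docProps/core.xml.
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    found = {}
    for name, tag in CORE_TAGS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[name] = element.text
    return found
```

The function accepts a path or a binary file object, since zipfile.ZipFile handles both.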

Methods:

  • extract_metadata

    Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

  • extract_text

    Extracts text content from DOCX file in Markdown format, yielding chunks for efficient processing.
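A minimal usage sketch for the two methods listed above. The summarize helper is hypothetical and works with any object exposing extract_metadata and extract_text; the commented lines show how it might be called with DOCXProcessor, assuming chunklet and its document extras are installed and that report.docx (a placeholder path) exists.

```python
from typing import Any


def summarize(processor) -> dict[str, Any]:
    """Collect the metadata and basic chunk statistics from a processor."""
    metadata = processor.extract_metadata()
    # extract_text is a generator, so the chunks are consumed lazily here.
    chunk_sizes = [len(chunk) for chunk in processor.extract_text()]
    return {
        "metadata": metadata,
        "chunks": len(chunk_sizes),
        "characters": sum(chunk_sizes),
    }


# With chunklet installed, a call might look like this (path is a placeholder):
# from chunklet.document_chunker.processors.docx_processor import DOCXProcessor
# print(summarize(DOCXProcessor("report.docx")))
```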

Source code in src/chunklet/document_chunker/processors/base_processor.py
def __init__(self, file_path: str):
    """
    Initializes the processor with the path to the document.

    Args:
        file_path (str): Path to the document file.
    """
    self.file_path = file_path

extract_metadata

extract_metadata() -> dict[str, Any]

Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

Returns:

  • dict[str, Any]

    A dictionary holding the file path under "source", plus whichever of the following fields have a non-empty value, each converted to a string:

    - title
    - author
    - publisher
    - last_modified_by
    - created
    - modified
    - rights
    - version
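The shape of the returned dictionary follows from how empty values are handled. The sketch below reproduces that filtering on a stand-in object; FIELDS mirrors the documented field list (the contents of the real METADATA_FIELDS attribute are not shown on this page), and collect_metadata is a hypothetical name.

```python
from datetime import datetime
from types import SimpleNamespace

# Mirrors the documented field list; assumed, not copied from the class.
FIELDS = (
    "title", "author", "publisher", "last_modified_by",
    "created", "modified", "rights", "version",
)


def collect_metadata(props, source: str) -> dict[str, str]:
    """Keep only non-empty properties and convert each one to a string."""
    metadata = {"source": source}
    for field in FIELDS:
        value = getattr(props, field, "")  # a missing attribute becomes ""
        if value:  # skips "", None and other falsy values
            metadata[field] = str(value)
    return metadata


# Stand-in for the core properties object of a document.
props = SimpleNamespace(
    title="Report", author="", version=None, created=datetime(2024, 1, 2)
)
print(collect_metadata(props, "report.docx"))
# {'source': 'report.docx', 'title': 'Report', 'created': '2024-01-02 00:00:00'}
```

Callers should therefore treat every key except "source" as optional.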

Source code in src/chunklet/document_chunker/processors/docx_processor.py
def extract_metadata(self) -> dict[str, Any]:
    """Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

    Returns:
        dict[str, Any]: A dictionary containing metadata fields:
            - title
            - author
            - publisher
            - last_modified_by
            - created
            - modified
            - rights
            - version
    """
    try:
        from docx import Document
    except ImportError as e:
        raise ImportError(
            "The 'python-docx' library is not installed. "
            "Please install it with 'pip install 'python-docx>=1.2.0'' or install the document processing extras "
            "with 'pip install 'chunklet-py[document]''"
        ) from e

    doc = Document(self.file_path)
    props = doc.core_properties
    metadata = {"source": str(self.file_path)}
    for field in self.METADATA_FIELDS:
        value = getattr(props, field, "")
        if value:
            metadata[field] = str(value)
    return metadata

extract_text

extract_text() -> Generator[str, None, None]

Extracts text content from DOCX file in Markdown format, yielding chunks for efficient processing.

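Images are replaced with a numbered placeholder of the form "[Image - num]". Text is yielded in chunks of approximately 4000 characters each to simulate pages and enhance parallel execution. Chunks are assembled from whole paragraphs, so a single paragraph longer than 4000 characters is yielded as one oversized chunk.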
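The placeholder numbering relies on a counter shared across calls of a nested callback. The sketch below shows the same closure pattern on a plain HTML string using a regular expression; this is a stand-in for illustration, not the conversion hook the method itself uses, and number_images is a hypothetical name.

```python
import re


def number_images(html: str) -> str:
    """Replace every <img> tag with a sequentially numbered placeholder."""
    count = 0

    def placeholder(match: re.Match) -> str:
        # nonlocal lets the nested function update the enclosing counter.
        nonlocal count
        count += 1
        return f"[Image - {count}]"

    return re.sub(r"<img\b[^>]*>", placeholder, html)


print(number_images('<p>a<img src="x.png"/>b<img src="y.png">c</p>'))
# <p>a[Image - 1]b[Image - 2]c</p>
```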

Yields:

  • str –

    A chunk of text, approximately 4000 characters each.
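The paragraph accumulation can be tried in isolation with a small limit. This standalone sketch follows the same greedy grouping as the method; group_paragraphs is a hypothetical name, and max_size is exposed as a parameter only to make the demonstration short.

```python
from collections.abc import Iterator


def group_paragraphs(text: str, max_size: int = 4000) -> Iterator[str]:
    """Greedily group blank-line-separated paragraphs into chunks."""
    current: list[str] = []
    size = 0
    for paragraph in text.split("\n\n"):
        # Flush before adding a paragraph that would push past the limit.
        if current and size + len(paragraph) > max_size:
            yield "\n\n".join(current)
            current, size = [], 0
        current.append(paragraph)
        size += len(paragraph)  # separators are not counted
    if current:
        yield "\n\n".join(current)


sample = "\n\n".join(["aaaa", "bbbb", "cccccccccccc", "dd"])
print(list(group_paragraphs(sample, max_size=10)))
# ['aaaa\n\nbbbb', 'cccccccccccc', 'dd']
```

The 12-character paragraph exceeds the limit of 10 but is still emitted whole, which is why chunk sizes are only approximate.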

Source code in src/chunklet/document_chunker/processors/docx_processor.py
def extract_text(self) -> Generator[str, None, None]:
    """Extracts text content from DOCX file in Markdown format, yielding chunks for efficient processing.

    Images are replaced with a placeholder "[Image - num]".
    Text is yielded in chunks of approximately 4000 characters each to simulate pages and enhance parallel execution.

    Yields:
        str: A chunk of text, approximately 4000 characters each.
    """
    try:  # Lazy import
        import mammoth
    except ImportError as e:
        raise ImportError(
            "The 'mammoth' library is not installed. "
            "Please install it with 'pip install 'mammoth>=1.9.0'' or install the document processing extras "
            "with 'pip install 'chunklet-py[document]''"
        ) from e

    count = 0

    def placeholder_images(image):
        """Replace all images with a placeholder text."""
        nonlocal count
        count += 1
        return [mammoth.html.text(f"[Image - {count}]")]

    with open(self.file_path, "rb") as docx_file:
        # Convert DOCX to HTML first
        result = mammoth.convert_to_html(
            docx_file, convert_image=placeholder_images
        )
        markdown_content = html_to_md(raw_text=result.value)

    # Split into paragraphs and accumulate by character count (~4000 chars per chunk)
    curr_chunk = []
    curr_size = 0
    max_size = 4000

    for paragraph in markdown_content.split("\n\n"):
        para_len = len(paragraph)

        # If adding this paragraph would exceed the limit, yield current chunk
        if curr_size + para_len > max_size and curr_chunk:
            yield "\n\n".join(curr_chunk)
            curr_chunk = []
            curr_size = 0

        curr_chunk.append(paragraph)
        curr_size += para_len

    # Yield any remaining content
    if curr_chunk:
        yield "\n\n".join(curr_chunk)