chunklet.document_chunker.processors.docx_processor

Classes:

  • DocxProcessor

    Processor class for extracting text and metadata from DOCX files.

DocxProcessor

DocxProcessor(file_path: str)

Bases: BaseProcessor

Processor class for extracting text and metadata from DOCX files.

The document is first converted to HTML with mammoth, each image is replaced with a numbered placeholder, and the resulting HTML is then converted to Markdown.

Metadata is read from the document's core properties, which combine Open Packaging Conventions (OPC) properties with elements that align with the Dublin Core standard.

For more details on the DOCX core properties processed, refer to the python-docx documentation: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/coreprops.html

Methods:

  • extract_metadata

    Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

  • extract_text

    Extracts the text content from the DOCX file in Markdown format.

Source code in src/chunklet/document_chunker/processors/base_processor.py
def __init__(self, file_path: str):
    """
    Initializes the processor with the path to the document.

    Args:
        file_path (str): Path to the document file.
    """
    self.file_path = file_path
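
As a usage illustration, the sketch below shows one way a caller might combine the two methods. `DocumentProcessor` and `preview_document` are hypothetical names introduced here, not part of chunklet; the helper only assumes an object that exposes `extract_metadata()` and `extract_text()` as documented on this page. The commented lines show how the real class would presumably be passed in, with `report.docx` as a placeholder path.

```python
from typing import Any, Iterator, Protocol


class DocumentProcessor(Protocol):
    """Hypothetical structural type: any object with the two documented methods."""

    def extract_metadata(self) -> dict[str, Any]: ...

    def extract_text(self) -> Iterator[str]: ...


def preview_document(processor: DocumentProcessor, max_blocks: int = 2) -> dict[str, Any]:
    """Collect the metadata and the first few text blocks from a processor."""
    blocks: list[str] = []
    for block in processor.extract_text():
        if len(blocks) >= max_blocks:
            break  # stop early without draining the rest of the generator
        blocks.append(block)
    return {"metadata": processor.extract_metadata(), "blocks": blocks}


# With chunklet installed, the real class would be used like this
# ("report.docx" is a placeholder path):
#
#   from chunklet.document_chunker.processors.docx_processor import DocxProcessor
#   preview = preview_document(DocxProcessor("report.docx"))
#   print(preview["metadata"].get("title"))
```

Keeping the helper duck-typed means the same function should work with any other processor that follows the same two-method interface.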

extract_metadata

extract_metadata() -> dict[str, Any]

Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

Returns:

  • dict[str, Any]

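    A dictionary that always contains a "source" key (the file path as a string), plus each of the following core properties that is set on the document, converted to a string. Empty properties are omitted.

      • title
      • author
      • publisher
      • last_modified_by
      • created
      • modified
      • rights
      • version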

Source code in src/chunklet/document_chunker/processors/docx_processor.py
def extract_metadata(self) -> dict[str, Any]:
    """Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.

    Returns:
        dict[str, Any]: A dictionary containing metadata fields:
            - title
            - author
            - publisher
            - last_modified_by
            - created
            - modified
            - rights
            - version
    """
    try:
        from docx import Document
    except ImportError as e:
        raise ImportError(
            "The 'python-docx' library is not installed. "
            "Install it with `pip install 'python-docx>=1.2.0'`, or install the document "
            "processing extras with `pip install 'chunklet-py[document]'`."
        ) from e

    doc = Document(self.file_path)
    props = doc.core_properties
    metadata = {"source": str(self.file_path)}
    for field in self.METADATA_FIELDS:
        value = getattr(props, field, "")
        if value:
            metadata[field] = str(value)
    return metadata
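
The filtering loop above can be tried in isolation. The following is a minimal sketch, not the library's code path: `collect_metadata` and `fake_props` are hypothetical names, a `SimpleNamespace` stands in for python-docx's core properties object, and `METADATA_FIELDS` is assumed to match the field list in the docstring.

```python
from datetime import datetime
from types import SimpleNamespace

# Assumption: mirrors the documented field list; the real class attribute may differ.
METADATA_FIELDS = [
    "title", "author", "publisher", "last_modified_by",
    "created", "modified", "rights", "version",
]


def collect_metadata(props, source: str) -> dict[str, str]:
    """Copy non-empty properties into a dict, stringifying each value."""
    metadata = {"source": source}
    for field in METADATA_FIELDS:
        value = getattr(props, field, "")  # missing attribute -> treated as empty
        if value:                          # skips "", None and other falsy values
            metadata[field] = str(value)
    return metadata


# Stand-in for doc.core_properties: author is empty, several fields are absent.
fake_props = SimpleNamespace(
    title="Quarterly Report",
    author="",
    created=datetime(2024, 1, 15, 9, 30),
    version="2",
)

result = collect_metadata(fake_props, "report.docx")
print(result)
```

Only `source`, `title`, `created`, and `version` survive in `result`: the empty `author` and the absent fields are dropped, and the `datetime` is stored as its string form rather than as a `datetime` object.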

extract_text

extract_text() -> Generator[str, None, None]

Extracts the text content from the DOCX file in Markdown format.

Each image is replaced with the placeholder "[Image - num]", where num is a running counter that starts at 1. Text is yielded in blocks of up to 8 paragraphs each.

Yields:

  • str

    A block of Markdown text containing up to 8 paragraphs.

Source code in src/chunklet/document_chunker/processors/docx_processor.py
def extract_text(self) -> Generator[str, None, None]:
    """Extracts the text content from the DOCX file in Markdown format.

    Each image is replaced with the placeholder "[Image - num]", where num
    is a running counter that starts at 1. Text is yielded in blocks of up
    to 8 paragraphs each.

    Yields:
        str: A block of Markdown text containing up to 8 paragraphs.
    """
    try:  # Lazy import
        import mammoth
    except ImportError as e:
        raise ImportError(
            "The 'mammoth' library is not installed. "
            "Install it with `pip install 'mammoth>=1.9.0'`, or install the document "
            "processing extras with `pip install 'chunklet-py[document]'`."
        ) from e

    count = 0

    def placeholder_images(image):
        """Replace all images with a placeholder text."""
        nonlocal count
        count += 1
        return [mammoth.html.text(f"[Image - {count}]")]

    with open(self.file_path, "rb") as docx_file:
        # Convert DOCX to HTML first
        result = mammoth.convert_to_html(
            docx_file, convert_image=placeholder_images
        )
        html_content = result.value

    # Now we can convert it to markdown
    markdown_content = html_to_md(raw_text=html_content)

    # Chunk its paragraphs into groups of 8 for faster processing.
    paragraphs = markdown_content.split("\n\n")
    for paragraph_chunk in chunked(paragraphs, 8):
        yield "\n\n".join(paragraph_chunk)
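
The grouping step at the end can be sketched with the standard library alone. This assumes `chunked` behaves like `more_itertools.chunked` (successive lists of at most `n` items); the local `chunked` and `group_paragraphs` below are stand-ins written for illustration, not the library's helpers.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items (stdlib stand-in)."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def group_paragraphs(markdown: str, size: int = 8) -> Iterator[str]:
    """Split Markdown on blank lines and yield blocks of up to `size` paragraphs."""
    paragraphs = markdown.split("\n\n")
    for group in chunked(paragraphs, size):
        yield "\n\n".join(group)


# 20 paragraphs are grouped as 8 + 8 + 4.
sample = "\n\n".join(f"Paragraph {i}" for i in range(1, 21))
blocks = list(group_paragraphs(sample))
print(len(blocks))
```

Because the text is split and rejoined on the same `"\n\n"` separator, joining the yielded blocks with `"\n\n"` reproduces the original Markdown. The last block is simply shorter when the paragraph count is not a multiple of 8.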