chunklet.document_chunker.processors.docx_processor
Classes:
- DocxProcessor – Processor class for extracting text and metadata from DOCX files.
DocxProcessor
Bases: BaseProcessor
Processor class for extracting text and metadata from DOCX files.
Text content is extracted, images are replaced with a placeholder, and the resulting text is converted to Markdown.
The extracted metadata is a mix of Open Packaging Conventions (OPC) core properties and elements that align with the Dublin Core standard.
For more details on the DOCX core properties processed, refer to the
python-docx documentation:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/coreprops.html
Methods:
- extract_metadata – Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.
- extract_text – Extracts the text content from the DOCX file in Markdown format.
Source code in src/chunklet/document_chunker/processors/base_processor.py
extract_metadata
Extracts core properties (a mix of OPC and Dublin Core elements) from the DOCX file.
Returns:
- dict[str, Any] – A dictionary containing the following metadata fields:
    - title
    - author
    - publisher
    - last_modified_by
    - created
    - modified
    - rights
    - version
Source code in src/chunklet/document_chunker/processors/docx_processor.py
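To illustrate where the mix of OPC and Dublin Core elements comes from, the following is a minimal standard-library sketch that reads a few core properties straight from the package. It is not the DocxProcessor implementation. The names `read_core_properties` and `FIELDS` are hypothetical, the sketch assumes the core properties part sits at its conventional path `docProps/core.xml`, and it covers only a subset of the fields listed above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Any

# XML namespaces used by the core properties part of an OPC package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical mapping from output keys to qualified element names.
# "dc:" and "dcterms:" entries are Dublin Core; "cp:" entries are OPC-specific.
FIELDS = {
    "title": "dc:title",
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "version": "cp:version",
}


def read_core_properties(docx_file) -> dict[str, Any]:
    """Return the non-empty core properties found in docProps/core.xml."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: the part is stored at its conventional location.
        root = ET.fromstring(package.read("docProps/core.xml"))
    metadata: dict[str, Any] = {}
    for key, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            metadata[key] = element.text
    return metadata


# Build a tiny in-memory package so the sketch runs without a real DOCX file.
CORE_XML = (
    "<cp:coreProperties"
    ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' xmlns:dcterms="http://purl.org/dc/terms/">'
    "<dc:title>Quarterly Report</dc:title>"
    "<dc:creator>Ada</dc:creator>"
    "<cp:lastModifiedBy>Grace</cp:lastModifiedBy>"
    "</cp:coreProperties>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("docProps/core.xml", CORE_XML)
buffer.seek(0)

sample_metadata = read_core_properties(buffer)
print(sample_metadata)
```

Properties missing from the file are simply left out of the result in this sketch; whether DocxProcessor omits them or returns empty values is not stated in this reference.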
extract_text
Extracts the text content from the DOCX file in Markdown format.
Images are replaced with a placeholder "[Image - num]". Text is yielded in blocks of approximately 10 paragraphs each.
Yields:
- str – A block of Markdown text containing approximately 10 paragraphs.
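The two behaviours described for extract_text, numbered image placeholders and yielding text in blocks of roughly 10 paragraphs, can be sketched with a small generator. This is only an illustration under stated assumptions, not the library's code: `replace_images` and `iter_blocks` are hypothetical helpers, images are assumed to appear in Markdown `![alt](src)` form, and placeholder numbering is assumed to start at 1.

```python
import re
from collections.abc import Iterable, Iterator

# Matches Markdown image syntax such as ![alt](path/to/image.png).
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def replace_images(markdown: str) -> str:
    """Swap each Markdown image for a numbered "[Image - n]" placeholder."""
    counter = 0

    def numbered(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[Image - {counter}]"

    return IMAGE_PATTERN.sub(numbered, markdown)


def iter_blocks(paragraphs: Iterable[str], block_size: int = 10) -> Iterator[str]:
    """Yield paragraphs joined into blocks of at most block_size paragraphs."""
    block: list[str] = []
    for paragraph in paragraphs:
        block.append(paragraph)
        if len(block) == block_size:
            yield "\n\n".join(block)
            block = []
    if block:  # flush the final, possibly shorter, block
        yield "\n\n".join(block)


# 23 paragraphs are grouped into blocks of 10, 10, and 3.
paragraphs = [f"Paragraph {i}" for i in range(1, 24)]
blocks = list(iter_blocks(paragraphs))
print(len(blocks))

print(replace_images("Before ![chart](chart.png) and ![logo](logo.png) after"))
```

Yielding blocks lazily means a caller can begin processing the first paragraphs of a large document without holding the full text in memory.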