chunklet.document_chunker.processors.odt_processor
Classes:
- ODTProcessor – ODT extraction and processing utility using odfpy.
ODTProcessor
Bases: BaseProcessor
ODT extraction and processing utility using odfpy.
Provides methods to extract text and metadata from ODT (OpenDocument Text) files and to split the extracted text into manageable chunks.
This processor extracts metadata from the ODT document's Dublin Core and OpenDocument standard properties.
For more details on ODF metadata fields and odfpy usage, refer to:
https://odfpy.readthedocs.io/en/latest/
Initialize the ODTProcessor.
Parameters:
- file_path (str) – Path to the ODT file.
Methods:
- extract_metadata – Extracts metadata from the ODT file, focusing on Dublin Core and OpenDocument fields.
- extract_text – Extracts text content from ODT paragraphs, yielding chunks for efficient processing.
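The constructor and the two methods listed above might be combined as in the following sketch. It assumes that `chunklet` is installed, that the import path matches the module name at the top of this page, and that both methods take no arguments. The helper `summarize_odt` is a hypothetical name introduced only for illustration.

```python
from typing import Any


def summarize_odt(file_path: str) -> dict[str, Any]:
    """Return the metadata and the number of text chunks for one ODT file."""
    # Imported inside the function so this sketch can be loaded
    # even where chunklet is not installed.
    from chunklet.document_chunker.processors.odt_processor import ODTProcessor

    processor = ODTProcessor(file_path)
    metadata = processor.extract_metadata()
    # extract_text is documented as yielding chunks, so consume it lazily
    # instead of building a list of every chunk.
    chunk_count = sum(1 for _ in processor.extract_text())
    return {"metadata": metadata, "chunk_count": chunk_count}
```

Calling `summarize_odt("report.odt")` (a placeholder path) would then return the metadata dictionary together with the number of chunks produced.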
Source code in src/chunklet/document_chunker/processors/odt_processor.py
extract_metadata
Extracts metadata from the ODT file, focusing on Dublin Core and OpenDocument fields.
Parses the document's metadata elements, extracting fields such as title, creator, initial_creator, created, chapter, and author_name.
Only fields present in the document are included in the returned dictionary.
Returns:
- dict[str, Any] – A dictionary containing the metadata fields: title, creator, initial_creator, created, chapter, author_name.
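To illustrate the "only present fields" behavior, here is a minimal sketch that reads the same kind of metadata with the Python standard library instead of odfpy. It is not the processor's implementation. It relies on an ODT file being a ZIP archive that contains a `meta.xml` part. The function name `read_odt_metadata` and the mapping from dictionary keys to XML elements are assumptions made for this example, and the chapter and author_name fields are left out because their source elements are not described here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by Dublin Core and OpenDocument metadata elements.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# Assumed mapping from returned keys to meta.xml elements.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "initial_creator": "meta:initial-creator",
    "created": "meta:creation-date",
}


def read_odt_metadata(source):
    """Return the metadata fields found in an ODT file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("meta.xml"))
    metadata = {}
    for key, tag in FIELDS.items():
        element = root.find(f".//{tag}", NS)
        # Skip fields that are missing or empty so only present ones are returned.
        if element is not None and element.text:
            metadata[key] = element.text.strip()
    return metadata
```

A document that defines only a title and an initial creator would produce a dictionary with just those two keys.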
extract_text
Extracts text content from ODT paragraphs, yielding chunks for efficient processing.
Iterates through the paragraph elements in the document, extracting their text and buffering it into chunks of approximately 4000 characters. Yielding the chunks one at a time keeps memory use low for large documents. Each chunk acts as a simulated page, so downstream steps can process chunks in parallel.
Yields:
- str – A chunk of text of approximately 4000 characters.
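The buffering described above can be sketched as a small generator. This is an illustration under stated assumptions, not the processor's code: it takes paragraph strings that have already been extracted, the name `buffer_paragraphs` is hypothetical, and joining paragraphs with newlines and flushing once the buffer reaches the limit are choices made for the example.

```python
from typing import Iterable, Iterator


def buffer_paragraphs(
    paragraphs: Iterable[str], max_chars: int = 4000
) -> Iterator[str]:
    """Group paragraph strings into chunks of roughly max_chars characters."""
    buffer: list[str] = []
    size = 0
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # ignore empty paragraphs
        buffer.append(text)
        size += len(text)
        # Flush once the buffered text reaches the target size.
        if size >= max_chars:
            yield "\n".join(buffer)
            buffer, size = [], 0
    # Emit whatever is left after the last paragraph.
    if buffer:
        yield "\n".join(buffer)
```

Because paragraphs are never split, a chunk can exceed the target by up to one paragraph, which matches the "approximately 4000 characters" wording.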