chunklet.document_chunker.processors.epub_processor
Classes:
- EpubProcessor – Processor class for extracting text and metadata from EPUB files.
EpubProcessor
Bases: BaseProcessor
Processor class for extracting text and metadata from EPUB files.
Text content is extracted by concatenating the text from all HTML content documents within the EPUB container.
This processor focuses on extracting core metadata following the Dublin Core Metadata Initiative (DCMI) standard, which is the common practice in EPUB files. Not all available metadata fields are extracted.
For more details on EPUB metadata and the Dublin Core standard, refer to the
ebooklib tutorial:
https://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html
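To illustrate where this Dublin Core metadata lives inside an EPUB, here is a minimal standard-library sketch. It is not the EpubProcessor implementation (which appears to rely on ebooklib, given the tutorial link above). The helper names `build_sample_epub` and `read_dublin_core`, and the sample book contents, are hypothetical and exist only for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container file and by Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"

# The subset of Dublin Core fields listed in this reference page.
FIELDS = ("title", "creator", "contributor", "publisher", "date", "rights")

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF_XML = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:publisher>Example Press</dc:publisher>
  </metadata>
</package>"""


def build_sample_epub() -> bytes:
    """Build a tiny, metadata-only EPUB-like archive in memory (hypothetical data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("content.opf", OPF_XML)
    return buf.getvalue()


def read_dublin_core(epub_bytes: bytes) -> dict[str, list[str]]:
    """Collect the selected Dublin Core fields from the package document."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml points at the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf = ET.fromstring(zf.read(rootfile.attrib["full-path"]))

    result: dict[str, list[str]] = {}
    for field in FIELDS:
        values = [el.text.strip() for el in opf.iter(f"{{{DC_NS}}}{field}") if el.text]
        if values:  # fields absent from the book are simply omitted
            result[field] = values
    return result


metadata = read_dublin_core(build_sample_epub())
```

A field can occur more than once in a package document (for example, several creators), which is why this sketch stores each field as a list.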
Initializes the EpubProcessor with a path to the EPUB file and reads the EPUB book into memory.
Parameters:
- file_path (str) – Path to the EPUB file.
Methods:
- extract_metadata – Extracts Dublin Core metadata from the EPUB file.
- extract_text – Yields Markdown-converted text from all document items in the EPUB file.
Source code in src/chunklet/document_chunker/processors/epub_processor.py
extract_metadata
Extracts Dublin Core metadata from the EPUB file.
Returns:
- dict[str, Any] – A dictionary containing metadata fields:
  - title
  - creator
  - contributor
  - publisher
  - date
  - rights
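Because the values are typed as `Any`, code that consumes this dictionary may want to handle missing fields and differing value shapes defensively. The sketch below is one way to do that. The helper name `describe` is hypothetical, and the assumption that a value may be either a single string or a list of strings is ours, not something this page states.

```python
from typing import Any

# Field names as listed in the Returns section above.
DC_FIELDS = ("title", "creator", "contributor", "publisher", "date", "rights")


def describe(metadata: dict[str, Any]) -> str:
    """Render the available metadata fields as 'field: value' lines."""
    lines = []
    for field in DC_FIELDS:
        value = metadata.get(field)
        if not value:
            # Skip fields that are missing, None, or empty.
            continue
        if isinstance(value, (list, tuple)):
            # Assumption: multi-valued fields arrive as a sequence.
            value = ", ".join(str(v) for v in value)
        lines.append(f"{field}: {value}")
    return "\n".join(lines)


summary = describe({"title": "Sample Book", "creator": ["Jane Doe", "John Roe"]})
```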
extract_text
Yields Markdown-converted text from all document items in the EPUB file.
Yields:
- str – Markdown-formatted text of each document item.
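Since extract_text is a generator, its output can be consumed lazily, one document item at a time, rather than collected into a single string. The following sketch shows one way to do that. The helper `iter_nonempty_sections` is hypothetical and accepts any iterable of strings, so it can be tried without an EPUB file.

```python
from typing import Iterable, Iterator


def iter_nonempty_sections(sections: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Yield (index, text) pairs, skipping items that contain only whitespace."""
    for index, text in enumerate(sections):
        stripped = text.strip()
        if stripped:
            yield index, stripped


# Stand-in data shaped like the Markdown strings described above.
sample = ["# Chapter 1\n\nIt begins.", "   ", "# Chapter 2\n"]
sections = list(iter_nonempty_sections(sample))
```

With the real class, the same helper would presumably be called as `iter_nonempty_sections(EpubProcessor("book.epub").extract_text())`, where `book.epub` is a placeholder path and `EpubProcessor` is imported from `chunklet.document_chunker.processors.epub_processor`.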