chunklet.document_chunker.processors.pdf_processor
Classes:
- PDFProcessor – PDF extraction and cleanup utility using pdfminer.six.
PDFProcessor
Bases: BaseProcessor
PDF extraction and cleanup utility using pdfminer.six.
Provides methods to extract text and metadata from PDF files, while cleaning and normalizing the extracted text using regex patterns.
This processor extracts metadata from the PDF document's information dictionary, focusing on core metadata rather than all available fields.
For more details on PDF metadata extraction using pdfminer.six, refer to
this relevant Stack Overflow discussion:
https://stackoverflow.com/questions/75591385/extract-metadata-info-from-online-pdf-using-pdfminer-in-python
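The regex-based cleanup mentioned above can be sketched as follows. The patterns and the `clean_text` name are hypothetical, chosen for illustration; this reference does not show the regexes the processor actually applies.

```python
import re

# Hypothetical patterns; the processor's real regexes are not shown here.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")  # word split across a line break
_TRAILING_WS = re.compile(r"[ \t]+\n")      # spaces or tabs before a newline
_MULTI_SPACE = re.compile(r"[ \t]{2,}")     # runs of spaces or tabs
_MULTI_BLANK = re.compile(r"\n{3,}")        # three or more consecutive newlines


def clean_text(raw: str) -> str:
    """Normalize whitespace and rejoin hyphen-split words in extracted text."""
    # Unify line endings and turn form feeds (page separators) into newlines.
    text = raw.replace("\r\n", "\n").replace("\x0c", "\n")
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    text = _TRAILING_WS.sub("\n", text)
    text = _MULTI_SPACE.sub(" ", text)
    text = _MULTI_BLANK.sub("\n\n", text)
    return text.strip()
```

Rejoining every hyphen at a line end is a simplification: it would also merge a genuine compound that happens to wrap, so a real cleanup pass may need a more careful rule.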
Initialize the PDFProcessor.
Parameters:
- file_path (str) – Path to the PDF file.
Methods:
- extract_metadata – Extracts metadata from the PDF document's information dictionary.
- extract_text – Yield cleaned text from each PDF page.
Source code in src/chunklet/document_chunker/processors/pdf_processor.py
extract_metadata
Extracts metadata from the PDF document's information dictionary.
Includes source path, page count, and PDF info fields.
Returns:
- dict[str, Any] – A dictionary containing metadata fields: title, author, creator, producer, publisher, created, modified.
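As a rough illustration of how information-dictionary entries could map onto these fields, the sketch below normalizes raw entries of the kind pdfminer.six exposes through `PDFDocument.info` (a list of dictionaries whose string values typically arrive as bytes). The key mapping, `normalize_info`, and `_decode` are assumptions made for this example, not the processor's actual code; in particular, `Publisher` is not a standard information-dictionary key.

```python
from typing import Any

# Assumed mapping from PDF information-dictionary keys to the documented
# output fields. "Publisher" is not a standard key, so that entry is a guess.
_FIELD_MAP = {
    "Title": "title",
    "Author": "author",
    "Creator": "creator",
    "Producer": "producer",
    "Publisher": "publisher",
    "CreationDate": "created",
    "ModDate": "modified",
}


def _decode(value: Any) -> Any:
    """Decode byte strings; leave any other value untouched."""
    if isinstance(value, bytes):
        if value.startswith(b"\xfe\xff"):  # UTF-16BE byte-order mark
            return value[2:].decode("utf-16-be", errors="replace")
        return value.decode("utf-8", errors="replace")
    return value


def normalize_info(info_dicts: list[dict[str, Any]]) -> dict[str, Any]:
    """Collect the mapped fields, skipping missing or empty values."""
    metadata: dict[str, Any] = {}
    for info in info_dicts:
        for pdf_key, field in _FIELD_MAP.items():
            value = _decode(info.get(pdf_key))
            if value:
                # Keep the first non-empty value seen for each field.
                metadata.setdefault(field, value)
    return metadata
```

Keys outside the mapping are dropped, which matches the stated focus on core metadata rather than every available field. Date values are left as raw PDF date strings here; whether the processor parses them is not stated in this reference.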
extract_text
Yield cleaned text from each PDF page.
Uses pdfminer.high_level.extract_text for efficient page-by-page extraction.
Yields:
- str – Markdown-formatted text of each page.
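A minimal sketch of page-by-page extraction as a generator is shown below. It relies on the `page_numbers` argument of `pdfminer.high_level.extract_text`, which takes zero-indexed page numbers. The names `iter_page_texts` and `_pdfminer_page_text`, the `page_count` parameter, and the injectable `extract_page` callable are assumptions made for this example and are not part of the documented class.

```python
from typing import Callable, Iterator, Optional


def _pdfminer_page_text(path: str, page_index: int) -> str:
    """Extract a single page with pdfminer.six (imported lazily)."""
    from pdfminer.high_level import extract_text

    # page_numbers takes zero-indexed page numbers.
    return extract_text(path, page_numbers=[page_index])


def iter_page_texts(
    path: str,
    page_count: int,
    extract_page: Optional[Callable[[str, int], str]] = None,
) -> Iterator[str]:
    """Yield the stripped text of each page, one page at a time."""
    extract = extract_page or _pdfminer_page_text
    for index in range(page_count):
        yield extract(path, index).strip()
```

Because this is a generator, a caller can stop early without extracting the remaining pages. The page count has to be supplied by the caller in this sketch; one option is to count the pages returned by `pdfminer.pdfpage.PDFPage.get_pages`.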