chunklet.document_chunker.processors.pdf_processor

Classes:

PDFProcessor –

PDF extraction and cleanup utility using pdfminer.six.

PDFProcessor

PDFProcessor(file_path: str)

Bases: BaseProcessor

PDF extraction and cleanup utility using pdfminer.six.

Provides methods to extract text and metadata from PDF files, while cleaning and normalizing the extracted text using regex patterns.

This processor extracts metadata from the PDF document's information dictionary, focusing on core metadata rather than all available fields.

For more details on PDF metadata extraction using pdfminer.six, refer to this relevant Stack Overflow discussion:

https://stackoverflow.com/questions/75591385/extract-metadata-info-from-online-pdf-using-pdfminer-in-python

Initialize the PDFProcessor.

Parameters:

file_path
(str) –

Path to the PDF file.
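
A minimal usage sketch, assuming the package is installed together with its document processing extras. Only the constructor and the two methods documented on this page are used; the helper name `summarize_pdf` and the path `report.pdf` are hypothetical.

```python
from typing import Any


def summarize_pdf(path: str) -> dict[str, Any]:
    """Collect metadata and a per-page character count for one PDF."""
    # Imported lazily so this sketch can be loaded without chunklet installed.
    from chunklet.document_chunker.processors.pdf_processor import PDFProcessor

    processor = PDFProcessor(path)
    summary: dict[str, Any] = {"metadata": processor.extract_metadata()}
    # extract_text() is a generator, so pages are consumed one at a time.
    summary["chars_per_page"] = [len(text) for text in processor.extract_text()]
    return summary


# Example call (hypothetical file path):
# summarize_pdf("report.pdf")
```

Because `extract_text()` yields lazily, the list comprehension keeps only the page lengths rather than every page's text.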

Methods:

extract_metadata –

Extracts metadata from the PDF document's information dictionary.
extract_text –

Yield cleaned text from each PDF page.

Source code in src/chunklet/document_chunker/processors/pdf_processor.py

def __init__(self, file_path: str):
    """Initialize the PDFProcessor.

    Args:
        file_path (str): Path to the PDF file.
    """
    try:
        from pdfminer.layout import LAParams
    except ImportError as e:
        raise ImportError(
            "The 'pdfminer.six' library is not installed. "
            'Install it with `pip install "pdfminer.six>=20250324"`, or install the '
            'document processing extras with `pip install "chunklet-py[document]"`.'
        ) from e
    self.file_path = file_path
    self.laparams = LAParams(
        line_margin=0.5,
    )
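
The constructor guards an optional dependency: the import is attempted at construction time and a failure is re-raised with installation instructions. The same pattern can be sketched with the standard library alone. The helper `require_module` and the module name `hypothetical_missing_pkg_xyz` are illustrative and not part of chunklet.

```python
import importlib
from types import ModuleType


def require_module(module_name: str, install_hint: str) -> ModuleType:
    """Import a module by name, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays in the traceback.
        raise ImportError(
            f"The '{module_name}' module is not available. "
            f"Install it with: {install_hint}"
        ) from exc


# A stdlib module imports normally.
json_module = require_module("json", "(bundled with Python)")

# A deliberately nonexistent module name triggers the friendlier error.
try:
    require_module("hypothetical_missing_pkg_xyz", 'pip install "some-package"')
except ImportError as err:
    error_message = str(err)
```

Raising with `from exc` keeps the underlying import failure visible, which helps when the package is installed but one of its own dependencies is broken.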

extract_metadata

extract_metadata() -> dict[str, Any]

Extracts metadata from the PDF document's information dictionary.

Includes source path, page count, and PDF info fields.

Returns:

dict[str, Any] –

A dictionary containing the source path ("source"), the page count ("page_count"), and whichever of the following PDF info fields the document provides:

- title
- author
- creator
- producer
- publisher
- created
- modified

Source code in src/chunklet/document_chunker/processors/pdf_processor.py

def extract_metadata(self) -> dict[str, Any]:
    """Extracts metadata from the PDF document's information dictionary.

    Includes source path, page count, and PDF info fields.

    Returns:
        dict[str, Any]: A dictionary containing metadata fields:
            - title
            - author
            - creator
            - producer
            - publisher
            - created
            - modified
    """
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    metadata = {"source": str(self.file_path), "page_count": 0}
    with open(self.file_path, "rb") as f:
        # Initialize parser on the file stream
        parser = PDFParser(f)

        # The PDFDocument constructor reads file structure and advances the pointer
        doc = PDFDocument(parser)

        # Count pages: Reset pointer to the start of the file stream to count pages correctly
        f.seek(0)

        metadata["page_count"] = ilen(PDFPage.get_pages(f))

        # Extract info fields from the document object
        if hasattr(doc, "info") and doc.info:
            for info in doc.info:
                for k, v in info.items():
                    k = self._safe_decode(k)
                    v = self._safe_decode(v)

                    # To keep metadata uniform with the other processors
                    k = "created" if k == "CreationDate" else k
                    k = "modified" if k == "ModDate" else k

                    if k.lower() in self.METADATA_FIELDS:
                        metadata[k.lower()] = v
    return metadata
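
The key handling in the loop above can be tried in isolation. The sketch below is a standalone approximation: `METADATA_FIELDS` is assumed to match the fields listed in the docstring, and `safe_decode` is a hypothetical stand-in for `_safe_decode`, whose real behavior is not shown on this page.

```python
from typing import Any

# Assumed whitelist, taken from the fields listed in the docstring.
METADATA_FIELDS = {
    "title", "author", "creator", "producer", "publisher", "created", "modified",
}

# PDF info keys renamed for consistency with the other processors.
KEY_ALIASES = {"CreationDate": "created", "ModDate": "modified"}


def safe_decode(value: Any) -> str:
    """Hypothetical stand-in for _safe_decode: bytes to str without raising."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)


def normalize_info(info_dicts: list[dict[Any, Any]]) -> dict[str, str]:
    """Decode, rename, and filter PDF info entries into lowercase keys."""
    metadata: dict[str, str] = {}
    for info in info_dicts:
        for raw_key, raw_value in info.items():
            key = safe_decode(raw_key)
            key = KEY_ALIASES.get(key, key)
            if key.lower() in METADATA_FIELDS:
                metadata[key.lower()] = safe_decode(raw_value)
    return metadata


sample_info = [
    {
        "Title": b"Annual Report",
        "CreationDate": b"D:20240101120000",
        "Keywords": b"finance",  # not whitelisted, so it is dropped
        "ModDate": "D:20240202",
    }
]
normalized = normalize_info(sample_info)
```

With this sample input, `Keywords` is filtered out and the two date keys appear under `created` and `modified`.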

extract_text

extract_text() -> Generator[str, None, None]

Yield cleaned text from each PDF page.

Extracts text content page by page using pdfminer.high_level.extract_text for efficient processing. Each page is processed individually to avoid memory issues with large PDF files. The extracted text is cleaned using the _cleanup_text method to remove artifacts and normalize formatting.

Yields:

str –

Cleaned text content from each PDF page.
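
The page-by-page generator pattern can be sketched without pdfminer. In this illustration, `read_page` is a hypothetical callable standing in for the per-page extraction call, and `cleanup_text` is an assumed, simplified version of the regex cleanup; the real `_cleanup_text` patterns are not shown on this page.

```python
import re
from typing import Callable, Generator


def cleanup_text(raw: str) -> str:
    """Hypothetical cleanup: collapse space runs and excess blank lines."""
    text = re.sub(r"[ \t]+", " ", raw)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def iter_clean_pages(
    page_count: int, read_page: Callable[[int], str]
) -> Generator[str, None, None]:
    """Yield cleaned text one page at a time instead of building a list."""
    for page_num in range(page_count):
        yield cleanup_text(read_page(page_num))


# Fake page contents stand in for text extracted from a PDF.
fake_pages = ["Hello    world\n\n\n\nNext  paragraph ", "\tSecond   page\n"]
pages = iter_clean_pages(len(fake_pages), fake_pages.__getitem__)

first_page = next(pages)   # only the first page has been processed so far
remaining = list(pages)    # consumes the rest of the generator
```

Since nothing is computed until the caller asks for the next page, a consumer can stop early without paying for the remaining pages.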

Source code in src/chunklet/document_chunker/processors/pdf_processor.py

def extract_text(self) -> Generator[str, None, None]:
    """Yield cleaned text from each PDF page.

    Extracts text content page by page using pdfminer.high_level.extract_text
    for efficient processing. Each page is processed individually to avoid
    memory issues with large PDF files. The extracted text is cleaned using
    the _cleanup_text method to remove artifacts and normalize formatting.

    Yields:
        str: Cleaned text content from each PDF page.
    """
    from pdfminer.high_level import extract_text
    from pdfminer.pdfpage import PDFPage

    with open(self.file_path, "rb") as fp:
        page_count = ilen(PDFPage.get_pages(fp))

        for page_num in range(page_count):
            # Call extract_text on the file path, restricting it to one page
            # so that only that page's text is extracted and held in memory.
            # This replaced the old `extract_text_to_fp` approach.
            raw_text = extract_text(
                self.file_path,
                page_numbers=[page_num],
                laparams=self.laparams,
            )
            yield self._cleanup_text(raw_text)