chunklet.document_chunker.processors.pdf_processor

Classes:

  • PDFProcessor

    PDF extraction and cleanup utility using pdfminer.six.

PDFProcessor

PDFProcessor(file_path: str)

Bases: BaseProcessor

PDF extraction and cleanup utility using pdfminer.six.

Provides methods to extract text and metadata from PDF files, while cleaning and normalizing the extracted text using regex patterns.

This processor extracts metadata from the PDF document's information dictionary, focusing on core metadata rather than all available fields.

For more details on PDF metadata extraction with pdfminer.six, see this Stack Overflow discussion:

https://stackoverflow.com/questions/75591385/extract-metadata-info-from-online-pdf-using-pdfminer-in-python

Initialize the PDFProcessor.

Parameters:

  • file_path

    (str) –

    Path to the PDF file.

Methods:

  • extract_metadata

    Extracts metadata from the PDF document's information dictionary.

  • extract_text

    Yield cleaned text from each PDF page.
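A minimal usage sketch follows, assuming the import path matches the module name at the top of this page. The helper name `summarize` and the file name `report.pdf` are hypothetical. The helper relies only on the two methods listed above, so it works with any object that provides them.

```python
from itertools import islice


def summarize(processor, max_pages=2, preview_chars=80):
    """Return the metadata plus a short preview of the first few pages.

    `processor` is any object with `extract_metadata()` and `extract_text()`,
    such as a PDFProcessor instance.
    """
    metadata = processor.extract_metadata()
    # extract_text() is a generator, so islice stops pulling from it after
    # max_pages pages and text is only extracted for those pages.
    previews = [
        text[:preview_chars]
        for text in islice(processor.extract_text(), max_pages)
    ]
    return {"metadata": metadata, "previews": previews}


# Intended call (hypothetical file name; requires chunklet and pdfminer.six):
#   from chunklet.document_chunker.processors.pdf_processor import PDFProcessor
#   print(summarize(PDFProcessor("report.pdf")))
```

Because the pages are yielded one at a time, a caller can stop early or stream pages into a chunker without holding the whole document's text in memory.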

Source code in src/chunklet/document_chunker/processors/pdf_processor.py
def __init__(self, file_path: str):
    """Initialize the PDFProcessor.

    Args:
        file_path (str): Path to the PDF file.
    """
    try:
        from pdfminer.layout import LAParams
    except ImportError as e:
        raise ImportError(
            "The 'pdfminer.six' library is not installed. "
            "Please install it with `pip install 'pdfminer.six>=20250324'` or install the document processing extras "
            "with `pip install 'chunklet-py[document]'`."
        ) from e
    self.file_path = file_path
    self.laparams = LAParams(
        line_margin=0.5,
    )
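The constructor above imports pdfminer lazily and turns a missing dependency into an actionable error. The same pattern can be sketched as a small reusable helper. The name `require_optional` is hypothetical and not part of chunklet.

```python
import importlib


def require_optional(module_name, install_hint):
    """Import module_name, or raise ImportError carrying an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            f"The '{module_name}' module is not available. {install_hint}"
        ) from exc


# Possible use inside a constructor (requires pdfminer.six to succeed):
#   layout = require_optional("pdfminer.layout", "Try: pip install pdfminer.six")
#   laparams = layout.LAParams(line_margin=0.5)
```

Deferring the import to the constructor means that importing the processors package does not fail for users who never touch PDF files.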

extract_metadata

extract_metadata() -> dict[str, Any]

Extracts metadata from the PDF document's information dictionary.

Includes source path, page count, and PDF info fields.

Returns:

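  • dict[str, Any]

    A dictionary of metadata fields. source and page_count are always present; the following fields are included only when the PDF's information dictionary provides them:

      • title
      • author
      • creator
      • producer
      • publisher
      • created
      • modified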
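The created and modified values are copied from the PDF's CreationDate and ModDate entries without further parsing. Assuming they arrive as PDF date strings such as `D:20240315093000+05'30'`, a caller could convert them to `datetime` objects along these lines. `parse_pdf_date` is a hypothetical helper, not part of chunklet or pdfminer.six.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date layout: D:YYYYMMDDHHmmSS followed by Z, or by +HH'mm' / -HH'mm'.
# Everything after the year is optional.
_PDF_DATE = re.compile(
    r"D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[Zz+-])(?:(?P<tzh>\d{2})'?(?:(?P<tzm>\d{2})'?)?)?)?"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it does not parse."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tzinfo = None
        if parts["sign"] in ("Z", "z"):
            tzinfo = timezone.utc
        elif parts["sign"] in ("+", "-"):
            offset = timedelta(
                hours=int(parts["tzh"] or 0),
                minutes=int(parts["tzm"] or 0),
            )
            tzinfo = timezone(-offset if parts["sign"] == "-" else offset)
        # Missing month/day default to 1, missing time fields to 0.
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range components, for example month 13.
        return None
```

Returning `None` instead of raising keeps the helper safe to apply to PDFs whose producers write non-standard dates.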

Source code in src/chunklet/document_chunker/processors/pdf_processor.py
def extract_metadata(self) -> dict[str, Any]:
    """Extracts metadata from the PDF document's information dictionary.

    Includes source path, page count, and PDF info fields.

    Returns:
        dict[str, Any]: A dictionary containing metadata fields:
            - title
            - author
            - creator
            - producer
            - publisher
            - created
            - modified
    """
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    metadata = {"source": str(self.file_path), "page_count": 0}
    with open(self.file_path, "rb") as f:
        # Initialize parser on the file stream
        parser = PDFParser(f)

        # The PDFDocument constructor reads file structure and advances the pointer
        doc = PDFDocument(parser)

        # Count pages: Reset pointer to the start of the file stream to count pages correctly
        f.seek(0)

        metadata["page_count"] = ilen(PDFPage.get_pages(f))

        # Extract info fields from the document object
        if hasattr(doc, "info") and doc.info:
            for info in doc.info:
                for k, v in info.items():

                    # Decode first so that byte keys and values can be matched below
                    if isinstance(k, bytes):
                        k = k.decode("utf-8", "ignore")
                    if isinstance(v, bytes):
                        v = v.decode("utf-8", "ignore")

                    # To keep metadata uniform
                    k = "created" if k == "CreationDate" else k
                    k = "modified" if k == "ModDate" else k

                    if k.lower() in self.METADATA_FIELDS:
                        metadata[k.lower()] = v
    return metadata
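The key handling in extract_metadata can be isolated into a pure function, which makes it easy to test without a PDF. This is a sketch under assumptions: the value of `METADATA_FIELDS` is not shown on this page, so the set below is inferred from the documented return fields, and `normalize_info` is a hypothetical name.

```python
# Assumed allow-list, inferred from the documented return fields; the real
# PDFProcessor.METADATA_FIELDS attribute is not shown on this page.
METADATA_FIELDS = {
    "title", "author", "creator", "producer", "publisher", "created", "modified",
}
# PDF info keys that are renamed to keep the output uniform.
RENAMES = {"CreationDate": "created", "ModDate": "modified"}


def _to_str(value):
    """Decode bytes to str, ignoring undecodable bytes; pass other values through."""
    return value.decode("utf-8", "ignore") if isinstance(value, bytes) else value


def normalize_info(info_dicts):
    """Flatten a list of PDF info dictionaries into lower-cased core fields."""
    result = {}
    for info in info_dicts:
        for key, value in info.items():
            # Decode before matching so that byte keys can be recognized too.
            key = _to_str(key)
            key = RENAMES.get(key, key).lower()
            if key in METADATA_FIELDS:
                result[key] = _to_str(value)
    return result
```

For example, `normalize_info([{"Title": b"Report", "Keywords": "x"}])` keeps the title as a decoded string and drops the keywords entry, since it is not in the allow-list.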

extract_text

extract_text() -> Generator[str, None, None]

Yield cleaned text from each PDF page.

Uses pdfminer.high_level.extract_text for efficient page-by-page extraction.

Yields:

  • (str) –

    Markdown-formatted text of each page.

Source code in src/chunklet/document_chunker/processors/pdf_processor.py
def extract_text(self) -> Generator[str, None, None]:
    """Yield cleaned text from each PDF page.

    Uses pdfminer.high_level.extract_text for efficient page-by-page extraction.

    Yields:
        str: Markdown-formatted text of each page.
    """
    from pdfminer.high_level import extract_text
    from pdfminer.pdfpage import PDFPage

    with open(self.file_path, "rb") as fp:
        page_count = ilen(PDFPage.get_pages(fp))

        for page_num in range(page_count):
            # Call extract_text on the file path, specifying the page number.
            # This is efficient as it avoids repeated file seeks/parsing
            # within the loop that was present in the old `extract_text_to_fp` approach.
            raw_text = extract_text(
                self.file_path,
                page_numbers=[page_num],
                laparams=self.laparams,
            )
            yield self._cleanup_text(raw_text)
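In pdfminer.six, passing a file path to `extract_text` opens the file for that call, so the loop above opens the PDF once per page. Where that cost matters, a single-pass variant can be sketched with `pdfminer.high_level.extract_pages`, which yields one layout object per page. This is an alternative, not chunklet's implementation: the name `iter_page_texts` is hypothetical, the sketch skips the `_cleanup_text` step, and its output may differ in whitespace from what `extract_text` returns.

```python
def iter_page_texts(file_path, laparams=None):
    """Yield the raw text of each page while reading the PDF in one pass."""
    # Imported lazily, mirroring the processor, so that defining this
    # function does not require pdfminer.six to be installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(file_path, laparams=laparams):
        # Keep only the layout elements that carry text; figures, lines
        # and rectangles are skipped.
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )
```

Like extract_text above, this is a generator: nothing is read from disk until the first page is requested.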