Metadata in Chunklet-py

Chunklet-py automatically generates metadata for each chunk, recording where the chunk came from and what it covers. This guide explains how that metadata is structured, which fields each chunker adds, and how to use it in your applications.

Each chunk generated by Chunklet-py is encapsulated in a Box object. This Box object contains a metadata attribute, which is a dictionary holding various pieces of information about the chunk. You can access this metadata using either dot notation (chunk.metadata) or dictionary-style access (chunk["metadata"]).
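The two access styles can be tried without running a chunker. The sketch below uses a small stand-in class, FakeChunk, that only mimics the dot and key behavior described above. It is hypothetical and is not the Box class that Chunklet-py returns; the content field and the sample values are also assumptions made for illustration.

```python
class FakeChunk(dict):
    """Hypothetical stand-in for a chunk: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Made-up chunk; real chunks come from a Chunklet-py chunker.
chunk = FakeChunk(
    content="Hello world.",
    metadata={"chunk_num": 1, "span": (0, 12)},
)

# Both access styles reach the same dictionary object.
by_attribute = chunk.metadata
by_key = chunk["metadata"]
print(by_attribute is by_key)
print(by_attribute["chunk_num"], by_key["span"])
```

Either style works for reading; which one to use is a matter of taste, though key access is the safer choice if a field name could collide with a method name.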

Common Metadata

Regardless of the chunking strategy used, chunks generated by Chunklet-py carry the following common metadata fields. The source field is the one that may be missing, as described in its entry:

chunk_num (int): A unique, sequential identifier for the chunk within its original source. This helps in ordering and referencing chunks.
span (tuple[int, int]): Represents the start and end character indices of the chunk within the original text. This is particularly useful for pinpointing the exact location of a chunk in the source material.
source (str): Indicates the origin of the chunk.
- If the chunk was generated from a file and processed by DocumentChunker or CodeChunker, this will be the absolute path to that file.
- If the chunk was generated from a direct text input via the CLI, this will be "stdin".
- When PlainTextChunker is used programmatically with a string, it does not assign a source on its own. The field is set only if you supply it via base_metadata. If you do not, it may be absent, or it may hold whatever empty value (such as an empty string or None) the base_metadata you passed contains.
- If CodeChunker processes a file but cannot determine its source (e.g., because of an internal error), the value may be "N/A".
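As a minimal sketch of how span and source might be consumed, the helpers below map a chunk back to its place in the original text and produce a printable source label. The function names and sample values are hypothetical, and the code assumes that span offsets index into the same string that was chunked and that the end index is exclusive (Python slice convention); verify both against your own output before relying on them.

```python
def text_for_span(original_text, metadata):
    """Return the slice of original_text that the chunk's span points to."""
    start, end = metadata["span"]  # assumption: end index is exclusive
    return original_text[start:end]


def describe_source(metadata):
    """Return a printable source label, tolerating a missing or empty field."""
    return metadata.get("source") or "<unknown source>"


original_text = "First sentence. Second sentence."
metadata = {"chunk_num": 2, "span": (16, 32)}  # made-up values, no source set

print(text_for_span(original_text, metadata))
print(describe_source(metadata))
```

Using get() with an "or" fallback covers all three cases mentioned above: a missing key, an empty string, and None.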

PlainTextChunker Metadata

When using the PlainTextChunker, chunks include the Common Metadata fields chunk_num and span. The source field is included only if you provide it explicitly via the base_metadata parameter in the chunking call. PlainTextChunker adds no metadata beyond these basic identifiers.

DocumentChunker Metadata

The DocumentChunker processes various file types. In addition to the Common Metadata fields, it adds the section-tracking fields below and file-specific metadata extracted from the document itself.

section_count (int): For multi-section documents (e.g., multi-page PDFs, multi-chapter EPUBs), this indicates the total number of sections in the document.
curr_section (int): For multi-section documents, this denotes the current section number of the chunk.
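One possible use of these two fields is to regroup chunks by the section they came from, for example to show all chunks from one PDF page together. The sketch below works on plain dictionaries with made-up values; the function name is hypothetical, and it makes no assumption about whether section numbering starts at 0 or 1.

```python
from collections import defaultdict


def group_by_section(metadatas):
    """Map each curr_section value to the chunk_num values found in it."""
    sections = defaultdict(list)
    for meta in metadatas:
        # curr_section only applies to multi-section documents, so use get().
        section = meta.get("curr_section")
        sections[section].append(meta["chunk_num"])
    return dict(sections)


# Made-up metadata for a three-section document.
sample = [
    {"chunk_num": 1, "curr_section": 1, "section_count": 3},
    {"chunk_num": 2, "curr_section": 1, "section_count": 3},
    {"chunk_num": 3, "curr_section": 2, "section_count": 3},
]
print(group_by_section(sample))
```

Chunks without a curr_section field end up under the key None, which keeps single-section documents from raising an error.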

The exact fields extracted depend on the file type:

PDF Files: If present in the document, metadata fields such as title, author, creator, producer, publisher, created, modified, and page_count will be extracted. While pdfminer.six supports a wider range, these are the primary fields included. For more details, refer to the pdfminer.six documentation.
EPUB Files: Following the Dublin Core standard, if available in the document, fields like title, creator, contributor, publisher, date, and rights will be extracted. These are a selection of the supported metadata. For more details, refer to the ebooklib tutorial.
DOCX Files: Core properties such as title, author, publisher, last_modified_by, created, modified, rights, and version will be extracted if they exist in the document. These are the key properties included from the broader set supported by python-docx. For more details, refer to the python-docx documentation.

Important Note on Optional Metadata

These file-specific metadata fields are optional and may not be present in every document. To access them safely, use the dictionary's get() method (e.g., chunk.metadata.get("author")), which avoids a KeyError when a field is missing.
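A minimal sketch of this defensive pattern, using plain dictionaries with made-up values in place of real chunk metadata. The function name and the fallback strings are hypothetical choices.

```python
def format_citation(metadata):
    """Build a short citation line from whichever optional fields are present."""
    title = metadata.get("title", "Untitled")
    author = metadata.get("author", "Unknown author")
    page_count = metadata.get("page_count")
    suffix = f" ({page_count} pages)" if page_count is not None else ""
    return f"{title}, {author}{suffix}"


# One document with full metadata, one with only the common fields.
full = {"title": "Annual Report", "author": "J. Doe", "page_count": 12}
sparse = {"chunk_num": 1, "span": (0, 40)}

print(format_citation(full))
print(format_citation(sparse))
```

The same function handles both dictionaries, whereas direct indexing such as metadata["author"] would fail on the second one.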

CodeChunker Metadata

The CodeChunker is specialized for code files. In addition to the Common Metadata fields, it provides code-specific metadata that describes the structure and position of each code chunk.

The additional metadata fields include:

tree (str): A string representation of the abstract syntax tree (AST) or a relevant portion of it for the code chunk. This provides structural context for the chunk.
start_line (int): The starting line number of the code chunk in the original file.
end_line (int): The ending line number of the code chunk in the original file.

This code-specific metadata is automatically included in the metadata dictionary of each Box object when you chunk code using CodeChunker.
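The line fields make it possible to go from a line number, such as one reported in a traceback, back to the chunk that contains it. The sketch below is a hypothetical helper operating on made-up metadata dictionaries; it assumes that end_line is inclusive, which you should confirm against real CodeChunker output.

```python
def chunks_covering_line(metadatas, line_number):
    """Return chunk_num values whose start_line..end_line range contains line_number."""
    return [
        meta["chunk_num"]
        for meta in metadatas
        # Assumption: both start_line and end_line are inclusive.
        if meta["start_line"] <= line_number <= meta["end_line"]
    ]


# Made-up metadata for a file split into two code chunks.
code_chunks = [
    {"chunk_num": 1, "start_line": 1, "end_line": 20},
    {"chunk_num": 2, "start_line": 21, "end_line": 48},
]
print(chunks_covering_line(code_chunks, 30))
```

Returning a list rather than a single value leaves room for chunkers that produce overlapping chunks.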

CLI Metadata Output

When using the chunklet CLI, the inclusion and type of metadata in the output are dynamically determined by your input and the chunker flags you provide.

Controlling Metadata Output: The --metadata flag (e.g., chunklet chunk "..." --metadata) is your primary control for whether metadata is included in the final output.
- If --metadata is used, metadata is included alongside your chunks: printed to stdout, or saved in .json files if a directory is specified for --destination.
- If --metadata is not used, only the chunk content will be output.
Metadata Varies by Input and Chunker:
- Direct Text Input (e.g., chunklet chunk "Your text here..."): When you provide text directly, the PlainTextChunker is used. The metadata will primarily consist of Common Metadata fields like chunk_num and span. The source will be "stdin".
- File Input with --doc (e.g., chunklet chunk --doc --source document.pdf): The DocumentChunker is employed. In addition to Common Metadata, you'll receive rich, file-specific metadata (e.g., title, author, page_count for PDFs) as detailed in the DocumentChunker Metadata section.
- File Input with --code (e.g., chunklet chunk --code --source code.py): The CodeChunker is used. Metadata will include Common Metadata along with the code-specific fields tree, start_line, and end_line, as described in the CodeChunker Metadata section.

This dynamic approach ensures you get the most relevant contextual information for your chunks, tailored to how you're using Chunklet-py.