Metadata in Chunklet-py
Chunklet-py automatically generates and manages metadata for each chunk, providing valuable context about its origin and characteristics. This guide explains how metadata is handled, its structure, and how you can leverage it effectively in your applications.
Each chunk generated by Chunklet-py is encapsulated in a Box object. This Box object contains a metadata attribute, which is a dictionary holding various pieces of information about the chunk. You can access this metadata using either dot notation (chunk.metadata) or dictionary-style access (chunk["metadata"]).
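The two access styles can be tried without installing anything. The sketch below uses a small stand-in class, AttrDict, which is hypothetical and only mimics the attribute-plus-key access that a Box provides. The content field and all values are made up for illustration; real chunks come from a Chunklet-py chunker.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a Box: keys are also readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Illustrative chunk (assumed shape, not produced by the library).
chunk = AttrDict(
    content="Hello world.",
    metadata=AttrDict(chunk_num=1, span=(0, 12)),
)

dot_style = chunk.metadata       # dot notation
key_style = chunk["metadata"]    # dictionary-style access

# Both styles return the very same metadata object.
same_object = dot_style is key_style
first_num = chunk.metadata["chunk_num"]
```

Either style can be mixed freely, so pick whichever reads better in the surrounding code.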
Common Metadata
Regardless of the chunking strategy used, every chunk generated by Chunklet-py includes the following common metadata fields:
- chunk_num (int): A unique, sequential identifier for the chunk within its original source. This helps in ordering and referencing chunks.
- span (tuple[int, int]): The start and end character indices of the chunk within the original text. This is particularly useful for pinpointing the exact location of a chunk in the source material.
- source (str): The origin of the chunk.
  - If the chunk was generated from a file and processed by DocumentChunker or CodeChunker, this is the absolute path to that file.
  - If the chunk was generated from direct text input via the CLI, this is "stdin".
  - When PlainTextChunker is used programmatically with a string, it does not assign a source unless one is explicitly provided via base_metadata. If none is provided, the source field may be absent, or it may be an empty string or None, depending on the base_metadata passed.
  - For CodeChunker, when a file is processed but the source cannot be determined (e.g., because of an internal error), it may be "N/A".
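As a rough illustration of how chunk_num and span can be used together, the sketch below restores chunk order and recovers each chunk's text from the original string. The metadata dictionaries are hand-written sample data, and the code assumes that the end index of span is exclusive (Python slice convention), which the description above does not state.

```python
source_text = "First sentence. Second sentence. Third one."

# Illustrative metadata, deliberately out of order.
chunks_meta = [
    {"chunk_num": 2, "span": (16, 32)},
    {"chunk_num": 1, "span": (0, 15)},
    {"chunk_num": 3, "span": (33, 43)},
]


def slice_by_span(text, span):
    """Return the part of text covered by span."""
    start, end = span  # assumption: end index is exclusive
    return text[start:end]


# chunk_num restores the original order; span locates the text.
ordered = sorted(chunks_meta, key=lambda meta: meta["chunk_num"])
pieces = [slice_by_span(source_text, meta["span"]) for meta in ordered]
```

If the recovered text is off by one character at the end, the span is likely inclusive and the slice should use end + 1.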
PlainTextChunker Metadata
When using the PlainTextChunker, chunks will primarily include the Common Metadata fields: chunk_num and span. The source field is only included if explicitly provided via the base_metadata parameter during the chunking call. PlainTextChunker does not add any additional metadata beyond these basic identifiers.
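Conceptually, the effect of base_metadata can be pictured as caller-supplied fields merged with the per-chunk fields. The helper below, build_metadata, is hypothetical and is not how the library implements it; it only illustrates why source is missing unless you pass it in, and why reading it with get() is the safer habit.

```python
def build_metadata(chunk_num, span, base_metadata=None):
    """Hypothetical helper: merge caller-supplied fields with per-chunk fields."""
    metadata = dict(base_metadata or {})
    metadata.update({"chunk_num": chunk_num, "span": span})
    return metadata


without_source = build_metadata(1, (0, 20))
with_source = build_metadata(1, (0, 20), {"source": "notes.txt"})

# Reading code should tolerate a missing source field.
label_a = without_source.get("source", "unknown")
label_b = with_source.get("source", "unknown")
```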
DocumentChunker Metadata
The DocumentChunker is designed to process various file types and, in addition to the Common Metadata fields, it extracts rich, file-specific metadata. This additional metadata provides deeper insights into the document's characteristics and origin.
- section_count (int): For multi-section documents (e.g., multi-page PDFs, multi-chapter EPUBs), the total number of sections in the document.
- curr_section (int): For multi-section documents, the section number that the chunk belongs to.
The exact fields extracted depend on the file type:
- PDF Files: If present in the document, metadata fields such as title, author, creator, producer, publisher, created, modified, and page_count are extracted. While pdfminer.six supports a wider range, these are the primary fields included. For more details, refer to the pdfminer.six documentation.
- EPUB Files: Following the Dublin Core standard, fields like title, creator, contributor, publisher, date, and rights are extracted if available in the document. These are a selection of the supported metadata. For more details, refer to the ebooklib tutorial.
- DOCX Files: Core properties such as title, author, publisher, last_modified_by, created, modified, rights, and version are extracted if they exist in the document. These are the key properties included from the broader set supported by python-docx. For more details, refer to the python-docx documentation.
Important Note on Optional Metadata
These file-specific metadata fields are optional and may not be present in every document. For robust access, it is highly recommended to use the dictionary's get() method (e.g., chunk.metadata.get("author")) to avoid KeyErrors if a field is missing.
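A minimal sketch of this defensive style is shown below. The describe function and both sample dictionaries are illustrative only; the field names are the ones listed in the DocumentChunker Metadata section.

```python
def describe(metadata):
    """Build a one-line label, falling back when optional fields are missing."""
    title = metadata.get("title") or "Untitled"
    author = metadata.get("author") or "Unknown author"
    section = metadata.get("curr_section")
    total = metadata.get("section_count")
    if section is not None and total is not None:
        where = f"section {section} of {total}"
    else:
        where = "section unknown"
    return f"{title} by {author} ({where})"


# Sample data: one document with rich metadata, one with only common fields.
full = {
    "chunk_num": 4,
    "span": (1200, 1650),
    "title": "Annual Report",
    "author": "J. Doe",
    "curr_section": 2,
    "section_count": 10,
}
sparse = {"chunk_num": 1, "span": (0, 300)}

full_label = describe(full)
sparse_label = describe(sparse)
```

Using "or" after get() also covers fields that are present but empty, which can happen when a document stores a blank title or author.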
CodeChunker Metadata
The CodeChunker is specialized for processing code files and, in addition to the Common Metadata fields, it provides rich, code-specific metadata to help understand the structure and context of code chunks.
The additional metadata fields include:
- tree (str): A string representation of the abstract syntax tree (AST), or a relevant portion of it, for the code chunk. This provides structural context for the chunk.
- start_line (int): The starting line number of the code chunk in the original file.
- end_line (int): The ending line number of the code chunk in the original file.
This code-specific metadata is automatically included in the metadata dictionary of each Box object when you chunk code using CodeChunker.
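One common use of start_line and end_line is to pull the matching lines back out of the original file, for example to show a chunk with its surrounding context. The sketch below works on hand-written sample data and assumes that line numbers are 1-based and that end_line is inclusive; neither is stated above, so verify against real output before relying on it.

```python
source_code = "\n".join([
    "import os",
    "",
    "def greet(name):",
    "    return 'Hello, ' + name",
    "",
    "print(greet('world'))",
])

# Illustrative metadata for a chunk covering the greet function.
metadata = {"chunk_num": 2, "start_line": 3, "end_line": 4}


def lines_for_chunk(text, meta):
    """Return the source lines a chunk covers."""
    lines = text.splitlines()
    # Assumption: start_line is 1-based and end_line is inclusive.
    return lines[meta["start_line"] - 1 : meta["end_line"]]


chunk_lines = lines_for_chunk(source_code, metadata)
```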
CLI Metadata Output
When using the chunklet CLI, the inclusion and type of metadata in the output are dynamically determined by your input and the chunker flags you provide.
- Controlling Metadata Output: The --metadata flag (e.g., chunklet chunk "..." --metadata) is your primary control for whether metadata is included in the final output.
  - If --metadata is used, metadata is displayed alongside your chunks (either printed to stdout or saved in .json files if a directory is specified for --destination).
  - If --metadata is not used, only the chunk content is output.
- Metadata Varies by Input and Chunker:
  - Direct Text Input (e.g., chunklet chunk "Your text here..."): When you provide text directly, the PlainTextChunker is used. The metadata primarily consists of Common Metadata fields like chunk_num and span. The source is "stdin".
  - File Input with --doc (e.g., chunklet chunk --doc --source document.pdf): The DocumentChunker is employed. In addition to Common Metadata, you receive rich, file-specific metadata (e.g., title, author, page_count for PDFs) as detailed in the DocumentChunker Metadata section.
  - File Input with --code (e.g., chunklet chunk --code --source code.py): The CodeChunker is used. Metadata includes Common Metadata along with code-specific fields like tree, start_line, end_line, and source_path, as described in the CodeChunker Metadata section.
This dynamic approach ensures you get the most relevant contextual information for your chunks, tailored to how you're using Chunklet-py.