Metadata in Chunklet-py: Your Chunk's Story 📖
Ever wondered where your chunks come from and what makes them tick? 🤔 Chunklet-py's metadata system tells the whole story! Each chunk comes with rich contextual information about its origin, location, and characteristics. Think of metadata as your chunk's detailed biography - the who, what, when, and where of your text.
Every chunk is wrapped in a handy Box object with a metadata attribute. This metadata dictionary is your treasure trove of chunk insights. Access it easily with dot notation (chunk.metadata) or dictionary-style (chunk["metadata"]) - your choice!
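Here is a minimal sketch of the two access styles. The ChunkBox class is a hypothetical stand-in that only mimics the dot-and-key access of the real Box object, and the content and metadata values are made up for illustration.

```python
class ChunkBox(dict):
    """Hypothetical stand-in for the Box object: a dict with attribute access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


# Made-up chunk for illustration; real chunks come from a chunker.
chunk = ChunkBox(
    content="Hello, world.",
    metadata={"chunk_num": 1, "span": (0, 13)},
)

# Dot notation and dictionary-style access reach the same metadata dict.
dot_style = chunk.metadata
key_style = chunk["metadata"]
print(dot_style is key_style)
print(dot_style["chunk_num"], dot_style["span"])
```

Either style works, so pick whichever reads better in your codebase and stick with it.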
Common Metadata: The Essentials 📋
No matter which chunker you use, every chunk includes these fundamental metadata fields. Think of them as your chunk's basic information - the essentials you need to know.
- chunk_num (int): Your chunk's sequential ID number within each source - perfect for keeping things organized
- span (tuple[int, int]): Character position coordinates showing exactly where this chunk sits in the original text
- source (str): Where did this chunk come from? (The origin story!)
  - File processing: Absolute path to the file (for DocumentChunker or CodeChunker)
  - CLI text input: "stdin" (because it came from standard input)
  - DocumentChunker text input: Only included if you provide it via the base_metadata parameter
  - CodeChunker edge cases: Might be "N/A" if the source can't be determined
DocumentChunker Metadata: Rich & Detailed 📚
The DocumentChunker provides comprehensive metadata for both text and file inputs. The metadata varies based on your input type - think of it as your chunk's detailed biography! 👇
Text Input
Text input keeps things straightforward and clean. Your chunks include the essential Common Metadata fields (chunk_num and span). No frills, just the basics - perfect when you want clean, minimal metadata without any extra baggage. You can supply additional metadata via the base_metadata parameter.
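For text input, span is what ties a chunk back to the string you passed in. The sketch below uses hand-written metadata rather than real chunker output, and it assumes span is a (start, end) pair of character offsets with an exclusive end, which this page does not confirm. The text_for helper is hypothetical.

```python
source_text = "First sentence. Second sentence."

# Hand-written metadata in the shape described above (values are illustrative).
chunks_metadata = [
    {"chunk_num": 1, "span": (0, 15)},
    {"chunk_num": 2, "span": (16, 32)},
]


def text_for(metadata, text):
    """Return the slice of `text` covered by the chunk's span.

    Assumption: span is (start, end) with an exclusive end offset.
    """
    start, end = metadata["span"]
    return text[start:end]


recovered = [text_for(m, source_text) for m in chunks_metadata]
for m, piece in zip(chunks_metadata, recovered):
    print(m["chunk_num"], repr(piece))
```

If the slices you get back are off by one character, the end offset is probably inclusive in your version, so check against a real chunk before relying on it.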
File Input
Need more context? File input's got you covered! It provides comprehensive metadata beyond the basics - revealing detailed insights about each file's properties and history:
Universal Fields (for multi-section docs):
- section_count (int): Total number of sections in the document (pages, chapters, etc.)
- curr_section (int): Which section this chunk belongs to
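One way to use these two fields is to group chunks by section. This is a rough sketch over hand-written metadata dicts: the values are made up, group_by_section is a hypothetical helper, and whether section numbering starts at 0 or 1 is not stated here.

```python
from collections import defaultdict

# Illustrative metadata for chunks from a three-section document.
chunks_metadata = [
    {"chunk_num": 1, "section_count": 3, "curr_section": 1},
    {"chunk_num": 2, "section_count": 3, "curr_section": 1},
    {"chunk_num": 3, "section_count": 3, "curr_section": 2},
    {"chunk_num": 4, "section_count": 3, "curr_section": 3},
]


def group_by_section(metadata_list):
    """Map each section number to the chunk_num values found in it."""
    by_section = defaultdict(list)
    for m in metadata_list:
        section = m.get("curr_section")  # skip chunks without section info
        if section is not None:
            by_section[section].append(m["chunk_num"])
    return dict(by_section)


sections = group_by_section(chunks_metadata)
print(sections)
```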
File-Type Specific Information:
- PDF Files: Includes title, author, creator, producer, publisher, created, modified, and page_count fields (powered by pdfminer.six)
- EPUB Files: Dublin Core metadata including title, creator, contributor, publisher, date, and rights
- DOCX Files: Core properties like title, author, publisher, last_modified_by, created, modified, rights, and version
- ODT Files: Dublin Core and OpenDocument metadata including title, creator, initial_creator, created, chapter, and author (powered by odfpy)
Safety First with Optional Fields!
These metadata fields are optional - not every document fills them out. Use chunk.metadata.get("author") instead of chunk.metadata["author"] to avoid KeyErrors when a field is missing. Better safe than sorry! 😉
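The difference is easy to see with a plain dictionary standing in for chunk.metadata. The field values below are made up, and the "Unknown author" fallback is just one possible default.

```python
# Illustrative metadata for a document that has a title but no author.
metadata = {"chunk_num": 1, "span": (0, 120), "title": "Annual Report"}

# .get() returns None (or a default you choose) when the field is missing.
title = metadata.get("title")
author = metadata.get("author", "Unknown author")
print(title, "|", author)

# Square-bracket access raises KeyError for the same missing field.
try:
    metadata["author"]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```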
CodeChunker Metadata: Code Intelligence 💻
The CodeChunker provides code-specific insights beyond basic metadata. It helps you understand the structural context of each chunk - perfect for tracking where your code elements originated! 🔍
Code-Specific Information:
- tree (str): Abstract syntax tree representation showing structural relationships within the chunk
- start_line (int): Line number where this chunk begins in the original file
- end_line (int): Line number where this chunk ends in the original file
These fields are automatically included in every Box object when chunking code, helping you understand which functions, classes, or code blocks are in each chunk.
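A common use for start_line and end_line is finding the chunk that covers a given line, for example when mapping an error location back to a chunk. The sketch below uses hand-written metadata and assumes both line numbers are inclusive, which this page does not state. The tree field is left out because its exact format is not shown here, and chunk_for_line is a hypothetical helper.

```python
# Illustrative metadata for three code chunks from one file.
chunks_metadata = [
    {"chunk_num": 1, "start_line": 1, "end_line": 12},
    {"chunk_num": 2, "start_line": 13, "end_line": 40},
    {"chunk_num": 3, "start_line": 41, "end_line": 58},
]


def chunk_for_line(metadata_list, line_number):
    """Return the metadata of the chunk covering `line_number`, or None.

    Assumption: start_line and end_line are both inclusive.
    """
    for m in metadata_list:
        if m["start_line"] <= line_number <= m["end_line"]:
            return m
    return None


hit = chunk_for_line(chunks_metadata, 27)
miss = chunk_for_line(chunks_metadata, 99)
print(hit["chunk_num"] if hit else None, miss)
```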
CLI Metadata Output: Command Line Insights 🖥️
The chunklet CLI adapts metadata output based on your input type and flags. Think of it as your CLI's helpful companion that provides just the right context!
Metadata Control: The --metadata flag gives you control over what gets included.
- With --metadata: Your chunks come with their full context - metadata appears alongside content, either printed to stdout or saved in .json files with --destination
- Without --metadata: Just the chunk content - clean and simple when you want to focus purely on the text
Metadata by Input Type:
- Direct Text Input (chunklet chunk "Your text..."): Uses DocumentChunker with essential Common Metadata fields (chunk_num, span, ...) and source set to "stdin"
- Document Processing (chunklet chunk --doc --source document.pdf): DocumentChunker provides rich document metadata including Common Metadata plus file-specific details (PDF titles, EPUB creators, DOCX authors, ODT creators) as detailed in DocumentChunker Metadata
- Code Processing (chunklet chunk --code --source code.py): CodeChunker includes structural information with Common Metadata and code-specific fields like tree, start_line, end_line as described in CodeChunker Metadata
The CLI automatically provides the most relevant metadata for your use case - making chunk analysis both powerful and intuitive. Smart and simple! 🎯