Chunklet Command Line Interface (CLI): Your Chunking Powerhouse! 🚀
Welcome to the chunklet CLI, your one-stop shop for all things text splitting and intelligent chunking. Whether you're segmenting sentences, dicing up documents, or carving code into semantic blocks, chunklet has your back. It's designed to be super flexible and RAG-ready, making your LLM workflows smoother than ever.
Before we dive into the fun stuff, you can always check your chunklet version or get a quick help guide.:
You can also get specific help for each command
The split Command: Precision Sentence Segmentation ✂️
Need to break down text into individual sentences with surgical precision? The split command is your go-to! It leverages chunklet's powerful SentenceSplitter to give you clean, segmented sentences.
Quick Facts for split
- Operates on a raw text string or a single file (
--source). - Outputs sentences separated by newline characters.
- Perfect for preprocessing text before more complex chunking.
| Flag | Description | Default |
|---|---|---|
<TEXT> |
The input text to split. If not provided, --source must be used. |
None |
--source, -s <PATH> |
Path to a single file to read input from. Cannot be a directory. | None |
--destination, -d <PATH> |
Path to a single file to write the segmented sentences. If not provided, output goes to STDOUT. | STDOUT |
--lang |
Language of the text (e.g., 'en', 'fr', 'auto'). Use 'auto' for automatic detection. | auto |
--verbose, -v |
Enable verbose logging for extra insights. | False |
Scenarios: Splitting Like a Pro!
Scenario 1: Splitting Text Directly (and Multilingually!)
Segment a direct text input containing multiple languages into individual sentences, leveraging automatic language detection.
chunklet split "This is the first sentence. Here is the second sentence, in French. C'est la vie! ¿Cómo estás?" --lang auto
Scenario 2: Splitting a File and Saving the Output
Process a document and save its segmented sentences to a new file. Easy peasy!
The chunk Command: Your Intelligent Chunking Workhorse!
The chunk command is where the real magic happens! It's your versatile tool for breaking down text, documents, and even code into RAG-ready chunks. The "flavor" of chunking (plain text, document, or code) is determined by the flags you provide.
Key Flags for chunk (The Essentials!)
| Flag | Description | Default |
|---|---|---|
<TEXT> |
The input text to chunk. If not provided, --source must be used. |
None |
--source, -s <PATH> |
Path(s) to one or more files or directories to read input from. Repeat for multiple sources (e.g., -s file1.txt -s dir/). |
None |
--destination, -d <PATH> |
Path to a file (for single output) or a directory (for batch output) to write the chunks. If not provided, output goes to STDOUT. | STDOUT |
--max-tokens |
Maximum number of tokens per chunk. Applies to all chunking strategies. (Must be >= 12) | None |
--max-sentences |
Maximum number of sentences per chunk. Applies to PlainTextChunker and DocumentChunker. (Must be >= 1) | None |
--max-section-breaks |
Maximum number of section breaks per chunk. Applies to PlainTextChunker and DocumentChunker. (Must be >= 1) | None |
--overlap-percent |
Percentage of overlap between chunks (0-85). Applies to PlainTextChunker and DocumentChunker. | 20.0 |
--offset |
Starting sentence offset for chunking. Applies to PlainTextChunker and DocumentChunker. | 0 |
--lang |
Language of the text (e.g., 'en', 'fr', 'auto'). (default: auto) | auto |
--metadata |
Include rich metadata (source, span, chunk num, etc.) in the output. If --destination is a directory, metadata is saved as separate .json files; otherwise, it's included inline in the output. |
False |
--verbose, -v |
Enable verbose logging for extra insights. | False |
General Text & Document Chunking (Default or with --doc) 📄
This is your bread-and-butter chunking for everyday text and diverse document types.
- Default Behavior: If neither
--docnor--codeis specified,chunkletuses the PlainTextChunker for direct text input. ThePlainTextChunkeris designed to transform unruly text into perfectly sized, context-aware chunks. - Document Power-Up: Activate the DocumentChunker with the
--docflag to process.pdf,.docx,.epub,.txt,.tex,.html,.hml,.md,.rst, and.rtffiles! It intelligently extracts text and then applies the same robust chunking logic.
Key Flags for Document Power-Up
| Flag | Description | Default |
|---|---|---|
--doc |
Activate the DocumentChunker for multi-format file processing. |
False |
--n-jobs |
Number of parallel jobs for batch processing. (None uses all available cores) | None |
--on-errors |
How to handle errors during batch processing: raise (stop), skip (ignore file, continue), or break (halt, return partial result). |
raise |
Scenarios: Text & Document Chunking in Action!
Scenario 1: Basic Text Chunking with Token Limits and Overlap
Chunk a long text string into segments, ensuring no chunk exceeds 200 tokens, with a healthy 15% overlap for context.
chunklet chunk "The quick brown fox jumps over the lazy dog. This is the first sentence. The second sentence is a bit longer. And this is the third one. Finally, the fourth sentence concludes our example. The last sentence is here to finish the text."
--max-tokens 200 \
--overlap-percent 15
Scenario 2: Chunking a PDF Document with Sentence and Section Break Limits
Process a PDF document, ensuring chunks are no more than 10 sentences or 2 section breaks, and save the output to a file.
chunklet chunk --doc --source my_report.pdf \
--max-sentences 10 \
--max-section-breaks 2 \
--destination processed_report_chunks.txt
Scenario 3: Batch Processing a Directory of Documents (with Error Handling!)
Process all supported documents within a directory, saving the chunks to a new folder. If any file causes an error, chunklet will gracefully skip it and continue!
chunklet chunk --doc \
--source /path/to/my/project_docs \
--destination ./processed_chunks \
--n-jobs 4 \
--on-errors skip \
--max-tokens 1024 \
--metadata # Don't forget your metadata!
Scenario 4: Chunking a Text File with a Specific Language and Metadata
Chunk a French text file, limiting by tokens, and include all the juicy metadata for later analysis.
Code Chunking (with --code) 🧑💻
For the developers, by the developers! The CodeChunker is a language-agnostic wizard that breaks your source code into semantically meaningful blocks (functions, classes, etc.). Activate it with the --code flag.
- Heads Up! This mode is primarily token-based.
--max-sentences,--max-section-breaks, and--overlap-percentare generally ignored here, as code structure takes precedence.
Key Flags for Code Chunking
| Flag | Description | Default |
|---|---|---|
--code |
Activate the CodeChunker for structurally-aware code segmentation. |
False |
--max-lines |
Maximum number of lines per chunk. (Must be >= 5) | None |
--max-functions |
Maximum number of functions per chunk. (Must be >= 1) | None |
--docstring-mode |
Docstring processing strategy: summary (first line), all, or excluded. |
all |
--strict |
If True, raise an error when structural blocks exceed --max-tokens. If False, split oversized blocks. |
True |
--include-comments |
Include comments in output chunks. | True |
Scenarios: Code Chunking in Action!
Scenario 1: Chunking a Single Python File, Excluding Comments
Get a clean, comment-free view of your code's structure. Perfect for quick reviews!
Scenario 2: Batch Chunking a Codebase, Allowing Oversized Blocks
Process an entire code repository, letting chunklet split any functions or classes that are just too long, and save everything to a dedicated folder.
chunklet chunk --code \
--source ./my_awesome_repo \
--destination ./code_chunks \
--max-tokens 1024 \
--strict False \
--n-jobs 8 \
Scenario 3: Extracting Function Summaries (Docstring Mode: Summary)
Focus on the "what" of your functions by only including the first line of their docstrings.
Scenario 4: Chunking by Lines and Functions for Granular Control
For super-fine-grained control, chunk a file by both maximum lines and maximum functions per chunk.
🛠️ Advanced System Hooks
These flags are your secret weapons for scaling up operations, integrating with external tools, and getting the most out of your chunked data. They apply to the chunk command.
System Hook Flags
| Flag | Description | Default |
|---|---|---|
--tokenizer-command |
A shell command string for token counting. It must take text via STDIN and output the integer count via STDOUT. | None |
--n-jobs |
Number of parallel processes to use during batch operations. (None uses all available CPU cores) | None |
--on-errors |
Defines batch error handling: raise (stop), skip (ignore file, continue), or break (halt, return partial result). |
raise |
--metadata |
Include rich metadata (source, span, chunk num, etc.) in the output. If --destination is a directory, metadata is saved as separate .json files; otherwise, it's included inline in the output. |
False |
--verbose, -v |
Enable verbose logging for debugging or process detail. | False |
Scenarios: Unleashing Advanced Power!
Scenario 1: Verbose Debugging for a Single File
When things get tricky, crank up the verbosity to see exactly what chunklet is doing under the hood while chunking a specific file.
Scenario 2: Batch Processing with Parallelism and Error Skipping
Process a large collection of diverse documents, leveraging all your CPU cores, and gracefully skip any problematic files without halting the entire operation. Plus, get all the metadata!
chunklet chunk --doc \
--source /path/to/massive_document_archive \
--destination ./final_chunks \
--n-jobs -1 # Use all available cores!
--on-errors skip \
--max-tokens 512 \
--metadata
Scenario 3: Processing Multiple Specific Files with Advanced Hooks
Process a selection of individual files, explicitly listing each one, and apply advanced chunking parameters. This demonstrates how to handle a non-directory batch of files, ensuring each is processed with metadata and error handling.
chunklet chunk --doc \
--source my_document.pdf \
--source another_report.docx \
--source plain_text_notes.txt \
--destination ./processed_specific_files \
--max-tokens 700 \
--metadata \
--on-errors skip
Scenario 4: Custom Token Counting with an External Script
Align chunklet's chunk sizes perfectly with your LLM's token limits using any external tokenizer you can imagine!
Create your external script (e.g., my_llm_tokenizer.py):
# my_llm_tokenizer.py
import sys
import tiktoken # Or your LLM's specific tokenizer library
# Read text from stdin
text = sys.stdin.read()
# Replace with your actual token counting logic (e.g., for OpenAI's GPT models)
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode(text))
print(token_count) # Must print only the integer count
Now, run chunklet with your custom tokenizer:
chunklet chunk \
--text "This is a super important piece of text that needs precise token counting for my large language model."
--max-tokens 50 \
--tokenizer-command "python ./my_llm_tokenizer.py" \
--metadata
Diving Deeper into Metadata
Want to know exactly what kind of rich context chunklet attaches to your chunks? From source paths and character spans to document-specific properties and code AST details.
👉 Head over to the Metadata in Chunklet-py guide to unlock all its secrets!
API Reference
For a deep dive into the chunklet CLI, its commands, and all the nitty-gritty details, check out the full API documentation