# Custom Tokenizers

## Why Custom Tokenizers?
Got a specific LLM in mind? Or maybe a unique tokenization method for your use case? Chunklet's got you covered! Our custom tokenizer support lets you plug in any tokenization logic you can imagine. Because one size definitely doesn't fit all models!
Whether you're working with GPT-4, Claude, a local model, or something totally custom - Chunklet plays nice with your tokenizer of choice! 🎯
## How It Works
Chunklet passes your text to the tokenizer via STDIN and expects an integer token count on STDOUT. Simple as that!
| Component | Details |
|---|---|
| Input | Read text from stdin |
| Output | Print only the integer count to stdout |
| Language | Any programming language works! |
## Any Language, Any Platform
Your tokenizer can be Python, JavaScript, Go, Rust, Bash, or whatever floats your boat! As long as it reads from stdin and outputs a number, you're golden. 🌟
## Examples
### Python - The Classic Choice
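A minimal sketch that follows the stdin/stdout contract using naive whitespace counting - for model-accurate counts, swap the counting logic for a real tokenizer (e.g. tiktoken for OpenAI models):

```python
#!/usr/bin/env python3
import sys


def count_tokens(text: str) -> int:
    # Naive whitespace split; replace with your model's real
    # tokenizer for accurate counts.
    return len(text.split())


if __name__ == "__main__":
    # Chunklet pipes the text in via stdin; print only the integer.
    print(count_tokens(sys.stdin.read()))
```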
### JavaScript/Node.js - For the JS Fans
### Shell/Bash - Keep It Simple!
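A one-liner sketch built on `wc -w` - word count is only a rough proxy for tokens, and `tr` strips the padding some `wc` builds emit:

```shell
#!/usr/bin/env bash
# Naive sketch: count whitespace-separated words on stdin and
# print only the integer.
count_tokens() {
  wc -w | tr -d '[:space:]'
}

# Chunklet pipes the text in via stdin.
count_tokens
```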
### Go - For the Performance Nerds
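And a compiled variant in Go, using `strings.Fields` as the stand-in tokenizer:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// countTokens does a naive whitespace split; swap in a real
// tokenizer library for model-accurate counts.
func countTokens(text string) int {
	return len(strings.Fields(text))
}

func main() {
	// Chunklet pipes the text in via stdin; print only the integer.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(countTokens(string(data)))
}
```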
## No Extra Fluff!
Chunklet expects only the integer. No units, no explanations, no emoji - just the raw number. Otherwise, things might get a little... confused. 🤯
## Usage
### CLI - Command Line Power!
With the `chunk` command:

```shell
chunklet chunk --text "Your text here" \
  --max-tokens 50 \
  --tokenizer-command "python ./my_tokenizer.py"
```
With the `visualize` command:
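The flags below simply mirror the `chunk` example and are assumed to carry over - check `chunklet visualize --help` for the actual options:

```shell
chunklet visualize --text "Your text here" \
  --tokenizer-command "python ./my_tokenizer.py"
```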
### Make It Executable (with shebang)
If you're on Unix/Linux/macOS and your script has a shebang (e.g., `#!/usr/bin/env python3`), you can make it executable with `chmod +x my_tokenizer.py` and then use `--tokenizer-command "./my_tokenizer.py"` - no interpreter prefix needed! 🚀
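For example:

```shell
chmod +x my_tokenizer.py
chunklet chunk --text "Your text here" \
  --max-tokens 50 \
  --tokenizer-command "./my_tokenizer.py"
```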
### Programmatic - Python Power!
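Chunklet's own chunker classes are covered in the DocumentChunker API docs; as a sketch of the same stdin/stdout contract from Python, here is how a tokenizer command can be invoked and its count parsed (the function name is ours, not chunklet's):

```python
import shlex
import subprocess


def count_tokens_via_command(command: str, text: str) -> int:
    """Pipe text to an external tokenizer command's stdin and parse
    the integer token count it prints on stdout - the same contract
    chunklet uses for --tokenizer-command."""
    result = subprocess.run(
        shlex.split(command),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```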
## Learn More
- Beginner's Intro to Reading from Standard Input - for understanding stdin basics in Python
- CLI Documentation for command-line usage
- Document Chunker for multi-format document processing
- DocumentChunker API for programmatic usage