Custom Tokenizers

Why Custom Tokenizers?

Got a specific LLM in mind? Or maybe a unique tokenization method for your use case? Chunklet's got you covered! Our custom tokenizer support lets you plug in any tokenization logic you can imagine. Because one size definitely doesn't fit all models!

Whether you're working with GPT-4, Claude, a local model, or something totally custom - Chunklet plays nice with your tokenizer of choice! 🎯

How It Works

Chunklet passes your text to the tokenizer via STDIN and expects an integer token count on STDOUT. Simple as that!

| Component | Details |
| --- | --- |
| Input | Read text from stdin |
| Output | Print only the integer count to stdout |
| Language | Any programming language works! |
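To make the stdin/stdout contract concrete, here is a minimal sketch of how a caller could drive a tokenizer command from Python. It is not Chunklet's actual implementation: the function name `count_tokens_via_command` and the default timeout are assumptions made for illustration.

```python
import shlex
import subprocess
import sys


def count_tokens_via_command(command: str, text: str, timeout: float = 10.0) -> int:
    """Feed `text` to `command` on stdin and parse its stdout as an integer.

    Hypothetical helper, not part of Chunklet's API.
    """
    result = subprocess.run(
        shlex.split(command),   # split the command string into argv
        input=text,             # delivered to the tokenizer's stdin
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,             # raise if the tokenizer exits non-zero
    )
    # The tokenizer is expected to print only the count.
    return int(result.stdout.strip())


if __name__ == "__main__":
    # Stand-in tokenizer: a one-line Python word counter.
    word_counter = (
        f"{shlex.quote(sys.executable)} "
        '-c "import sys; print(len(sys.stdin.read().split()))"'
    )
    print(count_tokens_via_command(word_counter, "the quick brown fox"))
```

A helper like this can also be handy for smoke-testing your own tokenizer script before handing it to Chunklet.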

Any Language, Any Platform

Your tokenizer can be Python, JavaScript, Go, Rust, Bash, or whatever floats your boat! As long as it reads from stdin and outputs a number, you're golden. 🌟

Examples

Python - The Classic Choice

#!/usr/bin/env python3
# my_tokenizer.py
import sys
import tiktoken

text = sys.stdin.read()
encoding = tiktoken.encoding_for_model("gpt-4")
print(len(encoding.encode(text)))

JavaScript/Node.js - For the JS Fans

#!/usr/bin/env node
// my_tokenizer.js
const readline = require('readline');

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: false
});

let text = '';
rl.on('line', (line) => { text += line + '\n'; });
rl.on('close', () => {
  const tokens = text.split(/\s+/).filter(w => w.length > 0).length;
  console.log(tokens);
});

Shell/Bash - Keep It Simple!

#!/bin/bash
# my_tokenizer.sh
# Simple word count - works everywhere!
text=$(cat)
echo "$text" | wc -w

Go - For the Performance Nerds

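// my_tokenizer.go
// Note: Go source files have no shebang support - run this with a prefix
// such as "go run my_tokenizer.go", or build a binary first.
package main

import (
    "fmt"
    "io"
    "os"
    "strings"
)

func main() {
    // Read all of stdin; exit non-zero on failure so the caller sees the error.
    data, err := io.ReadAll(os.Stdin)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    // Whitespace-separated word count as a stand-in for a real tokenizer.
    tokens := strings.Fields(string(data))
    fmt.Println(len(tokens))
}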

No Extra Fluff!

Chunklet expects only the integer. No units, no explanations, no emoji - just the raw number. Otherwise, things might get a little... confused. 🤯

# ❌ Bad - extra output confuses Chunklet
print(f"Token count: {count}")

# ✅ Good - just the number
print(count)
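
To see why extra output is a problem, consider a strict parser on the receiving end. The sketch below is an assumption about how such parsing could work, not Chunklet's real code; `parse_token_count` is a hypothetical name.

```python
def parse_token_count(stdout: str) -> int:
    """Accept only a bare non-negative integer (surrounding whitespace is fine).

    Hypothetical helper, not part of Chunklet's API.
    """
    cleaned = stdout.strip()
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(f"expected a bare integer, got {cleaned!r}")
    return int(cleaned)


if __name__ == "__main__":
    print(parse_token_count("42\n"))  # a bare number parses cleanly
    try:
        parse_token_count("Token count: 42")
    except ValueError as exc:
        print(f"rejected: {exc}")
```

Anything other than digits, including a label or a unit, leaves the parser with no reliable way to recover the count.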

Usage

CLI - Command Line Power!

With chunk command:

chunklet chunk --text "Your text here" \
  --max-tokens 50 \
  --tokenizer-command "python ./my_tokenizer.py"

With visualize command:

chunklet visualize \
  --tokenizer-command "python ./my_tokenizer.py" \
  --tokenizer-timeout 30

Make It Executable (with shebang)

If you're on Unix/Linux/Mac and your script has a shebang (e.g., #!/usr/bin/env python3), you can make it executable with chmod +x my_tokenizer.py and then use --tokenizer-command "./my_tokenizer.py" - no interpreter prefix needed! 🚀

Programmatic - Python Power!

from chunklet import DocumentChunker

# Your custom tokenizer function
def my_tokenizer(text: str) -> int:
    return len(text.split())  # Simple word count!

text = "Your text here"

chunker = DocumentChunker(token_counter=my_tokenizer)
chunks = chunker.chunk_text(text, max_tokens=50)

for chunk in chunks:
    print(chunk.content)
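
If a plain word count is too coarse, a slightly richer counter can be dropped in the same way. The sketch below assumes, as the example above suggests, that `token_counter` accepts any callable taking a string and returning an integer. The name `regex_token_counter` and the pattern are illustrative choices, not something Chunklet ships.

```python
import re

# Words count as one token each; every punctuation mark counts separately.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def regex_token_counter(text: str) -> int:
    """Approximate a token count using only the standard library."""
    return len(_TOKEN_PATTERN.findall(text))


if __name__ == "__main__":
    print(regex_token_counter("Hello, world!"))  # Hello , world ! -> 4
```

This is still only an approximation of what a real model tokenizer produces, so prefer the model's own tokenizer when exact limits matter.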
Learn More