Code Chunker
Quick Install
This installs all the code processing dependencies needed for language-agnostic code chunking! 💻
Code Chunker: Your Code Intelligence Sidekick!
Ever stared at a massive codebase feeling like you're decoding ancient hieroglyphs? The CodeChunker is your trusty code companion that transforms tangled functions and classes into clean, understandable chunks that actually make sense!
Forget basic regex hacks! This language-agnostic wizard uses clever patterns to identify functions, classes, and logical blocks across Python, JavaScript, Java, C++, and more - no PhD required.
Language-agnostic and lightweight - ideal for code understanding and generation tasks, analysis, documentation, and AI model training.
Code Chunker Superpowers! ⚡
The CodeChunker comes packed with smart features for your coding adventures:
- Rule-Based and Language-Agnostic: Uses universal patterns to spot code blocks, working with tons of languages out of the box - Python, C++, Java, JavaScript, and more!
- Convention-Aware: Assumes your code follows standard formatting - no full language parsers needed for surprisingly accurate results!
- Structurally Neutral: Handles mixed-language code like a pro - SQL in Python? JavaScript in HTML? No problem, it treats them as part of the block.
- Flexible Constraint-Based Chunking: Ultimate control over code segmentation! Mix and match limits based on tokens, lines, or functions for perfect chunks.
- Annotation-Aware: Smart about comments and docstrings - uses them to better understand your code's structure.
- Flexible Source Input: Feed it code as strings, file paths, or pathlib.Path objects. File paths? It'll read them automatically!
- Strict Mode Control: By default, structural blocks are never split. If one exceeds your limits, a TokenLimitError is raised instead. Want more flexibility? Set strict=False.
Code Constraints: Your Chunking Control Panel! 🎛️
CodeChunker works primarily in structural mode, letting you set chunk boundaries based on code structure. Fine-tune your chunks with these constraint options:
| Constraint | Value Requirement | Description |
|---|---|---|
| max_tokens | int >= 12 | Token budget master! Code blocks exceeding this limit get split at smart structural boundaries. |
| max_lines | int >= 5 | Line count commander! Perfect for managing chunks where line numbers often match logical code units. |
| max_functions | int >= 1 | Function group guru! Keeps related functions together or splits them when you hit the limit. |
Constraint Must-Have!
You must specify at least one limit (max_tokens, max_lines, or max_functions) when using chunk or batch_chunk. Skip this and you'll get an InvalidInputError - rules are rules!
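To picture the rule, here is a minimal sketch of that kind of validation. It is not CodeChunker's actual code: the InvalidInputError class below is a local stand-in, validate_limits is a hypothetical helper, and the minimum values are simply copied from the table above.

```python
class InvalidInputError(ValueError):
    """Local stand-in for the library's error of the same name."""


# Minimum values copied from the constraint table above.
MINIMUMS = {"max_tokens": 12, "max_lines": 5, "max_functions": 1}


def validate_limits(max_tokens=None, max_lines=None, max_functions=None):
    """Return the limits that were set; raise if none is set or one is too small."""
    candidates = {
        "max_tokens": max_tokens,
        "max_lines": max_lines,
        "max_functions": max_functions,
    }
    given = {name: value for name, value in candidates.items() if value is not None}
    if not given:
        raise InvalidInputError(
            "Specify at least one of max_tokens, max_lines, or max_functions."
        )
    for name, value in given.items():
        if value < MINIMUMS[name]:
            raise InvalidInputError(f"{name} must be >= {MINIMUMS[name]}, got {value}.")
    return given
```

Calling validate_limits(max_lines=10) returns the single limit that was set, while calling it with no arguments raises the stand-in error.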
The CodeChunker has two main methods: chunk for a single code input and batch_chunk for processing multiple code inputs. chunk returns a list of Box objects, while batch_chunk returns a generator that yields a Box object for each chunk. Each Box has content (str) and metadata (dict). For detailed information about metadata structure and usage, see the Metadata guide.
Single Run:
Let's see CodeChunker in action with a single code input. The flexible source parameter accepts:
- Raw code as a string
- File path as a string
- pathlib.Path object
When you provide a file path, CodeChunker automatically handles reading the file for you!
Chunking by Lines: Line Count Control! 📏
Ready to chunk code by line count? This gives you predictable, size-based chunks:
- Sets the maximum number of lines per chunk. If a code block exceeds this limit, it will be split.
- Set to True to include comments in the output chunks. Defaults to True.
- docstring_mode="all" ensures that complete docstrings, with all their multi-line details, are preserved in the code chunks. Other options are "summary" to include only the first line, or "excluded" to remove them entirely. Default is "all".
- When strict=False, structural blocks (like functions or classes) that exceed the limit set will be split into smaller chunks. If strict=True (default), a TokenLimitError is raised instead.
Output:
--- Chunk 1 ---
Content:
"""
Module docstring
"""
import os
Metadata:
chunk_num: 1
tree: global
start_line: 1
end_line: 7
span: (0, 38)
source: N/A
--- Chunk 2 ---
Content:
class Calculator:
"""
A simple calculator class.
A calculator that Contains basic arithmetic operations for demonstration purposes.
"""
Metadata:
chunk_num: 2
tree: global
└─ class Calculator
start_line: 8
end_line: 14
span: (38, 192)
source: N/A
--- Chunk 3 ---
Content:
def add(self, x, y):
"""Add two numbers and return result.
This is a longer description that should be truncated
in summary mode. It has multiple lines and details.
"""
result = x + y
return result
Metadata:
chunk_num: 3
tree: global
└─ class Calculator
└─ def add(
start_line: 15
end_line: 23
span: (192, 444)
source: N/A
--- Chunk 4 ---
Content:
def multiply(self, x, y):
# Multiply two numbers
return x * y
def standalone_function():
"""A standalone function."""
return True
Metadata:
chunk_num: 4
tree: global
├─ class Calculator
│ └─ def multiply(
└─ def standalone_function(
start_line: 24
end_line: 30
span: (444, 603)
source: N/A
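The three docstring_mode options described above ("all", "summary", "excluded") can be pictured with a small standalone sketch. This is an illustration only, not how CodeChunker processes docstrings: apply_docstring_mode is a hypothetical helper, and the regex handles just triple-double-quoted Python docstrings in a naive way.

```python
import re

# Naive pattern for a triple-double-quoted docstring, including multi-line ones.
DOCSTRING = re.compile(r'"""(.*?)"""', re.DOTALL)


def apply_docstring_mode(code: str, mode: str = "all") -> str:
    """Return code with docstrings kept, reduced to their first line, or removed."""
    if mode == "all":
        return code

    if mode == "summary":
        def first_line(match):
            stripped = match.group(1).strip()
            summary = stripped.splitlines()[0] if stripped else ""
            return f'"""{summary}"""'

        return DOCSTRING.sub(first_line, code)

    if mode == "excluded":
        return DOCSTRING.sub("", code)

    raise ValueError(f"Unknown docstring_mode: {mode!r}")
```

With "summary", a docstring such as the one on Calculator.add in the example collapses to its first line, which is why that docstring describes itself as "truncated in summary mode".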
Enable Verbose Logging
To see detailed logging during the chunking process, you can set the verbose parameter to True when initializing the CodeChunker:
Chunking by Tokens: Token Budget Master! 🪙
Here's how you can use CodeChunker to chunk code by the number of tokens:
Output:
--- Chunk 1 ---
Content:
"""
Module docstring
"""
import os
class Calculator:
"""
A simple calculator class.
A calculator that Contains basic arithmetic operations for demonstration purposes.
"""
Metadata:
chunk_num: 1
tree: global
└─ class Calculator
start_line: 1
end_line: 14
span: (0, 192)
source: N/A
--- Chunk 2 ---
Content:
def add(self, x, y):
"""Add two numbers and return result.
This is a longer description that should be truncated
in summary mode. It has multiple lines and details.
"""
result = x + y
return result
def multiply(self, x, y):
# Multiply two numbers
return x * y
Metadata:
chunk_num: 2
tree: global
└─ class Calculator
├─ def add(
└─ def multiply(
start_line: 15
end_line: 27
span: (192, 527)
source: N/A
--- Chunk 3 ---
Content:
def standalone_function():
"""A standalone function."""
return True
Metadata:
chunk_num: 3
tree: global
└─ def standalone_function(
start_line: 28
end_line: 30
span: (527, 603)
source: N/A
Overrides token_counter
You can also provide the token_counter directly in the chunk method call (e.g., chunker.chunk(..., token_counter=my_tokenizer_function)). If a token_counter is provided in both the constructor and the chunk method, the one in the chunk method will be used.
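As a rough illustration of that precedence rule, the sketch below assumes a token counter is any callable that takes a string and returns an int. The two counters and the resolve_token_counter helper are hypothetical names made up for this example; they are not part of the library.

```python
def word_counter(text: str) -> int:
    """Crude counter: one token per whitespace-separated word."""
    return len(text.split())


def char_counter(text: str) -> int:
    """Another crude counter: roughly four characters per token."""
    return max(1, len(text) // 4)


def resolve_token_counter(constructor_counter=None, call_counter=None):
    """Prefer the counter passed at call time; fall back to the constructor's."""
    if call_counter is not None:
        return call_counter
    return constructor_counter


# The call-level counter wins when both are supplied.
active = resolve_token_counter(constructor_counter=word_counter, call_counter=char_counter)
```

In practice you would likely plug in a real tokenizer's counting function rather than these toy counters.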
Chunking by Functions: Function Group Guru! 👥
This constraint is useful when you want to ensure that each chunk contains a specific number of functions, helping to maintain logical code units.
Output:
--- Chunk 1 ---
Content:
"""
Module docstring
"""
import os
class Calculator:
"""
A simple calculator class.
A calculator that Contains basic arithmetic operations for demonstration purposes.
"""
def add(self, x, y):
"""Add two numbers and return result.
This is a longer description that should be truncated
in summary mode. It has multiple lines and details.
"""
result = x + y
return result
Metadata:
chunk_num: 1
tree: global
└─ class Calculator
└─ def add(
start_line: 1
end_line: 23
span: (0, 444)
source: N/A
--- Chunk 2 ---
Content:
def multiply(self, x, y):
return x * y
Metadata:
chunk_num: 2
tree: global
└─ class Calculator
└─ def multiply(
start_line: 24
end_line: 27
span: (444, 527)
source: N/A
--- Chunk 3 ---
Content:
def standalone_function():
"""A standalone function."""
return True
Metadata:
chunk_num: 3
tree: global
└─ def standalone_function(
start_line: 28
end_line: 30
span: (527, 603)
source: N/A
Combining Multiple Constraints: Mix and Match Magic! 🎭
The real power of CodeChunker comes from combining multiple constraints. This allows for highly specific and granular control over how your code is chunked. Here are a few examples of how you can combine different constraints.
By Lines and Tokens
This is useful when you want to limit by both the number of lines and the overall token count, whichever is reached first.
By Lines and Functions
This combination is great for ensuring that chunks don't span across too many functions while also keeping the line count in check.
By Tokens and Functions
A powerful combination for structured code where you want to respect function boundaries while adhering to a strict token budget.
By Lines, Tokens, and Functions
For the ultimate level of control, you can combine all three constraints. The chunking will stop as soon as any of the three limits is reached.
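The "whichever limit is reached first" idea can be sketched as a greedy grouping over pre-split structural blocks. This is a simplified model under stated assumptions, not CodeChunker's algorithm: the Block class and pack_blocks function are hypothetical, token counts are supplied by hand, and an oversized single block is simply kept whole rather than split or rejected.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One structural unit (e.g. a function) with a precomputed token count."""
    text: str
    tokens: int
    is_function: bool = False

    @property
    def lines(self) -> int:
        return len(self.text.splitlines())


def pack_blocks(blocks, max_tokens=None, max_lines=None, max_functions=None):
    """Group blocks greedily; start a new chunk when any set limit would be exceeded."""
    limits = {"tokens": max_tokens, "lines": max_lines, "functions": max_functions}
    chunks, current = [], []
    totals = {"tokens": 0, "lines": 0, "functions": 0}

    for block in blocks:
        cost = {
            "tokens": block.tokens,
            "lines": block.lines,
            "functions": int(block.is_function),
        }
        over = any(
            limit is not None and totals[key] + cost[key] > limit
            for key, limit in limits.items()
        )
        if over and current:
            # Close the current chunk and start counting again from zero.
            chunks.append(current)
            current = []
            totals = dict.fromkeys(totals, 0)
        current.append(block)
        for key in totals:
            totals[key] += cost[key]

    if current:
        chunks.append(current)
    return chunks
```

Limits left as None are ignored, so the same function covers every combination listed above.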
Batch Run: Processing Multiple Code Files Like a Pro! 📚
While chunk is perfect for single code inputs, batch_chunk is your power player for processing multiple code files in parallel. It uses a memory-friendly generator so you can handle massive codebases with ease.
Given we have the following code snippets saved as individual files in a code_examples directory:
cpp_calculator.cpp
#include <iostream>
#include <string>
// Function 1: Simple greeting
void say_hello(std::string name) {
std::cout << "Hello, " << name << std::endl;
}
// Function 2: Logic block
int calculate_sum(int a, int b) {
if (a < 0 || b < 0) {
return -1; // Error code
}
int result = a + b;
return result;
}
JavaDataProcessor.java
package com.chunker.data;
public class DataProcessor {
private String sourcePath;
// Constructor
public DataProcessor(String path) {
this.sourcePath = path;
}
// Method 1: Getter
public String getPath() {
return this.sourcePath;
}
// Method 2: Core processing logic
public boolean process() {
if (this.sourcePath.isEmpty()) {
return false;
}
// Assume processing logic here
return true;
}
}
js_utils.js
// Utility function
const sanitizeInput = (input) => {
return input.trim().substring(0, 100);
};
// Main function with control flow
function processArray(data) {
if (!data || data.length === 0) {
return 0;
}
let total = 0;
// Loop structure
for (let i = 0; i < data.length; i++) {
total += data[i];
}
return total;
}
go_config.go
package main
import (
"fmt"
)
// Struct definition
type Config struct {
Timeout int
Retries int
}
// Function 1: Factory function
func NewConfig() Config {
return Config{
Timeout: 5000,
Retries: 3,
}
}
// Function 2: Method on the struct
func (c *Config) displayInfo() {
fmt.Printf("Timeout: %dms, Retries: %d\n", c.Timeout, c.Retries)
}
We can process them all at once by passing a list of their paths to the batch_chunk method:
- Specifies the number of parallel processes to use for chunking. The default value is None (use all available CPU cores).
- Determines how errors during chunking are handled. If set to "raise" (default), an exception is raised immediately. If set to "break", processing halts and the partial results are returned. If set to "ignore", errors are silently ignored.
- Displays a progress bar during batch processing. The default value is False.
Output:
Chunking ...: 0%| | 0/4 [00:00, ?it/s]
--- Chunk 1 ---
Content:
#include <iostream>
#include <string>
void say_hello(std::string name) {
std::cout << "Hello, " << name << std::endl;
}
int calculate_sum(int a, int b) {
if (a < 0 || b < 0) {
return -1;
}
int result = a + b;
return result;
}
Metadata:
chunk_num: 1
tree: global
start_line: 1
end_line: 17
span: (0, 329)
source: code_examples/cpp_calculator.cpp
Chunking ...: 50%|███████████████ | 2/4 [00:00, 19.73it/s]
--- Chunk 2 ---
Content:
const sanitizeInput = (input) => {
return input.trim().substring(0, 100);
};
function processArray(data) {
if (!data || data.length === 0) {
return 0;
}
let total = 0;
for (let i = 0; i < data.length; i++) {
total += data[i];
}
return total;
}
Metadata:
chunk_num: 1
tree: global
└─ function processArray(
start_line: 1
end_line: 19
span: (0, 372)
source: code_examples/js_utils.js
--- Chunk 3 ---
Content:
package com.chunker.data;
public class DataProcessor {
private String sourcePath;
public DataProcessor(String path) {
this.sourcePath = path;
}
public String getPath() {
return this.sourcePath;
}
public boolean process() {
if (this.sourcePath.isEmpty()) {
return false;
}
return true;
}
}
Metadata:
chunk_num: 1
tree: global
├─ package com
└─ public class DataProcessor
├─ public DataProcessor(
├─ public String getPath(
└─ public boolean process(
start_line: 1
end_line: 25
span: (0, 500)
source: code_examples/JavaDataProcessor.java
Chunking ...: 50%|███████████████ | 2/4 [00:00, 19.73it/s]
--- Chunk 4 ---
Content:
package main
import (
"fmt"
)
type Config struct {
Timeout int
Retries int
}
func NewConfig() Config {
return Config{
Timeout: 5000,
Retries: 3,
}
}
func (c *Config) displayInfo() {
fmt.Printf("Timeout: %dms, Retries: %d\n", c.Timeout, c.Retries)
}
Metadata:
chunk_num: 1
tree: global
├─ package main
├─ type Config
└─ func NewConfig(
start_line: 1
end_line: 26
span: (0, 382)
source: code_examples/go_config.go
Chunking ...: 100%|██████████████████████████████| 4/4 [00:00, 19.71it/s]
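The three error-handling modes described earlier ("raise", "break", "ignore") can be pictured with a small sequential sketch. The process_all function and the on_errors parameter name are assumptions made for this illustration; the real batch_chunk runs in parallel and its internals are not shown here.

```python
def process_all(items, worker, on_errors="raise"):
    """Yield worker(item) for each item, handling failures according to on_errors."""
    for item in items:
        try:
            result = worker(item)
        except Exception:
            if on_errors == "raise":
                raise  # propagate the first failure immediately
            if on_errors == "break":
                return  # stop here; results already yielded are kept
            if on_errors == "ignore":
                continue  # skip the failing item and carry on
            raise ValueError(f"Unknown on_errors value: {on_errors!r}")
        yield result
```

With int as the worker and one unparsable item in the middle, "ignore" skips that item, "break" stops just before it, and "raise" surfaces the error to the caller.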
Generator Cleanup
When using batch_chunk, it's crucial to ensure the generator is properly closed, especially if you don't iterate through all the chunks. This is necessary to release the underlying multiprocessing resources. The recommended way is to use a try...finally block to call close() on the generator. For more details, see the Troubleshooting guide.
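The try...finally pattern looks like this with a plain Python generator. The chunk_stream function is a stand-in written for this example (it is not batch_chunk), and the released list exists only to make the cleanup visible.

```python
released = []  # records when the stand-in generator cleans up


def chunk_stream(sources):
    """Stand-in for a batch generator that owns resources such as a worker pool."""
    try:
        for source in sources:
            yield f"chunk from {source}"
    finally:
        # Runs on exhaustion and also when close() is called early.
        released.append(True)


chunks = chunk_stream(["a.py", "b.py", "c.py"])
first = None
try:
    for chunk in chunks:
        first = chunk
        break  # stop early, leaving the generator unfinished
finally:
    chunks.close()  # triggers the generator's own cleanup
```

Calling close() on an already finished generator is harmless, so the finally block is safe whether or not the loop ran to completion.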
Separator: Keeping Your Code Batches Organized! 📋
The separator parameter lets you add a custom marker that gets yielded after all chunks from a single code file are processed. Super handy for batch processing when you want to clearly separate chunks from different source files.
Note
None cannot be used as a separator.
- Avoid processing the empty list at the end if the stream ends with the separator.
Output:
Chunking ...: 0%| | 0/2 [00:00, ?it/s]
--- Chunks for Document 1 ---
Content:
def greet_user(name):
"""Returns a simple greeting string."""
message = "Welcome back, " + name
return message
Metadata: {'chunk_num': 1, 'tree': 'global\n└─ def greet_user(\n', 'start_line': 1, 'end_line': 5, 'span': (0, 124), 'source': 'N/A'}
Chunking ...: 50%|███████████████ | 1/2 [00:00, 9.48it/s]
--- Chunks for Document 2 ---
Content:
public class Utility
{
// C# Method
Metadata: {'chunk_num': 1, 'tree': 'global\n└─ public class Utility\n', 'start_line': 1, 'end_line': 4, 'span': (0, 41), 'source': 'N/A'}
Content:
public int Add(int x, int y)
{
int sum = x + y;
return sum;
}
}
Metadata: {'chunk_num': 2, 'tree': 'global\n└─ public class Utility\n └─ public int Add(\n', 'start_line': 5, 'end_line': 10, 'span': (41, 133), 'source': 'N/A'}
Chunking ...: 100%|██████████████████████████████| 2/2 [00:00, 1.92it/s]
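One way to consume a stream that contains separators is to collect chunks into per-source groups, as in the sketch below. It works on plain strings rather than real chunk objects, and both the SEPARATOR value and the group_by_separator helper are hypothetical choices for this example.

```python
SEPARATOR = "---END-OF-SOURCE---"  # hypothetical marker; any value other than None


def group_by_separator(stream, separator):
    """Collect items into one list per source, splitting on the separator."""
    groups, current = [], []
    for item in stream:
        if item == separator:
            groups.append(current)
            current = []
        else:
            current.append(item)
    # Skip the empty trailing group when the stream ends with the separator.
    if current:
        groups.append(current)
    return groups


stream = ["a1", SEPARATOR, "b1", "b2", SEPARATOR]
groups = group_by_separator(stream, SEPARATOR)
```

Each inner list then holds the chunks of one source, matching the "Chunks for Document N" grouping in the output above.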
What are the limitations of CodeChunker?
While powerful, CodeChunker isn't magic! It assumes your code is reasonably well-behaved (syntactically conventional). Highly obfuscated, minified, or macro-generated sources might give it a headache. Also, nested docstrings or comment blocks can be a bit tricky for it to handle perfectly.
Inspiration: The Code Behind the Magic! ✨
The CodeChunker draws inspiration from various projects and concepts in the field of code analysis and segmentation. These influences have shaped its design principles and capabilities:
- code_chunker by Camel AI
- code_chunker by JimAiMoment
- whats_that_code by matthewdeanmartin
- CintraAI Code Chunker
API Reference
For complete technical details on the CodeChunker class, check out the API documentation.