
chunklet.code_chunker.code_chunker

Author: Speedyk-005 | Copyright (c) 2025 | License: MIT

Language-Agnostic Code Chunking Utility

This module provides a robust, convention-aware engine for segmenting source code into semantic units ("chunks") such as functions, classes, namespaces, and logical blocks. Unlike purely heuristic or grammar-dependent parsers, the CodeChunker relies on anchored, multi-language regex patterns and indentation rules to identify structures consistently across a variety of programming languages.

Limitations

CodeChunker assumes syntactically conventional code. Highly obfuscated, minified, or macro-generated sources may not fully respect its boundary patterns, though such cases fall outside its intended domain.

Inspired by
  • Camel.utils.chunker.CodeChunker (@ CAMEL-AI.org)
  • code-chunker by JimAiMoment
  • whats_that_code by matthewdeanmartin
  • CintraAI Code Chunker

Classes:

  • CodeChunker

    Language-agnostic code chunking utility for semantic code segmentation.

CodeChunker

CodeChunker(
    verbose: bool = False,
    token_counter: Callable[[str], int] | None = None,
)

Bases: BaseChunker

Language-agnostic code chunking utility for semantic code segmentation.

Extracts structural units (functions, classes, namespaces) from source code across multiple programming languages using pattern-based detection and token-aware segmentation.

Key Features
  • Cross-language support (Python, C/C++, Java, C#, JavaScript, Go, etc.)
  • Structural analysis with namespace hierarchy tracking
  • Configurable token limits with strict/lenient overflow handling
  • Flexible docstring and comment processing modes
  • Accurate line number preservation and source tracking
  • Parallel batch processing for multiple files
  • Comprehensive logging and progress tracking
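
A minimal quick-start sketch (assuming the class is importable from the module path shown above; per the parameter docs below, purely line-based limits should not require a token counter):

from chunklet.code_chunker.code_chunker import CodeChunker

source = '''
def greet(name):
    """Return a greeting."""
    return f"Hello, {name}!"

class Greeter:
    def __call__(self, name):
        return greet(name)
'''

chunker = CodeChunker()
# Chunk by line count alone; token limits would additionally need a token_counter.
for chunk in chunker.chunk(source, max_lines=5):
    print(chunk.tree, chunk.start_line, chunk.end_line)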

Initialize the CodeChunker with optional token counter and verbosity control.

Parameters:

  • verbose

    (bool, default: False ) –

    Enable verbose logging.

  • token_counter

    (Callable[[str], int] | None, default: None ) –

    Function that counts tokens in text. If None, must be provided when calling chunk() methods.
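
Any Callable[[str], int] can serve as a token counter. A minimal sketch using a whitespace split as a stand-in; a real deployment would typically wrap a model tokenizer such as tiktoken (that choice is an assumption, not a chunklet requirement):

def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())

chunker = CodeChunker(verbose=True, token_counter=count_tokens)

# The counter may also be left unset on the instance and supplied per call:
chunker = CodeChunker()
chunks = chunker.chunk("def f():\n    return 1\n", max_tokens=12, token_counter=count_tokens)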

Methods:

  • batch_chunk

    Process multiple source files or code strings in parallel.

  • chunk

    Extract semantic code chunks from source using multi-dimensional analysis.

Attributes:

  • verbose (bool) –

    Get the verbose setting.

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def __init__(
    self,
    verbose: bool = False,
    token_counter: Callable[[str], int] | None = None,
):
    """
    Initialize the CodeChunker with optional token counter and verbosity control.

    Args:
        verbose (bool): Enable verbose logging.
        token_counter (Callable[[str], int] | None): Function that counts tokens in text.
            If None, must be provided when calling chunk() methods.
    """
    self.token_counter = token_counter
    self._verbose = verbose
    self.extractor = CodeStructureExtractor(verbose=self._verbose)

verbose property writable

verbose: bool

Get the verbose setting.

batch_chunk

batch_chunk(
    sources: restricted_iterable(str | Path),
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[
        int | None, Field(ge=1)
    ] = None,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    include_comments: bool = True,
    docstring_mode: Literal[
        "summary", "all", "excluded"
    ] = "all",
    strict: bool = True,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Box, None, None]

Process multiple source files or code strings in parallel.

Leverages multiprocessing to efficiently chunk multiple code sources, applying consistent chunking rules across all inputs.

Parameters:

  • sources

    (restricted_iterable[str | Path]) –

    A restricted iterable of file paths or raw code strings to process.

  • max_tokens

    (int, default: None ) –

    Maximum tokens per chunk. Must be >= 12.

  • max_lines

    (int, default: None ) –

    Maximum number of lines per chunk. Must be >= 5.

  • max_functions

    (int, default: None ) –

    Maximum number of functions per chunk. Must be >= 1.

  • token_counter

    (Callable | None, default: None ) –

    Token counting function. Uses instance counter if None. Required for token-based chunking.

  • separator

    (Any, default: None ) –

A value yielded after each source's chunks have been produced, marking the boundary between sources. Note: None cannot be used as a separator.

  • include_comments

    (bool, default: True ) –

    Include comments in output chunks. Default: True.

  • docstring_mode

    (Literal['summary', 'all', 'excluded'], default: 'all' ) –

Docstring processing strategy:
  - "summary": Include only the first line of each docstring
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
Defaults to "all".

  • strict

    (bool, default: True ) –

    If True, raise error when structural blocks exceed max_tokens. If False, split oversized blocks. Default: True.

  • n_jobs

    (int | None, default: None ) –

    Number of parallel workers. Uses all available CPUs if None.

  • show_progress

    (bool, default: True ) –

    Display progress bar during processing. Defaults to True.

  • on_errors

    (Literal['raise', 'skip', 'break'], default: 'raise' ) –

    How to handle errors during processing. Defaults to 'raise'.

Yields:

  • Box ( Box ) –

Box object representing a chunk with its content and metadata. Includes:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in the original source
  - end_line (int): Ending line in the original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"

Raises:

  • InvalidInputError – Invalid input parameters.

  • MissingTokenCounterError – No token counter available.

  • FileProcessingError – Source file cannot be read.

  • TokenLimitError – Structural block exceeds max_tokens in strict mode.

  • CallbackError – If the token counter fails or returns an invalid type.
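
A usage sketch (the file paths are hypothetical); a non-None separator marks where one source's chunks end and the next begin in the flat generator output:

chunker = CodeChunker()
for item in chunker.batch_chunk(
    ["src/app.py", "src/utils.py"],  # hypothetical paths
    max_lines=40,
    separator="<END-OF-SOURCE>",     # any non-None value works
    n_jobs=2,
    on_errors="skip",                # skip unreadable sources instead of raising
    show_progress=False,
):
    if item == "<END-OF-SOURCE>":
        print("--- end of one source's chunks ---")
    else:
        print(item.source_path, item.start_line, item.end_line)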

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def batch_chunk(
    self,
    sources: restricted_iterable(str | Path),
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Box, None, None]:
    """
    Process multiple source files or code strings in parallel.

    Leverages multiprocessing to efficiently chunk multiple code sources,
    applying consistent chunking rules across all inputs.

    Args:
        sources (restricted_iterable[str | Path]): A restricted iterable of file paths or raw code strings to process.
        max_tokens (int, optional): Maximum tokens per chunk. Must be >= 12.
        max_lines (int, optional): Maximum number of lines per chunk. Must be >= 5.
        max_functions (int, optional): Maximum number of functions per chunk. Must be >= 1.
        token_counter (Callable | None): Token counting function. Uses instance
            counter if None. Required for token-based chunking.
        separator (Any): A value to be yielded after the chunks of each source are processed.
            Note: None cannot be used as a separator.
        include_comments (bool): Include comments in output chunks. Default: True.
        docstring_mode (Literal["summary", "all", "excluded"]): Docstring processing strategy:
            - "summary": Include only first line of docstrings
            - "all": Include complete docstrings
            - "excluded": Remove all docstrings
            Defaults to "all".
        strict (bool): If True, raise error when structural blocks exceed
            max_tokens. If False, split oversized blocks. Default: True.
        n_jobs (int | None): Number of parallel workers. Uses all available CPUs if None.
        show_progress (bool): Display progress bar during processing. Defaults to True.
        on_errors (Literal["raise", "skip", "break"]):
            How to handle errors during processing. Defaults to 'raise'.

    Yields:
        Box: A `Box` object representing a chunk with its content and metadata.
            Includes:
            - content (str): Code content
            - tree (str): Namespace hierarchy
            - start_line (int): Starting line in original source
            - end_line (int): Ending line in original source
            - span (tuple[int, int]): Character-level span (start and end offsets) in the original source.
            - source_path (str): Source file path or "N/A"

    Raises:
        InvalidInputError: Invalid input parameters.
        MissingTokenCounterError: No token counter available.
        FileProcessingError: Source file cannot be read.
        TokenLimitError: Structural block exceeds max_tokens in strict mode.
        CallbackError: If the token counter fails or returns an invalid type.
    """
    chunk_func = partial(
        self.chunk,
        max_tokens=max_tokens,
        max_lines=max_lines,
        max_functions=max_functions,
        token_counter=token_counter or self.token_counter,
        include_comments=include_comments,
        docstring_mode=docstring_mode,
        strict=strict,
    )

    yield from run_in_batch(
        func=chunk_func,
        iterable_of_args=sources,
        iterable_name="sources",
        separator=separator,
        n_jobs=n_jobs,
        show_progress=show_progress,
        on_errors=on_errors,
        verbose=self.verbose,
    )

chunk

chunk(
    source: str | Path,
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[
        int | None, Field(ge=1)
    ] = None,
    token_counter: Callable[[str], int] | None = None,
    include_comments: bool = True,
    docstring_mode: Literal[
        "summary", "all", "excluded"
    ] = "all",
    strict: bool = True
) -> list[Box]

Extract semantic code chunks from source using multi-dimensional analysis.

Processes source code by identifying structural boundaries (functions, classes, namespaces) and grouping content based on multiple constraints including tokens, lines, and logical units while preserving semantic coherence.

Parameters:

  • source

    (str | Path) –

    Raw code string or file path to process.

  • max_tokens

    (int, default: None ) –

    Maximum tokens per chunk. Must be >= 12.

  • max_lines

    (int, default: None ) –

    Maximum number of lines per chunk. Must be >= 5.

  • max_functions

    (int, default: None ) –

    Maximum number of functions per chunk. Must be >= 1.

  • token_counter

    (Callable, default: None ) –

    Token counting function. Uses instance counter if None. Required for token-based chunking.

  • include_comments

    (bool, default: True ) –

    Include comments in output chunks. Default: True.

  • docstring_mode

    (Literal['summary', 'all', 'excluded'], default: 'all' ) –

Docstring processing strategy:
  - "summary": Include only the first line of each docstring
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
Defaults to "all".

  • strict

    (bool, default: True ) –

    If True, raise error when structural blocks exceed max_tokens. If False, split oversized blocks. Default: True.

Returns:

  • list[Box]

list[Box]: List of code chunks with metadata. Each Box contains:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in the original source
  - end_line (int): Ending line in the original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"

Raises:

  • InvalidInputError – Invalid configuration parameters.

  • MissingTokenCounterError – No token counter available.

  • FileProcessingError – Source file cannot be read.

  • TokenLimitError – Structural block exceeds max_tokens in strict mode.

  • CallbackError – If the token counter fails or returns an invalid type.
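
A sketch of lenient, token-bounded chunking and the resulting Box metadata (the whitespace-based counter is a stand-in assumption):

chunker = CodeChunker(token_counter=lambda text: len(text.split()))
chunks = chunker.chunk(
    '''
class Vec:
    """A 2D vector.

    Supports addition.
    """

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)
''',
    max_tokens=50,
    docstring_mode="summary",  # keep only each docstring's first line
    strict=False,              # split oversized blocks instead of raising
)
for c in chunks:
    print(f"{c.tree} [{c.start_line}-{c.end_line}] span={c.span}")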

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def chunk(
    self,
    source: str | Path,
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True,
) -> list[Box]:
    """
    Extract semantic code chunks from source using multi-dimensional analysis.

    Processes source code by identifying structural boundaries (functions, classes,
    namespaces) and grouping content based on multiple constraints including
    tokens, lines, and logical units while preserving semantic coherence.

    Args:
        source (str | Path): Raw code string or file path to process.
        max_tokens (int, optional): Maximum tokens per chunk. Must be >= 12.
        max_lines (int, optional): Maximum number of lines per chunk. Must be >= 5.
        max_functions (int, optional): Maximum number of functions per chunk. Must be >= 1.
        token_counter (Callable, optional): Token counting function. Uses instance
            counter if None. Required for token-based chunking.
        include_comments (bool): Include comments in output chunks. Default: True.
        docstring_mode (Literal["summary", "all", "excluded"]): Docstring processing strategy:
            - "summary": Include only first line of docstrings
            - "all": Include complete docstrings
            - "excluded": Remove all docstrings
            Defaults to "all".
        strict (bool): If True, raise error when structural blocks exceed
            max_tokens. If False, split oversized blocks. Default: True.

    Returns:
        list[Box]: List of code chunks with metadata. Each Box contains:
            - content (str): Code content
            - tree (str): Namespace hierarchy
            - start_line (int): Starting line in original source
            - end_line (int): Ending line in original source
            - span (tuple[int, int]): Character-level span (start and end offsets) in the original source.
            - source_path (str): Source file path or "N/A"

    Raises:
        InvalidInputError: Invalid configuration parameters.
        MissingTokenCounterError: No token counter available.
        FileProcessingError: Source file cannot be read.
        TokenLimitError: Structural block exceeds max_tokens in strict mode.
        CallbackError: If the token counter fails or returns an invalid type.
    """
    self._validate_constraints(max_tokens, max_lines, max_functions, token_counter)

    # Adjust limits for internal use
    if max_tokens is None:
        max_tokens = sys.maxsize
    if max_lines is None:
        max_lines = sys.maxsize
    if max_functions is None:
        max_functions = sys.maxsize

    token_counter = token_counter or self.token_counter

    if isinstance(source, str) and not source.strip():
        self.log_info("Input source is empty. Returning empty list.")
        return []

    self.log_info(
        "Starting chunk processing for {}",
        (
            f"source: {source}"
            if isinstance(source, Path)
            or (isinstance(source, str) and is_path_like(source))
            else f"code starting with:\n```\n{source[:100]}...\n```\n"
        ),
    )

    snippet_dicts, cumulative_lengths = self.extractor.extract_code_structure(
        source, include_comments, docstring_mode
    )

    result_chunks = self._group_by_chunk(
        snippet_dicts=snippet_dicts,
        cumulative_lengths=cumulative_lengths,
        token_counter=token_counter,
        max_tokens=max_tokens,
        max_lines=max_lines,
        max_functions=max_functions,
        strict=strict,
        source=source,
    )

    self.log_info(
        "Generated {} chunk(s) for the {}",
        len(result_chunks),
        (
            f"source: {source}"
            if isinstance(source, Path)
            or (isinstance(source, str) and is_path_like(source))
            else f"code starting with:\n```\n{source[:100]}...\n```\n"
        ),
    )

    return result_chunks