
chunklet.code_chunker.code_chunker

Author: Speedyk-005 | Copyright (c) 2025 | License: MIT

Language-Agnostic Code Chunking Utility

This module provides a robust, convention-aware engine for segmenting source code into semantic units ("chunks") such as functions, classes, namespaces, and logical blocks. Unlike purely heuristic or grammar-dependent parsers, the CodeChunker relies on anchored, multi-language regex patterns and indentation rules to identify structures consistently across a variety of programming languages.

Limitations

CodeChunker assumes syntactically conventional code. Highly obfuscated, minified, or macro-generated sources may not fully respect its boundary patterns, though such cases fall outside its intended domain.

Inspired by
  • Camel.utils.chunker.CodeChunker (@ CAMEL-AI.org)
  • code-chunker by JimAiMoment
  • whats_that_code by matthewdeanmartin
  • CintraAI Code Chunker

Classes:

  • CodeChunker

    Language-agnostic code chunking utility for semantic code segmentation.

CodeChunker

CodeChunker(
    verbose: bool = False,
    token_counter: Callable[[str], int] | None = None,
)

Bases: BaseChunker

Language-agnostic code chunking utility for semantic code segmentation.

Extracts structural units (functions, classes, namespaces) from source code across multiple programming languages using pattern-based detection and token-aware segmentation.

Key Features
  • Cross-language support (Python, C/C++, Java, C#, JavaScript, Go, etc.)
  • Structural analysis with namespace hierarchy tracking
  • Configurable token limits with strict/lenient overflow handling
  • Flexible docstring and comment processing modes
  • Accurate line number preservation and source tracking
  • Parallel batch processing for multiple files
  • Comprehensive logging and progress tracking
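
A minimal quick-start sketch (assuming the class is importable from the module path shown above; per the parameter docs below, purely line-based limits should not require a token counter):

from chunklet.code_chunker.code_chunker import CodeChunker

source = '''
def greet(name):
    """Return a greeting."""
    return f"Hello, {name}!"

class Greeter:
    def __call__(self, name):
        return greet(name)
'''

chunker = CodeChunker()
# Chunk by line count alone; token limits would additionally need a token_counter.
for chunk in chunker.chunk(source, max_lines=5):
    print(chunk.tree, chunk.start_line, chunk.end_line)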

Initialize the CodeChunker with optional token counter and verbosity control.

Parameters:

  • verbose

    (bool, default: False ) –

    Enable verbose logging.

  • token_counter

    (Callable[[str], int] | None, default: None ) –

    Function that counts tokens in text. If None, must be provided when calling chunk() methods.
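
Any Callable[[str], int] can serve as a token counter. A minimal sketch using a whitespace split as a stand-in; a real deployment would typically wrap a model tokenizer such as tiktoken (that choice is an assumption, not a chunklet requirement):

def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())

chunker = CodeChunker(verbose=True, token_counter=count_tokens)

# The counter may also be left unset on the instance and supplied per call:
chunker = CodeChunker()
chunks = chunker.chunk("def f():\n    return 1\n", max_tokens=12, token_counter=count_tokens)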

Methods:

  • batch_chunk

    Process multiple source files or code strings in parallel.

  • chunk

    Extract semantic code chunks from source using multi-dimensional analysis.

Attributes:

  • verbose (bool) –

    Get the verbose setting.

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def __init__(
    self,
    verbose: bool = False,
    token_counter: Callable[[str], int] | None = None,
):
    """
    Initialize the CodeChunker with optional token counter and verbosity control.

    Args:
        verbose (bool): Enable verbose logging.
        token_counter (Callable[[str], int] | None): Function that counts tokens in text.
            If None, must be provided when calling chunk() methods.
    """
    self.token_counter = token_counter
    self._verbose = verbose
    self.extractor = CodeStructureExtractor(verbose=self._verbose)

verbose property writable

verbose: bool

Get the verbose setting.

batch_chunk

batch_chunk(
    sources: restricted_iterable(str | Path),
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[
        int | None, Field(ge=1)
    ] = None,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    include_comments: bool = True,
    docstring_mode: Literal[
        "summary", "all", "excluded"
    ] = "all",
    strict: bool = True,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Box, None, None]

Process multiple source files or code strings in parallel.

Leverages multiprocessing to efficiently chunk multiple code sources, applying consistent chunking rules across all inputs.

Parameters:

  • sources

    (restricted_iterable[str | Path]) –

    A restricted iterable of file paths or raw code strings to process.

  • max_tokens

    (int, default: None ) –

    Maximum tokens per chunk. Must be >= 12.

  • max_lines

    (int, default: None ) –

    Maximum number of lines per chunk. Must be >= 5.

  • max_functions

    (int, default: None ) –

    Maximum number of functions per chunk. Must be >= 1.

  • token_counter

    (Callable | None, default: None ) –

    Token counting function. Uses instance counter if None. Required for token-based chunking.

  • separator

    (Any, default: None ) –

A value yielded after each source's chunks have been produced, marking the boundary between sources. Note: None cannot be used as a separator.

  • include_comments

    (bool, default: True ) –

    Include comments in output chunks. Default: True.

  • docstring_mode

    (Literal['summary', 'all', 'excluded'], default: 'all' ) –

Docstring processing strategy:
  - "summary": Include only the first line of each docstring
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
Defaults to "all".

  • strict

    (bool, default: True ) –

    If True, raise error when structural blocks exceed max_tokens. If False, split oversized blocks. Default: True.

  • n_jobs

    (int | None, default: None ) –

    Number of parallel workers. Uses all available CPUs if None.

  • show_progress

    (bool, default: True ) –

    Display progress bar during processing. Defaults to True.

  • on_errors

    (Literal['raise', 'skip', 'break'], default: 'raise' ) –

    How to handle errors during processing. Defaults to 'raise'.

Yields:

  • Box ( Box ) –

Box object representing a chunk with its content and metadata. Includes:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in the original source
  - end_line (int): Ending line in the original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"

Raises:

  • InvalidInputError – Invalid input parameters.

  • MissingTokenCounterError – No token counter available.

  • FileProcessingError – Source file cannot be read.

  • TokenLimitError – Structural block exceeds max_tokens in strict mode.

  • CallbackError – If the token counter fails or returns an invalid type.
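
A usage sketch (the file paths are hypothetical); a non-None separator marks where one source's chunks end and the next begin in the flat generator output:

chunker = CodeChunker()
for item in chunker.batch_chunk(
    ["src/app.py", "src/utils.py"],  # hypothetical paths
    max_lines=40,
    separator="<END-OF-SOURCE>",     # any non-None value works
    n_jobs=2,
    on_errors="skip",                # skip unreadable sources instead of raising
    show_progress=False,
):
    if item == "<END-OF-SOURCE>":
        print("--- end of one source's chunks ---")
    else:
        print(item.source_path, item.start_line, item.end_line)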

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def batch_chunk(
    self,
    sources: restricted_iterable(str | Path),
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Box, None, None]:
    """
    Process multiple source files or code strings in parallel.

    Leverages multiprocessing to efficiently chunk multiple code sources,
    applying consistent chunking rules across all inputs.

    Args:
        sources (restricted_iterable[str | Path]): A restricted iterable of file paths or raw code strings to process.
        max_tokens (int, optional): Maximum tokens per chunk. Must be >= 12.
        max_lines (int, optional): Maximum number of lines per chunk. Must be >= 5.
        max_functions (int, optional): Maximum number of functions per chunk. Must be >= 1.
        token_counter (Callable | None): Token counting function. Uses instance
            counter if None. Required for token-based chunking.
        separator (Any): A value to be yielded after the chunks of each source are processed.
            Note: None cannot be used as a separator.
        include_comments (bool): Include comments in output chunks. Default: True.
        docstring_mode (Literal["summary", "all", "excluded"]): Docstring processing strategy:
            - "summary": Include only first line of docstrings
            - "all": Include complete docstrings
            - "excluded": Remove all docstrings
            Defaults to "all".
        strict (bool): If True, raise error when structural blocks exceed
            max_tokens. If False, split oversized blocks. Default: True.
        n_jobs (int | None): Number of parallel workers. Uses all available CPUs if None.
        show_progress (bool): Display progress bar during processing. Defaults to True.
        on_errors (Literal["raise", "skip", "break"]):
            How to handle errors during processing. Defaults to 'raise'.

    Yields:
        Box: A `Box` object representing a chunk with its content and metadata.
            Includes:
            - content (str): Code content
            - tree (str): Namespace hierarchy
            - start_line (int): Starting line in original source
            - end_line (int): Ending line in original source
            - span (tuple[int, int]): Character-level span (start and end offsets) in the original source.
            - source_path (str): Source file path or "N/A"

    Raises:
        InvalidInputError: Invalid input parameters.
        MissingTokenCounterError: No token counter available.
        FileProcessingError: Source file cannot be read.
        TokenLimitError: Structural block exceeds max_tokens in strict mode.
        CallbackError: If the token counter fails or returns an invalid type.
    """
    chunk_func = partial(
        self.chunk,
        max_tokens=max_tokens,
        max_lines=max_lines,
        max_functions=max_functions,
        token_counter=token_counter or self.token_counter,
        include_comments=include_comments,
        docstring_mode=docstring_mode,
        strict=strict,
    )

    yield from run_in_batch(
        func=chunk_func,
        iterable_of_args=sources,
        iterable_name="sources",
        separator=separator,
        n_jobs=n_jobs,
        show_progress=show_progress,
        on_errors=on_errors,
        verbose=self.verbose,
    )

chunk

chunk(
    source: str | Path,
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[
        int | None, Field(ge=1)
    ] = None,
    token_counter: Callable[[str], int] | None = None,
    include_comments: bool = True,
    docstring_mode: Literal[
        "summary", "all", "excluded"
    ] = "all",
    strict: bool = True
) -> list[Box]

Extract semantic code chunks from source using multi-dimensional analysis.

Processes source code by identifying structural boundaries (functions, classes, namespaces) and grouping content based on multiple constraints including tokens, lines, and logical units while preserving semantic coherence.

Parameters:

  • source

    (str | Path) –

    Raw code string or file path to process.

  • max_tokens

    (int, default: None ) –

    Maximum tokens per chunk. Must be >= 12.

  • max_lines

    (int, default: None ) –

    Maximum number of lines per chunk. Must be >= 5.

  • max_functions

    (int, default: None ) –

    Maximum number of functions per chunk. Must be >= 1.

  • token_counter

    (Callable, default: None ) –

    Token counting function. Uses instance counter if None. Required for token-based chunking.

  • include_comments

    (bool, default: True ) –

    Include comments in output chunks. Default: True.

  • docstring_mode

    (Literal['summary', 'all', 'excluded'], default: 'all' ) –

Docstring processing strategy:
  - "summary": Include only the first line of each docstring
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
Defaults to "all".

  • strict

    (bool, default: True ) –

    If True, raise error when structural blocks exceed max_tokens. If False, split oversized blocks. Default: True.

Returns:

  • list[Box]

list[Box]: List of code chunks with metadata. Each Box contains:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in the original source
  - end_line (int): Ending line in the original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"

Raises:

  • InvalidInputError – Invalid configuration parameters.

  • MissingTokenCounterError – No token counter available.

  • FileProcessingError – Source file cannot be read.

  • TokenLimitError – Structural block exceeds max_tokens in strict mode.

  • CallbackError – If the token counter fails or returns an invalid type.
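
A sketch of lenient, token-bounded chunking and the resulting Box metadata (the whitespace-based counter is a stand-in assumption):

chunker = CodeChunker(token_counter=lambda text: len(text.split()))
chunks = chunker.chunk(
    '''
class Vec:
    """A 2D vector.

    Supports addition.
    """

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        return Vec(self.x + other.x, self.y + other.y)
''',
    max_tokens=50,
    docstring_mode="summary",  # keep only each docstring's first line
    strict=False,              # split oversized blocks instead of raising
)
for c in chunks:
    print(f"{c.tree} [{c.start_line}-{c.end_line}] span={c.span}")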

Source code in src/chunklet/code_chunker/code_chunker.py
@validate_input
def chunk(
    self,
    source: str | Path,
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True,
) -> list[Box]:
    """
    Extract semantic code chunks from source using multi-dimensional analysis.

    Processes source code by identifying structural boundaries (functions, classes,
    namespaces) and grouping content based on multiple constraints including
    tokens, lines, and logical units while preserving semantic coherence.

    Args:
        source (str | Path): Raw code string or file path to process.
        max_tokens (int, optional): Maximum tokens per chunk. Must be >= 12.
        max_lines (int, optional): Maximum number of lines per chunk. Must be >= 5.
        max_functions (int, optional): Maximum number of functions per chunk. Must be >= 1.
        token_counter (Callable, optional): Token counting function. Uses instance
            counter if None. Required for token-based chunking.
        include_comments (bool): Include comments in output chunks. Default: True.
        docstring_mode (Literal["summary", "all", "excluded"]): Docstring processing strategy:
            - "summary": Include only first line of docstrings
            - "all": Include complete docstrings
            - "excluded": Remove all docstrings
            Defaults to "all".
        strict (bool): If True, raise error when structural blocks exceed
            max_tokens. If False, split oversized blocks. Default: True.

    Returns:
        list[Box]: List of code chunks with metadata. Each Box contains:
            - content (str): Code content
            - tree (str): Namespace hierarchy
            - start_line (int): Starting line in original source
            - end_line (int): Ending line in original source
            - span (tuple[int, int]): Character-level span (start and end offsets) in the original source.
            - source_path (str): Source file path or "N/A"

    Raises:
        InvalidInputError: Invalid configuration parameters.
        MissingTokenCounterError: No token counter available.
        FileProcessingError: Source file cannot be read.
        TokenLimitError: Structural block exceeds max_tokens in strict mode.
        CallbackError: If the token counter fails or returns an invalid type.
    """
    self._validate_constraints(max_tokens, max_lines, max_functions, token_counter)

    # Adjust limits for internal use
    if max_tokens is None:
        max_tokens = sys.maxsize
    if max_lines is None:
        max_lines = sys.maxsize
    if max_functions is None:
        max_functions = sys.maxsize

    token_counter = token_counter or self.token_counter

    if isinstance(source, str) and not source.strip():
        self.log_info("Input source is empty. Returning empty list.")
        return []

    self.log_info(
        "Starting chunk processing for {}",
        (
            f"source: {source}"
            if isinstance(source, Path)
            or (isinstance(source, str) and is_path_like(source))
            else f"code starting with:\n```\n{source[:100]}...\n```\n"
        ),
    )

    snippet_dicts, cumulative_lengths = self.extractor.extract_code_structure(
        source, include_comments, docstring_mode
    )

    result_chunks = self._group_by_chunk(
        snippet_dicts=snippet_dicts,
        cumulative_lengths=cumulative_lengths,
        token_counter=token_counter,
        max_tokens=max_tokens,
        max_lines=max_lines,
        max_functions=max_functions,
        strict=strict,
        source=source,
    )

    self.log_info(
        "Generated {} chunk(s) for the {}",
        len(result_chunks),
        (
            f"source: {source}"
            if isinstance(source, Path)
            or (isinstance(source, str) and is_path_like(source))
            else f"code starting with:\n```\n{source[:100]}...\n```\n"
        ),
    )

    return result_chunks