chunklet.code_chunker.code_chunker
Author: Speedyk-005 | Copyright (c) 2025 | License: MIT
Language-Agnostic Code Chunking Utility
This module provides a robust, convention-aware engine for segmenting source code into
semantic units ("chunks") such as functions, classes, namespaces, and logical blocks.
Unlike purely heuristic or grammar-dependent parsers, the CodeChunker relies on
anchored, multi-language regex patterns and indentation rules to identify structures
consistently across a variety of programming languages.
Limitations
CodeChunker assumes syntactically conventional code. Highly obfuscated, minified,
or macro-generated sources may not fully respect its boundary patterns, though such
cases fall outside its intended domain.
Inspired by
- Camel.utils.chunker.CodeChunker (@ CAMEL-AI.org)
- code-chunker by JimAiMoment
- whats_that_code by matthewdeanmartin
- CintraAI Code Chunker
Classes:
- CodeChunker: Language-agnostic code chunking utility for semantic code segmentation.
CodeChunker
Bases: BaseChunker
Language-agnostic code chunking utility for semantic code segmentation.
Extracts structural units (functions, classes, namespaces) from source code across multiple programming languages using pattern-based detection and token-aware segmentation.
Key Features
- Cross-language support (Python, C/C++, Java, C#, JavaScript, Go, etc.)
- Structural analysis with namespace hierarchy tracking
- Configurable token limits with strict/lenient overflow handling
- Flexible docstring and comment processing modes
- Accurate line number preservation and source tracking
- Parallel batch processing for multiple files
- Comprehensive logging and progress tracking
Initialize the CodeChunker with optional token counter and verbosity control.
Parameters:
- verbose (bool, default: False): Enable verbose logging.
- token_counter (Callable[[str], int] | None, default: None): Function that counts the tokens in a text. If None, a counter must be passed when calling chunk() or batch_chunk().
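As an illustration of the Callable[[str], int] contract described above, here is a minimal sketch of a token counter. The name simple_token_counter and its regex-based counting rule are assumptions made for this example, not part of the library; any function that takes a string and returns an int satisfies the same contract.

```python
import re
from typing import Callable


def simple_token_counter(text: str) -> int:
    """Approximate a token count (hypothetical helper, not part of the library).

    Counts each run of word characters and each punctuation symbol as one
    token. Real tokenizers will produce different numbers.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


# The function matches the Callable[[str], int] type the parameter expects.
token_counter: Callable[[str], int] = simple_token_counter
```

A function of this shape can be supplied once through the constructor's token_counter parameter or per call through the token_counter parameter of chunk() and batch_chunk().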
Methods:
- batch_chunk: Process multiple source files or code strings in parallel.
- chunk: Extract semantic code chunks from source using multi-dimensional analysis.
Attributes:
- verbose (bool): Get the verbose setting.
Source code in src/chunklet/code_chunker/code_chunker.py
batch_chunk
batch_chunk(
    sources: restricted_iterable(str | Path),
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    separator: Any = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True,
    n_jobs: Annotated[int, Field(ge=1)] | None = None,
    show_progress: bool = True,
    on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Box, None, None]
Process multiple source files or code strings in parallel.
Leverages multiprocessing to efficiently chunk multiple code sources, applying consistent chunking rules across all inputs.
Parameters:
- sources (restricted_iterable[str | Path]): A restricted iterable of file paths or raw code strings to process.
- max_tokens (int, default: None): Maximum tokens per chunk. Must be >= 12.
- max_lines (int, default: None): Maximum number of lines per chunk. Must be >= 5.
- max_functions (int, default: None): Maximum number of functions per chunk. Must be >= 1.
- token_counter (Callable | None, default: None): Token counting function. Uses the instance counter if None. Required for token-based chunking.
- separator (Any, default: None): A value yielded after the chunks of each source have been processed. Note: None cannot be used as a separator, since it is also the default value.
- include_comments (bool, default: True): Include comments in output chunks.
- docstring_mode (Literal['summary', 'all', 'excluded'], default: 'all'): Docstring processing strategy:
  - "summary": Include only the first line of docstrings
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
- strict (bool, default: True): If True, raise an error when a structural block exceeds max_tokens. If False, split oversized blocks.
- n_jobs (int | None, default: None): Number of parallel workers. Uses all available CPUs if None.
- show_progress (bool, default: True): Display a progress bar during processing.
- on_errors (Literal['raise', 'skip', 'break'], default: 'raise'): How to handle errors during processing.
Yields:
- Box: A Box object representing a chunk with its content and metadata. Includes:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in original source
  - end_line (int): Ending line in original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"
Raises:
- InvalidInputError: Invalid input parameters.
- MissingTokenCounterError: No token counter available.
- FileProcessingError: Source file cannot be read.
- TokenLimitError: Structural block exceeds max_tokens in strict mode.
- CallbackError: If the token counter fails or returns an invalid type.
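Because batch_chunk yields one flat stream of chunks, the separator parameter is what lets a caller tell where one source ends and the next begins. The sketch below shows one way to regroup such a stream. The helper group_by_separator is hypothetical and not part of the library, and it assumes the separator value compares unequal to every chunk.

```python
from typing import Any, Iterable, Iterator


def group_by_separator(stream: Iterable[Any], separator: Any) -> Iterator[list[Any]]:
    """Regroup a flat stream into one list per source (hypothetical helper).

    Assumes the separator is yielded after the items of each source and
    never compares equal to a real item.
    """
    group: list[Any] = []
    for item in stream:
        if item == separator:
            # End of the current source: emit its items, start a new group.
            yield group
            group = []
        else:
            group.append(item)
    if group:
        # Trailing items with no closing separator are still emitted.
        yield group
```

With a stand-in stream such as ["a", "b", "<END>", "c", "<END>"] and the separator "<END>", the helper produces [["a", "b"], ["c"]]. The same pattern would apply to the generator returned by batch_chunk when a non-None separator is passed.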
chunk
chunk(
    source: str | Path,
    *,
    max_tokens: Annotated[int | None, Field(ge=12)] = None,
    max_lines: Annotated[int | None, Field(ge=5)] = None,
    max_functions: Annotated[int | None, Field(ge=1)] = None,
    token_counter: Callable[[str], int] | None = None,
    include_comments: bool = True,
    docstring_mode: Literal["summary", "all", "excluded"] = "all",
    strict: bool = True
) -> list[Box]
Extract semantic code chunks from source using multi-dimensional analysis.
Processes source code by identifying structural boundaries (functions, classes, namespaces) and grouping content based on multiple constraints including tokens, lines, and logical units while preserving semantic coherence.
Parameters:
- source (str | Path): Raw code string or file path to process.
- max_tokens (int, default: None): Maximum tokens per chunk. Must be >= 12.
- max_lines (int, default: None): Maximum number of lines per chunk. Must be >= 5.
- max_functions (int, default: None): Maximum number of functions per chunk. Must be >= 1.
- token_counter (Callable, default: None): Token counting function. Uses the instance counter if None. Required for token-based chunking.
- include_comments (bool, default: True): Include comments in output chunks.
- docstring_mode (Literal['summary', 'all', 'excluded'], default: 'all'): Docstring processing strategy:
  - "summary": Include only the first line of docstrings
  - "all": Include complete docstrings
  - "excluded": Remove all docstrings
- strict (bool, default: True): If True, raise an error when a structural block exceeds max_tokens. If False, split oversized blocks.
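To make the three docstring_mode values concrete, the following sketch applies each mode to a single docstring. The function apply_docstring_mode is a hypothetical illustration of the behavior described above, not the library's implementation, and it treats the "first line" as the first non-blank line.

```python
def apply_docstring_mode(docstring: str, mode: str = "all") -> str:
    """Illustrate the documented docstring modes (hypothetical helper)."""
    if mode == "all":
        # Keep the complete docstring unchanged.
        return docstring
    if mode == "excluded":
        # Drop the docstring entirely.
        return ""
    if mode == "summary":
        # Keep only the first non-blank line.
        for line in docstring.splitlines():
            if line.strip():
                return line.strip()
        return ""
    raise ValueError(f"unknown docstring mode: {mode!r}")
```

For the docstring "Add two numbers.\n\nArgs:\n    a: first value", the "summary" mode keeps only "Add two numbers.", while "excluded" returns an empty string.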
Returns:
- list[Box]: List of code chunks with metadata. Each Box contains:
  - content (str): Code content
  - tree (str): Namespace hierarchy
  - start_line (int): Starting line in original source
  - end_line (int): Ending line in original source
  - span (tuple[int, int]): Character-level span (start and end offsets) in the original source
  - source_path (str): Source file path or "N/A"
Raises:
- InvalidInputError: Invalid configuration parameters.
- MissingTokenCounterError: No token counter available.
- FileProcessingError: Source file cannot be read.
- TokenLimitError: Structural block exceeds max_tokens in strict mode.
- CallbackError: If the token counter fails or returns an invalid type.
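The strict flag decides what happens when a single structural block is larger than max_tokens: raise an error, or split the block. The sketch below illustrates that policy only. The names fit_block and BlockTooLargeError are hypothetical, and the greedy line-by-line split is an assumption chosen for simplicity; the library's actual splitting strategy is not described in this section.

```python
from typing import Callable


class BlockTooLargeError(Exception):
    """Hypothetical stand-in for the TokenLimitError named above."""


def fit_block(
    block: str,
    max_tokens: int,
    token_counter: Callable[[str], int],
    strict: bool = True,
) -> list[str]:
    """Return the block as-is, raise, or split it (illustrative policy only)."""
    if token_counter(block) <= max_tokens:
        return [block]
    if strict:
        raise BlockTooLargeError(f"block exceeds {max_tokens} tokens")

    # Lenient mode (assumed strategy): pack lines greedily into pieces.
    pieces: list[str] = []
    current: list[str] = []
    for line in block.splitlines():
        candidate = "\n".join(current + [line])
        if current and token_counter(candidate) > max_tokens:
            pieces.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        pieces.append("\n".join(current))
    return pieces
```

Note that in this sketch a single line longer than max_tokens is still emitted as its own piece, since there is no smaller unit to split on.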