chunklet.code_chunker._code_structure_extractor

Internal module for extracting code structures from source code files.

Provides functionality to parse and analyze code syntax trees, identifying functions, classes, namespaces, and other structural elements. This module is used by CodeChunker to understand code structure before splitting into chunks.

Classes:

CodeStructureExtractor

CodeStructureExtractor(verbose: bool = False)

Extracts structural units from source code.

This class provides functionality to parse source code files and identify functions, classes, namespaces, and other structural elements using a language-agnostic approach.

Methods:

extract_code_structure

Source code in src/chunklet/code_chunker/_code_structure_extractor.py
@validate_input
def __init__(self, verbose: bool = False):
    self.verbose = verbose

extract_code_structure

extract_code_structure(
    code: str,
    include_comments: bool,
    docstring_mode: str,
    is_python_code: bool = False,
) -> tuple[list[dict], tuple[int, ...]]

Preprocess and parse code into individual snippet boxes.

This function-first extraction identifies functions as primary units while implicitly handling other structures within the function context.

Parameters:

  • code

    (str) –

    Raw code string.

  • include_comments

    (bool) –

    Whether to include comments in output.

  • docstring_mode

    (Literal["summary", "all", "excluded"]) –

    How to handle docstrings.

  • is_python_code

    (bool, default: False) –

    Whether the code is Python.

Returns:

  • tuple[list[dict], tuple[int, ...]]

    A tuple containing the list of extracted code structure boxes and the cumulative line lengths.
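
The function-first extraction described above can be illustrated with a much smaller, standalone sketch. It is not the library's implementation: `FUNC_START`, `make_block`, `split_into_blocks`, and the dictionary keys are hypothetical names chosen for illustration, and the sketch ignores annotations, comments, and docstring handling. It only shows the core idea of closing a block when a non-blank line returns to the block's indentation level.

```python
import re

# Hypothetical pattern; the library's real FUNCTION_DECLARATION is not shown on this page.
FUNC_START = re.compile(r"^\s*(?:async\s+)?def\s+\w+")


def make_block(lines: list[str], start_line: int) -> dict:
    """Bundle accumulated lines into a dict (keys are illustrative only)."""
    return {
        "content": "\n".join(lines),
        "start_line": start_line,
        "end_line": start_line + len(lines) - 1,
        "is_function": bool(FUNC_START.match(lines[0])),
    }


def split_into_blocks(code: str) -> list[dict]:
    """Group lines into blocks using indentation alone."""
    blocks: list[dict] = []
    current: list[str] = []
    start_line = 1
    block_indent = 0

    for line_no, line in enumerate(code.splitlines(), start=1):
        indent = len(line) - len(line.lstrip())

        if not current:  # fresh block: remember where it starts
            current = [line]
            start_line = line_no
            block_indent = indent
            continue

        # A non-blank line at or below the block's indent closes the block.
        if line.strip() and indent <= block_indent:
            blocks.append(make_block(current, start_line))
            current = []
            start_line = line_no
            block_indent = indent

        current.append(line)

    if current:  # flush the last block
        blocks.append(make_block(current, start_line))
    return blocks


sample = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
for block in split_into_blocks(sample):
    print(block["start_line"], block["end_line"], block["is_function"])
```

In brace-delimited languages a closing `}` sits at the block's own indentation and would end the block too early under this rule, which is presumably why the real method also checks its `OPENER` and `CLOSER` patterns before flushing.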

Source code in src/chunklet/code_chunker/_code_structure_extractor.py
def extract_code_structure(
    self,
    code: str,
    include_comments: bool,
    docstring_mode: str,
    is_python_code: bool = False,
) -> tuple[list[dict], tuple[int, ...]]:
    """
    Preprocess and parse code into individual snippet boxes.

    This function-first extraction identifies functions as primary units
    while implicitly handling other structures within the function context.

    Args:
        code (str): Raw code string.
        include_comments (bool): Whether to include comments in output.
        docstring_mode (Literal["summary", "all", "excluded"]): How to handle docstrings.
        is_python_code (bool): Whether the code is Python.

    Returns:
        tuple[list[dict], tuple[int, ...]]: A tuple containing the list of extracted code structure boxes and the line lengths.
    """
    if not code:
        return [], ()

    code, cumulative_lengths = self._preprocess(
        code, include_comments, docstring_mode
    )

    state = {
        "curr_struct": [],
        "block_indent_level": 0,
        "snippet_dicts": [],
    }
    buffer = defaultdict(list)

    for line_no, line in enumerate(code.splitlines(), start=1):
        indent_level = len(line) - len(line.lstrip())

        # Detect annotated lines
        matched = re.search(r"\(-- ([A-Z]+) -->\) ", line)
        if matched:
            self._handle_annotated_line(
                line=line,
                line_no=line_no,
                matched=matched,
                buffer=buffer,
                state=state,
            )
            continue

        if buffer["STR"]:
            self._flush_snippet([], state["snippet_dicts"], buffer)

        # -- Manage block accumulation logic --

        func_start = FUNCTION_DECLARATION.match(line)
        func_start = func_start.group(0) if func_start else None

        if not state["curr_struct"]:  # Fresh block
            state["curr_struct"] = [
                CodeLine(line_no, line, indent_level, func_start)
            ]
            state["block_indent_level"] = indent_level
            continue

        # Block start is triggered by function or namespace identification.
        # You might think it is in the wrong place, but it isn't.
        self._handle_block_start(
            line=line,
            indent_level=indent_level,
            buffer=buffer,
            state=state,
            code=code,
            func_start=func_start,
            is_python_code=is_python_code,
        )

        if (
            line.strip()
            and indent_level <= state["block_indent_level"]
            and not (OPENER.match(line) or CLOSER.match(line))
        ):  # Block end
            state["block_indent_level"] = indent_level
            self._flush_snippet(
                state["curr_struct"], state["snippet_dicts"], buffer
            )

        state["curr_struct"].append(
            CodeLine(line_no, line, indent_level, func_start)
        )

    # Append last snippet
    if state["curr_struct"]:
        self._flush_snippet(state["curr_struct"], state["snippet_dicts"], buffer)

    snippet_dicts = self._post_processing(state["snippet_dicts"])
    log_info(
        self.verbose, "Extracted {} structural blocks from code", len(snippet_dicts)
    )

    return snippet_dicts, cumulative_lengths
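
The two per-line measurements at the top of the loop, the indentation level and the annotation match, can be tried in isolation. The following is a minimal sketch that reuses the regular expression and the indentation arithmetic from the listing above; the helper name `classify_line` and the sample lines are assumptions made for illustration, and the `STR` tag is taken from the `buffer["STR"]` key in the listing without knowing the exact annotation format the preprocessor emits.

```python
import re
from typing import Optional

# Same pattern as in the listing: "(-- TAG -->) " with an uppercase tag.
ANNOTATION = re.compile(r"\(-- ([A-Z]+) -->\) ")


def classify_line(line: str) -> tuple[int, Optional[str]]:
    """Return (indent_level, annotation_tag) for one line.

    indent_level counts leading whitespace characters, so a tab counts as 1.
    annotation_tag is None when the line carries no annotation marker.
    """
    indent_level = len(line) - len(line.lstrip())
    matched = ANNOTATION.search(line)
    return indent_level, (matched.group(1) if matched else None)


print(classify_line("    (-- STR -->) some docstring text"))
print(classify_line("def run():"))
```

Because `re.search` is used rather than `re.match`, the marker is found even when it is preceded by indentation, so the indentation level and the tag can be read from the same line.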