
chunklet.document_chunker.span_finder

Classes:

DeterministicSpanFinder

DeterministicSpanFinder(text: str)

Find a substring span within full text, ignoring non-alphanumeric characters.

This is a deterministic alternative to regex-based span finding. It is roughly 2x faster than the regex-based approach because it avoids backtracking and complex pattern matching.

Initialize the span finder.

Parameters:

  • text

    (str) –

    The full text to search within.

Methods:

  • find_span

    Find the start and end indices of a substring within the original text.

Source code in src/chunklet/document_chunker/span_finder.py
def __init__(self, text: str):
    """
    Initialize the span finder.

    Args:
        text: The full text to search within.
    """
    self.full_text = text
    self.cleaned_full_text, self.index_map = self._build_index_map(text)
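
The constructor relies on `_build_index_map`, whose body is not shown on this page. The sketch below is one plausible way such a helper could work, assuming it keeps only alphanumeric characters and records the original position of each kept character. The function name `build_index_map` and its exact return shape are assumptions for illustration, not the library's code.

```python
def build_index_map(text: str) -> tuple[str, list[int]]:
    """Hypothetical stand-in for `_build_index_map` (the real body is not shown)."""
    cleaned_chars: list[str] = []
    index_map: list[int] = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            cleaned_chars.append(ch)
            # index_map[k] is the position in `text` of the k-th kept character.
            index_map.append(i)
    return "".join(cleaned_chars), index_map


cleaned, index_map = build_index_map("Hi, you!")
print(cleaned)    # Hiyou
print(index_map)  # [0, 1, 4, 5, 6]
```

With a map of this shape, a match position in the cleaned text can be translated back to an offset in the original text with a single list lookup.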

find_span

find_span(text: str) -> tuple[int, int]

Find the start and end indices of a substring within the original text.

The search is performed in two stages:

  1. Exact match on the original text.
  2. Normalized alphanumeric match if the exact match fails.

Parameters:

  • text

    (str) –

    The query substring.

Returns:

  • tuple[int, int] –

    The start and end indices of the match in the original text, or (-1, -1) if no match is found.
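
To make the two-stage behaviour and the return values concrete, here is a minimal, self-contained sketch of the same idea as a standalone function. It is not the library's implementation: `find_span_sketch` is a hypothetical name, and the sketch assumes the end index is exclusive and is derived from the position of the last matched character in the original text.

```python
def find_span_sketch(full_text: str, query: str) -> tuple[int, int]:
    """Illustrative two-stage lookup; not the library's implementation."""
    # Stage 1: exact match on the original text.
    stripped = query.strip()
    start = full_text.find(stripped)
    if start >= 0:
        return start, start + len(stripped)

    # Stage 2: compare alphanumeric characters only, remembering where each
    # kept character sits in the original text.
    kept = [(i, ch) for i, ch in enumerate(full_text) if ch.isalnum()]
    cleaned_full = "".join(ch for _, ch in kept)
    cleaned_query = "".join(ch for ch in query if ch.isalnum())
    if not cleaned_query:
        return -1, -1
    pos = cleaned_full.find(cleaned_query)
    if pos < 0:
        return -1, -1
    first = kept[pos][0]
    last = kept[pos + len(cleaned_query) - 1][0]
    return first, last + 1  # exclusive end, suitable for slicing


doc = "Hello, world! This is   chunk-one."
exact = find_span_sketch(doc, "world!")
fuzzy = find_span_sketch(doc, "This is chunk one")
missing = find_span_sketch(doc, "absent")

print(exact)                     # (7, 13)
print(doc[fuzzy[0]:fuzzy[1]])    # This is   chunk-one
print(missing)                   # (-1, -1)
```

The second query differs from the document in whitespace and punctuation, so the exact stage fails and the alphanumeric stage finds it. Callers should check for `(-1, -1)` before slicing.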

Source code in src/chunklet/document_chunker/span_finder.py
def find_span(self, text: str) -> tuple[int, int]:
    """
    Find the start and end indices of a substring within the original text.

    The search is performed in two stages:
    1. Exact match on the original text.
    2. Normalized alphanumeric match if exact match fails.

    Args:
        text: The query substring.

    Returns:
        A tuple of the start and end indices in the original text.
        (-1, -1) is returned if no match is found.
    """
    stripped = text.strip()

    if stripped in self.full_text:
        start = self.full_text.find(stripped)
        return start, start + len(stripped)

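    # Fall back to comparing alphanumeric characters only.
    cleaned_text = "".join(ch for ch in text if ch.isalnum())

    # Skip the fallback for an empty query, which would otherwise "match" at 0.
    if cleaned_text:
        pos = self.cleaned_full_text.find(cleaned_text)
        if pos >= 0:
            start = self.index_map[pos]
            # Map the last matched character back to the original text so that
            # characters skipped inside the match are included in the span.
            end = self.index_map[pos + len(cleaned_text) - 1] + 1
            return start, end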

    return -1, -1