chunklet.document_chunker.span_finder
Classes:
- DeterministicSpanFinder – Find a substring span within full text, ignoring non-alphanumeric characters.
DeterministicSpanFinder
Find a substring span within full text, ignoring non-alphanumeric characters.
This is a deterministic alternative to regex-based span finding. It provides a roughly 2x performance improvement by avoiding backtracking and complex pattern matching.
Initialize the span finder.
Parameters:
- text (str) – The full text to search within.
Methods:
- find_span – Find the start and end indices of a substring within the original text.
Source code in src/chunklet/document_chunker/span_finder.py
find_span
Find the start and end indices of a substring within the original text.
The search is performed in two stages:
1. Exact match on the original text.
2. Normalized alphanumeric match if the exact match fails.
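The two-stage search can be illustrated with a minimal sketch. This is not chunklet's actual source: the class name `SpanFinderSketch`, the parameter names, the index map, and the choice of an exclusive end index are all assumptions made for illustration.

```python
from __future__ import annotations


class SpanFinderSketch:
    """Hypothetical stand-in for DeterministicSpanFinder (illustration only)."""

    def __init__(self, full_text: str) -> None:
        self.full_text = full_text
        # Positions of every alphanumeric character in the original text.
        self._positions = [i for i, ch in enumerate(full_text) if ch.isalnum()]
        # The same characters joined together, so that index k in this string
        # corresponds to self._positions[k] in the original text.
        self._normalized = "".join(full_text[i] for i in self._positions)

    def find_span(self, query: str) -> tuple[int, int]:
        if not query:
            return (-1, -1)

        # Stage 1: exact match on the original text.
        start = self.full_text.find(query)
        if start != -1:
            return (start, start + len(query))

        # Stage 2: compare alphanumeric characters only.
        needle = "".join(ch for ch in query if ch.isalnum())
        if not needle:
            return (-1, -1)
        hit = self._normalized.find(needle)
        if hit == -1:
            return (-1, -1)

        # Map the normalized hit back to original indices (end is exclusive here).
        first = self._positions[hit]
        last = self._positions[hit + len(needle) - 1]
        return (first, last + 1)


text = "Hello,   world!\nThis is -- a test."
finder = SpanFinderSketch(text)
start, end = finder.find_span("world! This is a test")
print(repr(text[start:end]))
```

Because both stages in this sketch rely on plain `str.find`, there is no pattern compilation or backtracking involved. The index map is built once per text, so repeated lookups against the same text reuse it.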
Parameters:
- text (str) – The query substring.
Returns:
- tuple[int, int] – A tuple of the start and end indices in the original text. (-1, -1) is returned if no match is found.
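Callers should check for the `(-1, -1)` sentinel before slicing, because slicing a Python string with `[-1:-1]` silently yields an empty string instead of raising an error. The helper below is a hypothetical sketch: `slice_span` and `NO_MATCH` are illustrative names, and it assumes the end index is exclusive, which the documentation above does not state.

```python
from __future__ import annotations

NO_MATCH = (-1, -1)  # sentinel documented for "no match found"


def slice_span(full_text: str, span: tuple[int, int]) -> str | None:
    """Return the text covered by span, or None for the no-match sentinel."""
    if span == NO_MATCH:
        return None
    start, end = span
    # Assumption: end is exclusive, as in ordinary Python slicing.
    return full_text[start:end]


print(slice_span("alpha beta", (6, 10)))
print(slice_span("alpha beta", NO_MATCH))
```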