chunklet.sentence_splitter._fallback_splitter
Classes:
- FallbackSplitter – Rule-based, language-agnostic sentence boundary detector.
FallbackSplitter
Rule-based, language-agnostic sentence boundary detector.
A rule-based sentence boundary detection tool that doesn't rely on hardcoded lists of abbreviations or sentence terminators, making it adaptable to various text formats and domains.
FallbackSplitter uses regex patterns to split text into sentences, handling:
- Common sentence-ending punctuation (., !, ?)
- Abbreviations and acronyms (e.g., Dr., Ph.D., U.S.)
- Numbered lists and headings
- Multi-punctuation sequences (e.g., ! ! !, ?!)
- Line breaks and whitespace normalization
- Decimal numbers and inline numbers
Sentences are conservatively segmented, prioritizing context over aggressive splitting, which reduces false splits inside abbreviations, multi-punctuation sequences, or numeric constructs.
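The conservative behaviour described above can be illustrated with a minimal sketch. It is not the FallbackSplitter source: the names `split_sentences`, `_BOUNDARY`, and `_ABBREV_TAIL` are hypothetical, and the abbreviation check is an assumed shape-based heuristic (no word lists) chosen only to show how a splitter can prefer under-splitting to over-splitting.

```python
import re
from typing import List

# Candidate boundary: whitespace that follows ., ! or ? and precedes
# an uppercase letter or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')

# Assumed heuristic: text ending in a single letter plus a period ("U.S.",
# "Ph.D.") or a short capitalized token plus a period ("Dr.", "Mrs.")
# is treated as an abbreviation, so no split is made after it.
_ABBREV_TAIL = re.compile(r'(?:\b[A-Za-z]|\b[A-Z][a-z]{1,2})\.$')


def split_sentences(text: str) -> List[str]:
    """Split text at candidate boundaries, merging pieces after abbreviations."""
    # Collapse line breaks and runs of whitespace into single spaces.
    text = re.sub(r'\s+', ' ', text).strip()
    if not text:
        return []
    sentences: List[str] = []
    for piece in _BOUNDARY.split(text):
        if sentences and _ABBREV_TAIL.search(sentences[-1]):
            # The previous piece looks like it ended in an abbreviation:
            # glue this piece back on instead of starting a new sentence.
            sentences[-1] += ' ' + piece
        else:
            sentences.append(piece)
    return sentences


print(split_sentences("Dr. Smith paid 3.5 dollars. Was it worth it?! Yes."))
```

In this sketch the decimal `3.5` is never a candidate because no whitespace follows its period, and `?!` is kept with its sentence. The cost of the conservative rule is visible with input such as "He moved to the U.S. Then he left.", which stays as one sentence because the text before the boundary looks like an abbreviation.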
Initializes regex patterns for sentence splitting.
Methods:
- split – Splits text into sentences using rule-based regex patterns.
Source code in src/chunklet/sentence_splitter/_fallback_splitter.py
split
Splits text into sentences using rule-based regex patterns.
Parameters:
- text (str) – The input text to be segmented into sentences.
Returns:
- List[str] – A list of sentences after segmentation.
Notes
- Normalizes numbered lists during splitting and restores them afterward.
- Handles punctuation, newlines, and common edge cases.
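
The first note, normalizing numbered lists and restoring them afterward, can be sketched as a protect-split-restore pass. This is an assumption about the general technique, not the library's actual code: `split_with_numbered_lists`, `_LIST_MARKER`, and the placeholder character are all hypothetical names and choices.

```python
import re
from typing import List

# A list marker such as "1." or "12." at the start of a line.
_LIST_MARKER = re.compile(r'^(\s*\d+)\.(?=\s)', re.MULTILINE)

# Hypothetical stand-in for the marker's period (U+2024 ONE DOT LEADER).
# Assumes the input text does not already contain this character.
_PLACEHOLDER = '\u2024'

# Naive boundary: whitespace after ., ! or ?, or any run of newlines.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+|\n+')


def split_with_numbered_lists(text: str) -> List[str]:
    """Split text while keeping numbered-list markers attached to their items."""
    # 1. Normalize: hide the period of each list marker from the splitter.
    protected = _LIST_MARKER.sub(lambda m: m.group(1) + _PLACEHOLDER, text)
    # 2. Split on the naive boundary pattern and drop empty pieces.
    parts = [p.strip() for p in _BOUNDARY.split(protected)]
    # 3. Restore the original periods in the surviving pieces.
    return [p.replace(_PLACEHOLDER, '.') for p in parts if p]


print(split_with_numbered_lists("Steps:\n1. Open the file.\n2. Read it."))
```

Without step 1, the naive boundary pattern would break "1. Open the file." into "1." and "Open the file."; hiding the marker's period avoids that without special-casing digits in the boundary pattern itself.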