chunklet.plain_text_chunker
Classes:

- PlainTextChunker – A powerful text chunking utility offering flexible strategies for optimal text segmentation.
PlainTextChunker
PlainTextChunker(
sentence_splitter: Any | None = None,
verbose: bool = False,
continuation_marker: str = "...",
token_counter: Callable[[str], int] | None = None,
)
Bases: BaseChunker
A powerful text chunking utility offering flexible strategies for optimal text segmentation.
Key Features:

- Flexible constraint-based chunking: segment text by specifying limits on sentence count, token count, and section breaks, or any combination of them.
- Clause-level overlap: ensures semantic continuity between chunks by overlapping at natural clause boundaries, with a customizable continuation marker.
- Multilingual support: leverages language-specific algorithms and language detection for broad coverage.
- Pluggable token counters: integrate custom token-counting functions (e.g., for specific LLM tokenizers).
- Parallel processing: efficiently handles batch chunking of multiple texts using multiprocessing.
- Memory-friendly batching: yields chunks one at a time, reducing memory usage, especially for very large documents.

Initialize the PlainTextChunker.
Parameters:

- sentence_splitter (BaseSplitter, default: None) – An optional BaseSplitter instance. If None, a default SentenceSplitter is initialized.
- verbose (bool, default: False) – Enable verbose logging.
- continuation_marker (str, default: '...') – The marker to prepend to unfitted clauses.
- token_counter (Callable[[str], int], default: None) – Function that counts tokens in text. If None, one must be provided when calling the chunk() methods.

Raises:

- InvalidInputError – If any of the input arguments are invalid, or if the provided sentence_splitter is not an instance of BaseSplitter.
Methods:

- batch_chunk – Processes a batch of texts in parallel, splitting each into chunks.
- chunk – Chunks a single text into smaller pieces based on specified parameters.

Attributes:

- verbose (bool) – Get the verbosity status.
Source code in src/chunklet/plain_text_chunker.py
batch_chunk
batch_chunk(
texts: restricted_iterable(str),
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
separator: Any = None,
base_metadata: dict[str, Any] | None = None,
n_jobs: Annotated[int, Field(ge=1)] | None = None,
show_progress: bool = True,
on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Any, None, None]
Processes a batch of texts in parallel, splitting each into chunks. Leverages multiprocessing for efficient batch chunking.
If a task fails, chunklet stops processing and returns the results of the tasks that completed successfully, preventing wasted work.
Parameters:

- texts (restricted_iterable[str]) – A restricted iterable of input texts to be chunked.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (int, default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (int, default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (int, default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (int, default: 20) – Percentage of overlap between chunks (0-75).
- offset (int, default: 0) – Starting sentence offset for chunking.
- token_counter (callable, default: None) – The token counting function. Required if max_tokens is set.
- separator (Any, default: None) – A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.
- base_metadata (dict[str, Any], default: None) – Optional dictionary to be included with each chunk.
- n_jobs (int | None, default: None) – Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.
- show_progress (bool, default: True) – Flag to show or hide the progress bar.
- on_errors (Literal['raise', 'skip', 'break'], default: 'raise') – How to handle errors during processing.
Yields:

- Any – A Box object containing the chunk content and metadata, or the separator object.

Raises:

- InvalidInputError – If texts is not an iterable of strings, or if n_jobs is less than 1.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
Source code in src/chunklet/plain_text_chunker.py
chunk
chunk(
text: str,
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
base_metadata: dict[str, Any] | None = None
) -> list[Box]
Chunks a single text into smaller pieces based on specified parameters. Supports flexible constraint-based chunking, clause-level overlap, and custom token counters.
Parameters:

- text (str) – The input text to chunk.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (int, default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (int, default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (int, default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (int, default: 20) – Percentage of overlap between chunks (0-75).
- offset (int, default: 0) – Starting sentence offset for chunking.
- token_counter (callable, default: None) – Optional token counting function. Required for token-based modes only.
- base_metadata (dict[str, Any], default: None) – Optional dictionary to be included with each chunk.
Returns:

- list[Box] – A list of Box objects, each containing the chunk content and metadata.

Raises:

- InvalidInputError – If any chunking configuration parameter is invalid.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
Source code in src/chunklet/plain_text_chunker.py