chunklet.plain_text_chunker

Classes:

- PlainTextChunker – A powerful text chunking utility offering flexible strategies for optimal text segmentation.

PlainTextChunker
PlainTextChunker(
sentence_splitter: Any | None = None,
verbose: bool = False,
continuation_marker: str = "...",
token_counter: Callable[[str], int] | None = None,
)
A powerful text chunking utility offering flexible strategies for optimal text segmentation.
Key Features:

- Flexible constraint-based chunking: Segment text by specifying limits on sentence count, token count, and section breaks, or any combination of them.
- Clause-level overlap: Ensures semantic continuity between chunks by overlapping at natural clause boundaries, with a customizable continuation marker.
- Multilingual support: Leverages language-specific algorithms and language detection for broad coverage.
- Pluggable token counters: Integrate custom token counting functions (e.g., for specific LLM tokenizers).
- Parallel processing: Efficiently handles batch chunking of multiple texts using multiprocessing.
- Memory-friendly batching: Yields chunks one at a time, reducing memory usage, especially for very large documents.

Initialize the PlainTextChunker.
Parameters:

- sentence_splitter (BaseSplitter, default: None) – An optional BaseSplitter instance. If None, a default SentenceSplitter will be initialized.
- verbose (bool, default: False) – Enable verbose logging.
- continuation_marker (str, default: '...') – The marker to prepend to unfitted clauses.
- token_counter (Callable[[str], int], default: None) – Function that counts tokens in text. If None, one must be provided when calling the chunk() methods.
Raises:

- InvalidInputError – If any of the input arguments are invalid or if the provided sentence_splitter is not an instance of BaseSplitter.
Methods:

- batch_chunk – Processes a batch of texts in parallel, splitting each into chunks.
- chunk – Chunks a single text into smaller pieces based on specified parameters.

Attributes:

- verbose (bool) – Get the verbosity status.
Source code in src/chunklet/plain_text_chunker.py
batch_chunk
batch_chunk(
texts: restricted_iterable(str),
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
separator: Any = None,
base_metadata: dict[str, Any] | None = None,
n_jobs: Annotated[int, Field(ge=1)] | None = None,
show_progress: bool = True,
on_errors: Literal["raise", "skip", "break"] = "raise"
) -> Generator[Any, None, None]
Processes a batch of texts in parallel, splitting each into chunks. Leverages multiprocessing for efficient batch chunking.
Depending on on_errors, a failed task can raise immediately, be skipped, or stop further processing while still returning the results of the tasks that completed successfully, preventing wasted work.
Parameters:

- texts (restricted_iterable[str]) – A restricted iterable of input texts to be chunked.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (int, default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (int, default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (int, default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (int, default: 20) – Percentage of overlap between chunks (0-75).
- offset (int, default: 0) – Starting sentence offset for chunking.
- token_counter (callable, default: None) – The token counting function. Required if max_tokens is set.
- separator (Any, default: None) – A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.
- base_metadata (dict[str, Any], default: None) – Optional dictionary of metadata to be included with each chunk.
- n_jobs (int | None, default: None) – Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.
- show_progress (bool, default: True) – Flag to show or hide the progress bar.
- on_errors (Literal['raise', 'skip', 'break'], default: 'raise') – How to handle errors during processing.
Yields:

- Any – A Box object containing the chunk content and metadata, or the separator object.
Raises:

- InvalidInputError – If texts is not an iterable of strings, or if n_jobs is less than 1.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
Source code in src/chunklet/plain_text_chunker.py
chunk
chunk(
text: str,
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
base_metadata: dict[str, Any] | None = None
) -> list[Box]
Chunks a single text into smaller pieces based on specified parameters. Supports flexible constraint-based chunking, clause-level overlap, and custom token counters.
Parameters:

- text (str) – The input text to chunk.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (int, default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (int, default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (int, default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (int, default: 20) – Percentage of overlap between chunks (0-75).
- offset (int, default: 0) – Starting sentence offset for chunking.
- token_counter (callable, default: None) – Optional token counting function. Required for token-based modes only.
- base_metadata (dict[str, Any], default: None) – Optional dictionary of metadata to be included with each chunk.
Returns:

- list[Box] – A list of Box objects, each containing the chunk content and metadata.
Raises:

- InvalidInputError – If any chunking configuration parameter is invalid.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
Source code in src/chunklet/plain_text_chunker.py