chunklet.document_chunker._plain_text_chunker
Classes:
- PlainTextChunker – A powerful text chunking utility offering flexible strategies for optimal text segmentation.
PlainTextChunker
PlainTextChunker(
sentence_splitter: Any | None = None,
verbose: bool = False,
continuation_marker: str = "...",
token_counter: Callable[[str], int] | None = None,
)
A powerful text chunking utility offering flexible strategies for optimal text segmentation.
Key Features
- Flexible Constraint-Based Chunking: Segment text by specifying limits on sentence count, token count, and section breaks, or any combination of them.
- Clause-Level Overlap: Ensures semantic continuity between chunks by overlapping at natural clause boundaries, with a customizable continuation marker.
- Multilingual Support: Leverages language-specific algorithms and detection for broad coverage.
- Pluggable Token Counters: Integrate custom token counting functions (e.g., for specific LLM tokenizers).
- Parallel Processing: Efficiently handles batch chunking of multiple texts using multiprocessing.
- Memory-Friendly Batching: Yields chunks one at a time, reducing memory usage, especially for very large documents.
Initialize the PlainTextChunker.
Parameters:
- sentence_splitter (Any | None, default: None) – An optional BaseSplitter instance. If None, a default SentenceSplitter will be initialized.
- verbose (bool, default: False) – Enable verbose logging.
- continuation_marker (str, default: '...') – The marker to prepend to unfitted clauses.
- token_counter (Callable[[str], int] | None, default: None) – Function that counts tokens in text. If None, a counter must be passed to the chunking methods whenever max_tokens is used.
Raises:
- InvalidInputError – If any of the input arguments are invalid or if the provided sentence_splitter is not an instance of BaseSplitter.
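The token_counter parameter accepts any callable of type Callable[[str], int]. A minimal sketch of such a callable follows. The name simple_token_counter and its regex-based word-and-punctuation heuristic are assumptions made for illustration, not part of chunklet; a tokenizer matching the target LLM would normally be more accurate.

```python
import re
from typing import Callable

# Hypothetical heuristic: a "token" is a run of word characters
# or a single punctuation mark. Real LLM tokenizers differ.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_token_counter(text: str) -> int:
    """Approximate the token count of `text` using a word/punctuation heuristic."""
    return len(_TOKEN_PATTERN.findall(text))


# The annotation shows that the helper matches the documented callable shape.
token_counter: Callable[[str], int] = simple_token_counter

print(simple_token_counter("Hello, world! This is a test."))  # 9
```

A function like this could then be passed as token_counter=simple_token_counter, either to the constructor or to an individual chunking call.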
Methods:
- batch_chunk – Processes a batch of texts in parallel, splitting each into chunks.
- chunk – Chunks a single text into smaller pieces based on specified parameters.
Attributes:
- verbose (bool) – Get the verbosity status.
Source code in src/chunklet/document_chunker/_plain_text_chunker.py
batch_chunk
batch_chunk(
texts: IterableOfStr,
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
separator: Any = None,
base_metadata: dict[str, Any] | None = None,
n_jobs: Annotated[int, Field(ge=1)] | None = None,
show_progress: bool = True,
on_errors: Literal["raise", "skip", "break"] = "raise",
) -> Generator[Any, None, None]
Processes a batch of texts in parallel, splitting each into chunks. Leverages multiprocessing for efficient batch chunking.
If a task fails, chunklet stops processing and returns the results of the tasks that completed successfully, preventing wasted work.
Parameters:
- texts (IterableOfStr) – A non-string iterable of input texts to be chunked.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (Annotated[int | None, Field(ge=12)], default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (Annotated[int | None, Field(ge=1)], default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (Annotated[int | None, Field(ge=1)], default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (Annotated[int, Field(ge=0, le=75)], default: 20) – Percentage of overlap between chunks (0-75).
- offset (Annotated[int, Field(ge=0)], default: 0) – Starting sentence offset for chunking.
- token_counter (Callable[[str], int] | None, default: None) – The token counting function. Required if max_tokens is set.
- separator (Any, default: None) – A value to be yielded after the chunks of each text are processed. Note: None cannot be used as a separator.
- base_metadata (dict[str, Any] | None, default: None) – Optional dictionary to be included with each chunk.
- n_jobs (Annotated[int, Field(ge=1)] | None, default: None) – Number of parallel workers to use. If None, uses all available CPUs. Must be >= 1 if specified.
- show_progress (bool, default: True) – Flag to show or disable the loading bar.
- on_errors (Literal['raise', 'skip', 'break'], default: 'raise') – How to handle errors during processing.
Yields:
- Any – A DotDict object containing the chunk content and metadata, or the separator object.
Raises:
- InvalidInputError – If texts is not an iterable of strings, or if n_jobs is less than 1.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
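Because batch_chunk yields the chunks of all texts as one flat stream, the separator value is what lets a consumer tell where one text ends and the next begins. The sketch below regroups such a stream into one list per text. The helper group_by_separator, the END_OF_TEXT sentinel, and the plain dictionaries standing in for DotDict chunks are all hypothetical; the stream is simulated, so chunklet is not called.

```python
from typing import Any, Iterable, Iterator


def group_by_separator(stream: Iterable[Any], separator: Any) -> Iterator[list]:
    """Yield one list of chunks per input text, splitting the stream on `separator`."""
    current: list = []
    for item in stream:
        if item == separator:
            # The separator marks the end of one text's chunks.
            yield current
            current = []
        else:
            current.append(item)
    if current:
        # Trailing chunks with no closing separator, e.g. if processing stopped early.
        yield current


# Hypothetical sentinel; any value other than None would do.
END_OF_TEXT = "<END_OF_TEXT>"

# Simulated output of a batch call made with separator=END_OF_TEXT.
# Real chunks would be DotDict objects; plain dicts are placeholders here.
fake_stream = [
    {"text": "first text, chunk 1"},
    {"text": "first text, chunk 2"},
    END_OF_TEXT,
    {"text": "second text, chunk 1"},
    END_OF_TEXT,
]

groups = list(group_by_separator(fake_stream, END_OF_TEXT))
print([len(group) for group in groups])  # [2, 1]
```

Consuming the generator lazily in this way keeps only one text's chunks in memory at a time, which fits the one-at-a-time yielding described above.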
chunk
chunk(
text: str,
*,
lang: str = "auto",
max_tokens: Annotated[int | None, Field(ge=12)] = None,
max_sentences: Annotated[
int | None, Field(ge=1)
] = None,
max_section_breaks: Annotated[
int | None, Field(ge=1)
] = None,
overlap_percent: Annotated[
int, Field(ge=0, le=75)
] = 20,
offset: Annotated[int, Field(ge=0)] = 0,
token_counter: Callable[[str], int] | None = None,
base_metadata: dict[str, Any] | None = None,
) -> list[DotDict]
Chunks a single text into smaller pieces based on specified parameters. Supports flexible constraint-based chunking, clause-level overlap, and custom token counters.
Parameters:
- text (str) – The input text to chunk.
- lang (str, default: 'auto') – The language of the text (e.g., 'en', 'fr', 'auto').
- max_tokens (Annotated[int | None, Field(ge=12)], default: None) – Maximum number of tokens per chunk. Must be >= 12.
- max_sentences (Annotated[int | None, Field(ge=1)], default: None) – Maximum number of sentences per chunk. Must be >= 1.
- max_section_breaks (Annotated[int | None, Field(ge=1)], default: None) – Maximum number of section breaks per chunk. Must be >= 1.
- overlap_percent (Annotated[int, Field(ge=0, le=75)], default: 20) – Percentage of overlap between chunks (0-75).
- offset (Annotated[int, Field(ge=0)], default: 0) – Starting sentence offset for chunking.
- token_counter (Callable[[str], int] | None, default: None) – Optional token counting function. Required for token-based modes only.
- base_metadata (dict[str, Any] | None, default: None) – Optional dictionary to be included with each chunk.
Returns:
- list[DotDict] – A list of DotDict objects, each containing the chunk content and metadata.
Raises:
- InvalidInputError – If any chunking configuration parameter is invalid.
- MissingTokenCounterError – If max_tokens is provided but no token_counter is provided.
- CallbackError – If an error occurs during sentence splitting or token counting within a chunking task.
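The numeric bounds in the signature (max_tokens >= 12, max_sentences >= 1, max_section_breaks >= 1, overlap_percent in 0-75, offset >= 0) can be checked before a call, for example when the values come from user configuration. The following is a rough sketch of such a pre-check. The function check_chunk_limits is hypothetical and only mirrors the documented bounds; the library performs its own validation and raises InvalidInputError.

```python
from typing import List, Optional


def check_chunk_limits(
    max_tokens: Optional[int] = None,
    max_sentences: Optional[int] = None,
    max_section_breaks: Optional[int] = None,
    overlap_percent: int = 20,
    offset: int = 0,
) -> List[str]:
    """Return descriptions of every documented bound that the arguments violate."""
    problems: List[str] = []
    if max_tokens is not None and max_tokens < 12:
        problems.append("max_tokens must be >= 12")
    if max_sentences is not None and max_sentences < 1:
        problems.append("max_sentences must be >= 1")
    if max_section_breaks is not None and max_section_breaks < 1:
        problems.append("max_section_breaks must be >= 1")
    if not 0 <= overlap_percent <= 75:
        problems.append("overlap_percent must be between 0 and 75")
    if offset < 0:
        problems.append("offset must be >= 0")
    return problems


print(check_chunk_limits(max_tokens=10, overlap_percent=80))
# ['max_tokens must be >= 12', 'overlap_percent must be between 0 and 75']
print(check_chunk_limits(max_sentences=3))  # []
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid setting at once.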