chunklet.document_chunker.registry
Classes:
CustomProcessorRegistry
Methods:
- clear – Clears all registered processors from the registry.
- extract_data – Processes a file using a processor registered for the given file extension.
- is_registered – Checks whether a document processor is registered for the given file extension.
- register – Registers a document processor callback for one or more file extensions.
- unregister – Removes document processor(s) from the registry.
Attributes:
- processors – Returns a shallow copy of the dictionary of registered processors.
processors
property
Returns a shallow copy of the dictionary of registered processors.
This prevents external modification of the internal registry state.
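The copy semantics described above can be illustrated with a small stand-in class. This is a minimal sketch, not chunklet's implementation: `ToyRegistry`, its `_processors` attribute, and the `(callback, name)` value layout are all assumptions made for the example.

```python
class ToyRegistry:
    """Hypothetical stand-in, not chunklet's CustomProcessorRegistry."""

    def __init__(self):
        # Assumed internal layout: extension -> (callback, processor name)
        self._processors = {}

    @property
    def processors(self):
        # Return a new dict so callers cannot add or remove internal entries
        return self._processors.copy()


def read_text(file_path):
    with open(file_path, encoding="utf-8") as f:
        return f.read()


toy = ToyRegistry()
toy._processors[".txt"] = (read_text, "read_text")

snapshot = toy.processors
snapshot[".md"] = (read_text, "read_text")  # changes only the copy
del snapshot[".txt"]                        # also only the copy

print(sorted(toy.processors))  # ['.txt']
```

Because the copy is shallow, the callbacks stored as values are still the same objects; only the mapping itself is protected from outside edits.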
clear
Clears all registered processors from the registry.
extract_data
Processes a file using a processor registered for the given file extension.
Parameters:
- file_path (str) – The path to the file.
- ext (str) – The file extension.
Returns:
- tuple[ReturnType, str] – A tuple containing the extracted data and the name of the processor used.
Raises:
- CallbackError – If the processor callback fails or returns the wrong type.
- InvalidInputError – If no processor is registered for the extension.
Examples:
>>> from chunklet.document_chunker.registry import CustomProcessorRegistry
>>> registry = CustomProcessorRegistry()
>>> @registry.register(".txt", name="my_txt_processor")
... def process_txt(file_path: str) -> tuple[str, dict]:
...     with open(file_path, 'r') as f:
...         content = f.read()
...     return content, {"source": file_path}
>>> # Assuming 'sample.txt' exists with some content
>>> # result, processor_name = registry.extract_data("sample.txt", ".txt")
>>> # print(f"Extracted by {processor_name}: {result[0][:20]}...")
Source code in src/chunklet/document_chunker/registry.py
is_registered
Check if a document processor is registered for the given file extension.
register
Register a document processor callback for one or more file extensions.
This method can be used in two ways:
1. As a decorator:
   @registry.register(".json", ".xml", name="my_processor")
   def my_processor(file_path): ...
2. As a direct function call:
   registry.register(my_processor, ".json", ".xml", name="my_processor")
Parameters:
- *args (Any, default: ()) – The arguments, which can be either (ext1, ext2, ...) for a decorator or (callback, ext1, ext2, ...) for a direct call.
- name (str | None, default: None) – The name of the processor. If None, attempts to use the callback's name.
unregister
Remove document processor(s) from the registry.
Parameters:
- *exts (str, default: ()) – File extensions to remove.