compass.web.file_loader.AsyncLocalDoclingFileLoader
- class AsyncLocalDoclingFileLoader(file_cache_coroutine=None, doc_attrs=None, to_md_kwargs=None, pytesseract_exe_fp=None, **__)
Bases: BaseAsyncFileLoader

Async local file loader using Docling.
- Parameters:
  - file_cache_coroutine (callable(), optional) – File caching coroutine. Can be used to cache files downloaded by this class. Must accept a BaseDocument instance as the first argument and the file content to be written as the second argument. If this coroutine is not provided, no document caching is performed. By default, None.
  - doc_attrs (dict, optional) – Additional document attributes to add to each loaded document. By default, None.
  - to_md_kwargs (dict, optional) – Keyword-value argument pairs to pass to Docling's export_to_markdown() method for converting the raw content to a markdown document. Can be useful to specify image placeholders (e.g. "image_placeholder"="") or page break placeholders (e.g. "page_break_placeholder"="<!-- page break -->"). By default, None.
  - pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google's Tesseract. By default, None.
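
The parameters above can be combined roughly as follows. This is a minimal sketch, not the library's own recipe: `make_file_cache`, `build_loader`, the hash-based file naming, the `"doc_cache"` directory, and the `{"origin": "local"}` attribute are all hypothetical choices made for illustration. The only contract taken from the description is that the caching coroutine receives the document first and the file content second.

```python
import asyncio
import hashlib
from pathlib import Path


def make_file_cache(cache_dir):
    """Build a caching coroutine that writes file content under `cache_dir`.

    Hypothetical helper: the file name is the SHA-256 of the content, so
    the sketch does not depend on any particular document attribute.
    """
    cache_dir = Path(cache_dir)

    async def cache_file(doc, file_content):
        # First argument: the document instance (unused in this sketch).
        # Second argument: the content to write (str or bytes-like).
        if isinstance(file_content, str):
            data = file_content.encode("utf-8")
        else:
            data = bytes(file_content)
        out_fp = cache_dir / hashlib.sha256(data).hexdigest()

        def _write():
            cache_dir.mkdir(parents=True, exist_ok=True)
            out_fp.write_bytes(data)

        # Run the blocking write in a worker thread (Python 3.9+).
        await asyncio.to_thread(_write)
        return out_fp

    return cache_file


# Placeholder values quoted in the `to_md_kwargs` description.
TO_MD_KWARGS = {
    "image_placeholder": "",
    "page_break_placeholder": "<!-- page break -->",
}


def build_loader(cache_dir="doc_cache"):
    """Construct the loader; the import is deferred so it is only
    attempted when this function is actually called."""
    from compass.web.file_loader import AsyncLocalDoclingFileLoader

    return AsyncLocalDoclingFileLoader(
        file_cache_coroutine=make_file_cache(cache_dir),
        doc_attrs={"origin": "local"},  # illustrative attribute only
        to_md_kwargs=TO_MD_KWARGS,
    )
```

Naming cached files by content hash means identical content maps to the same file. A real cache would more likely derive the name from the document itself.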
Methods
- fetch(source) – Fetch a document for the given source.
- fetch_all(*sources) – Fetch documents for all requested sources.
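
A possible way to call the two methods is sketched below. It assumes both are awaitable (the class is an async loader), that a source for this local loader is a file path, and that `fetch_all` returns an iterable of documents. None of these details are spelled out on this page. `load_local_docs` is a hypothetical wrapper, and its `loader` argument exists only so a substitute object can be passed in.

```python
import asyncio


async def load_local_docs(*file_paths, loader=None):
    """Return a list of documents for the given local file paths.

    Hypothetical wrapper: uses `fetch` for a single path and
    `fetch_all` when several paths are given.
    """
    if loader is None:
        # Deferred import: only needed when no loader is supplied.
        from compass.web.file_loader import AsyncLocalDoclingFileLoader

        loader = AsyncLocalDoclingFileLoader()

    if len(file_paths) == 1:
        return [await loader.fetch(file_paths[0])]
    return list(await loader.fetch_all(*file_paths))


# Example call (the paths are placeholders):
# docs = asyncio.run(load_local_docs("report.pdf", "ordinance.pdf"))
```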