compass.web.file_loader.AsyncLocalDoclingFileLoader

class AsyncLocalDoclingFileLoader(file_cache_coroutine=None, doc_attrs=None, to_md_kwargs=None, pytesseract_exe_fp=None, **__)

Bases: BaseAsyncFileLoader

Async local file loader using Docling

Parameters:
  • file_cache_coroutine (callable(), optional) – File caching coroutine. Can be used to cache files downloaded by this class. Must accept a BaseDocument instance as the first argument and the file content to be written as the second argument. If this coroutine is not provided, no document caching is performed. By default, None.

  • doc_attrs (dict, optional) – Additional document attributes to add to each loaded document. By default, None.

  • to_md_kwargs (dict, optional) – Keyword-value argument pairs to pass to Docling’s export_to_markdown() method for converting the raw content to a markdown document. Can be useful to specify image placeholders (e.g. "image_placeholder"="") or page break placeholders (e.g. "page_break_placeholder"="<!-- page break -->"). By default, None.

  • pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google’s Tesseract. By default, None.
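
As a rough illustration of the parameters above, the sketch below defines a caching coroutine with the documented two-argument shape (document first, file content second) and collects the constructor keyword arguments in a dict. Several details are assumptions made for the example and are not confirmed by this page: the cache directory, the use of an `attrs` dict with a "source" entry on the document, the "project" attribute, and the possibility that the content arrives as either bytes or str. The loader construction itself is left commented out so the snippet runs without the package installed.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Hypothetical cache location chosen for this example only.
CACHE_DIR = Path(tempfile.gettempdir()) / "docling_loader_cache"


async def cache_file(doc, content):
    """Write ``content`` to disk; ``doc`` is the loaded document instance."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Assumption: the document exposes an ``attrs`` dict with a "source" entry.
    source = str(getattr(doc, "attrs", {}).get("source", "unknown"))
    name = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
    out_fp = CACHE_DIR / name
    # Assumption: content may arrive as bytes or str, so handle both.
    data = content if isinstance(content, bytes) else str(content).encode("utf-8")
    # Run the blocking write in a worker thread to keep the event loop free.
    await asyncio.to_thread(out_fp.write_bytes, data)
    return out_fp


# Keyword arguments mirroring the parameters documented above.
loader_kwargs = {
    "file_cache_coroutine": cache_file,
    "doc_attrs": {"project": "demo"},  # hypothetical attribute
    "to_md_kwargs": {
        "image_placeholder": "",
        "page_break_placeholder": "<!-- page break -->",
    },
}

# With the package installed, the loader would be built as:
# from compass.web.file_loader import AsyncLocalDoclingFileLoader
# loader = AsyncLocalDoclingFileLoader(**loader_kwargs)

# Quick self-check of the coroutine with a stand-in document object.
fake_doc = SimpleNamespace(attrs={"source": "report.pdf"})
cached_fp = asyncio.run(cache_file(fake_doc, b"%PDF-1.7 demo"))
```

Hashing the source string keeps cache file names filesystem-safe regardless of what the source looks like; any other naming scheme would work equally well.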

Methods

fetch(source)

Fetch a document for the given source.

fetch_all(*sources)

Fetch documents for all requested sources.

async fetch(source)

Fetch a document for the given source.

Parameters:

source (str) – Source used to load the document.

Returns:

elm.web.document.Document – Document instance containing text, if the load was successful.

async fetch_all(*sources)

Fetch documents for all requested sources.

Parameters:

*sources – One or more sources (as strings), passed as separate positional arguments, used to fetch the documents.

Returns:

list – List of documents, one per requested source.
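
Because `fetch_all` takes its sources as variadic positional arguments, a list of paths has to be unpacked at the call site. The sketch below shows that calling pattern. `StubLoader` is a hypothetical stand-in that only mimics the documented `fetch`/`fetch_all` signatures so the example runs without the package; it returns plain dicts rather than real document instances, and its in-order results are a property of the stub, not something this page states about the real loader.

```python
import asyncio


class StubLoader:
    """Stand-in with the same fetch/fetch_all shape as the documented loader."""

    async def fetch(self, source):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return {"source": source, "text": f"contents of {source}"}

    async def fetch_all(self, *sources):
        # Fetch every source concurrently and collect the results in a list.
        return list(await asyncio.gather(*(self.fetch(s) for s in sources)))


async def load_documents(loader, paths):
    """Fetch every path with one call, unpacking the list into *sources."""
    return await loader.fetch_all(*paths)


paths = ["a.pdf", "b.pdf"]
docs = asyncio.run(load_documents(StubLoader(), paths))
```

With the real class, `StubLoader()` would be replaced by an `AsyncLocalDoclingFileLoader` instance, and `load_documents` would be used unchanged.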