compass.web.file_loader.AsyncDoclingWebFileLoader
- class AsyncDoclingWebFileLoader(header_template=None, verify_ssl=True, aget_kwargs=None, pw_launch_kwargs=None, html_read_kwargs=None, html_read_coroutine=None, file_cache_coroutine=None, browser_semaphore=None, use_scrapling_stealth=False, num_pw_html_retries=3, to_md_kwargs=None, pytesseract_exe_fp=None, **__)
Bases: BaseAsyncFileLoader
Async web file loader using Docling
- Parameters:
  header_template (dict, optional) – Optional GET header template. If not specified, uses DEFAULT_HEADERS. By default, None.
  verify_ssl (bool, optional) – Option to use aiohttp's default SSL check. If False, SSL certificate validation is skipped. By default, True.
  aget_kwargs (dict, optional) – Other kwargs to pass to aiohttp.ClientSession.get(). By default, None.
  pw_launch_kwargs (dict, optional) – Keyword-value argument pairs to pass to async_playwright.chromium.launch (only used when reading HTML). By default, None.
  html_read_kwargs (dict, optional) – Keyword-value argument pairs to pass to the html_read_coroutine. By default, None.
  html_read_coroutine (callable(), optional) – HTML file read coroutine. Must be an async function. Should accept HTML text as the first argument and kwargs as the rest. Must return an elm.web.document.HTMLDocument. If None, a default function that runs in the main thread is used. By default, None.
  file_cache_coroutine (callable(), optional) – File caching coroutine. Can be used to cache files downloaded by this class. Must accept a BaseDocument instance as the first argument and the file content to be written as the second argument. If this coroutine is not provided, no document caching is performed. By default, None.
  browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.
  use_scrapling_stealth (bool, default False) – Option to use scrapling stealth scripts instead of playwright-stealth. By default, False.
  num_pw_html_retries (int, default 3) – Number of attempts to load HTML content. This is useful because the playwright parameters are stochastic, and sometimes a combination of them can fail to load HTML. The default value is likely a good balance between processing attempts and retrieval success. Note that the minimum number of attempts will always be 2, even if the user provides a value smaller than this. By default, 3.
  to_md_kwargs (dict, optional) – Keyword-value argument pairs to pass to Docling's export_to_markdown() method for converting the raw content to a markdown document. Can be useful to specify image placeholders (e.g. "image_placeholder"="") or page break placeholders (e.g. "page_break_placeholder"="<!-- page break -->"). By default, None.
  pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google's Tesseract. By default, None.
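The parameter descriptions above leave the shape of a file_cache_coroutine fairly open, so the following is a minimal sketch of one possible implementation together with a keyword dictionary that could be passed to the constructor. The names cache_file, CACHE_DIR, and loader_kwargs are hypothetical, and the lookup of a "source" entry in the document's attrs is an assumption about BaseDocument, not something this page confirms.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="loader_cache_"))


async def cache_file(doc, file_content):
    """Write downloaded content to CACHE_DIR and return the output path.

    Follows the documented contract: the document comes first, the
    content to be written comes second.
    """
    # Assumption: the document carries an ``attrs`` dict with a "source"
    # entry. Fall back to a fixed label if it does not.
    source = str(getattr(doc, "attrs", {}).get("source", "unknown"))
    name = hashlib.sha256(source.encode("utf-8")).hexdigest()[:16]
    out_fp = CACHE_DIR / name

    # Accept either bytes or text, since the content type is not specified.
    if isinstance(file_content, bytes):
        data = file_content
    else:
        data = str(file_content).encode("utf-8")

    # Run the blocking write in a worker thread to keep the event loop free.
    await asyncio.to_thread(out_fp.write_bytes, data)
    return out_fp


# Keyword arguments matching the documented constructor parameters.
loader_kwargs = {
    "verify_ssl": True,
    "file_cache_coroutine": cache_file,
    "num_pw_html_retries": 3,
    "to_md_kwargs": {
        "image_placeholder": "",
        "page_break_placeholder": "<!-- page break -->",
    },
}
```

With the package installed, the dictionary could be unpacked as AsyncDoclingWebFileLoader(**loader_kwargs). A browser_semaphore such as asyncio.Semaphore(2) is probably best created inside the running event loop and added to the dictionary there.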
Methods
  fetch(source) – Fetch a document for the given source.
  fetch_all(*sources) – Fetch documents for all requested sources.
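As a rough illustration of how the two methods might be called, the sketch below wraps fetch_all in a small helper. The helper fetch_documents, the main function, and the example URLs are hypothetical. The sketch assumes that fetch_all is awaitable and returns an iterable of documents, which this page does not state explicitly.

```python
import asyncio


async def fetch_documents(loader, *sources):
    """Fetch every source with one loader and return the results as a list."""
    # Assumption: ``fetch_all`` is a coroutine returning an iterable of documents.
    docs = await loader.fetch_all(*sources)
    return list(docs)


async def main():
    # Imported here so the helper above stays usable without the package.
    from compass.web.file_loader import AsyncDoclingWebFileLoader

    loader = AsyncDoclingWebFileLoader(verify_ssl=True, num_pw_html_retries=3)
    docs = await fetch_documents(
        loader,
        "https://example.com/ordinance.pdf",  # placeholder URL
        "https://example.com/zoning.html",  # placeholder URL
    )
    for doc in docs:
        print(type(doc).__name__)


# Requires the compass package and network access:
# asyncio.run(main())
```

A single source could be retrieved the same way with await loader.fetch(source).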