elm.web.file_loader.AsyncHTMLLoader

class AsyncHTMLLoader(pw_launch_kwargs=None, html_read_kwargs=None, html_read_coroutine=None, browser_semaphore=None, use_scrapling_stealth=False, num_pw_html_retries=3)[source]

Bases: object

Loader specifically designed to load HTML documents from the web.

Parameters:

pw_launch_kwargs (dict, optional) – Keyword-value argument pairs to pass to async_playwright.chromium.launch() (only used when reading HTML). By default, None.
html_read_kwargs (dict, optional) – Keyword-value argument pairs to pass to the html_read_coroutine. By default, None.
html_read_coroutine (callable, optional) – HTML file read coroutine. Must be an async function. Should accept HTML text as the first argument and kwargs as the rest. Must return an elm.web.document.HTMLDocument. If None, a default function that runs in the main thread is used. By default, None.
browser_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of playwright browsers open concurrently. If None, no limits are applied. By default, None.
use_scrapling_stealth (bool, default=False) – Option to use scrapling stealth scripts instead of playwright-stealth. By default, False.
num_pw_html_retries (int, default=3) – Number of attempts to load HTML content. This is useful because the playwright parameters are stochastic, and sometimes a combination of them can fail to load HTML. The default value is likely a good balance between processing attempts and retrieval success. Note that the minimum number of attempts will always be 2, even if the user provides a value smaller than this. By default, 3.
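The effect of browser_semaphore can be illustrated without launching a browser. The following is a minimal sketch, not the library's implementation: StubLoader, fetch_all, and main are hypothetical names, and the stub only imitates the documented constructor argument and the fetch(url) call so that the concurrency limit can be observed.

```python
import asyncio


class StubLoader:
    """Hypothetical stand-in for AsyncHTMLLoader; it never opens a browser."""

    def __init__(self, browser_semaphore=None):
        self.browser_semaphore = browser_semaphore
        self.active = 0  # number of simulated page loads currently running
        self.peak = 0    # highest number of simultaneous loads observed

    async def fetch(self, url):
        # With no semaphore, no limit is applied (mirrors the documented default).
        if self.browser_semaphore is None:
            return await self._load(url)
        async with self.browser_semaphore:
            return await self._load(url)

    async def _load(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # placeholder for the real page load
        self.active -= 1
        return f"<html>{url}</html>"


async def fetch_all(loader, urls):
    """Fetch every URL concurrently through one shared loader."""
    return await asyncio.gather(*(loader.fetch(url) for url in urls))


async def main(limit=2, num_urls=6):
    # Create the semaphore inside the running event loop.
    loader = StubLoader(browser_semaphore=asyncio.Semaphore(limit))
    urls = [f"https://example.com/{i}" for i in range(num_urls)]
    docs = await fetch_all(loader, urls)
    return loader.peak, docs


if __name__ == "__main__":
    peak, docs = asyncio.run(main())
    print(f"peak concurrent loads: {peak}, documents: {len(docs)}")
```

With the real class, the same pattern would presumably amount to passing one shared asyncio.Semaphore as browser_semaphore and awaiting fetch for each URL.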

Methods

fetch(url[, raw_content, ct, charset])

Load an HTML doc from a URL

Attributes

PAGE_LOAD_TIMEOUT

Default page load timeout value in milliseconds

PAGE_LOAD_TIMEOUT = 60000

Default page load timeout value in milliseconds

async fetch(url, raw_content=None, ct=None, charset=None)[source]

Load an HTML doc from a URL

Parameters:

url (str) – URL to load HTML content from.
raw_content (bytes, optional) – Raw content bytes from the URL response. This is used in case the playwright HTML load fails and we need to try loading HTML from the response content. If not provided, this step is skipped. By default, None.
ct (str, optional) – Content type from the URL response. This is used to help determine if the response content can be processed as text in the case where the playwright HTML load fails. If not provided, this step is skipped. By default, None.
charset (str, optional) – Charset from the URL response. This is used to decode the response content in the case where the playwright HTML load fails and we need to try loading HTML from the response content. If not provided, this step is skipped. By default, None.

Returns:

HTMLDocument – Document instance containing the loaded text if the load was successful; otherwise, an empty document.
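The fallback described for raw_content, ct, and charset can be pictured roughly as follows. This is a hedged sketch rather than the library's code: decode_response_fallback is a hypothetical helper, and treating only "text/" content types as decodable is an assumption made for illustration.

```python
def decode_response_fallback(raw_content=None, ct=None, charset=None):
    """Return decoded text, or None when the fallback step is skipped."""
    # Skip the step if any of the three inputs was not provided.
    if not raw_content or not ct or not charset:
        return None
    # Assumption: only text-like content types are worth decoding.
    if not ct.lower().startswith("text/"):
        return None
    try:
        return raw_content.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes or an unknown charset name: give up quietly.
        return None


if __name__ == "__main__":
    print(decode_response_fallback(b"<p>hi</p>", "text/html", "utf-8"))
```

Under this reading, a caller that already holds the HTTP response would pass its body, content type, and charset to fetch so that a failed playwright load can still produce text.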