compass.scripts.download.download_jurisdiction_ordinances_from_website_compass_crawl
- async download_jurisdiction_ordinances_from_website_compass_crawl(website, heuristic, keyword_points, file_loader_kwargs=None, already_visited=None, num_link_scores_to_check_per_page=4, max_urls=100, crawl_semaphore=None, pb_jurisdiction_name=None)
Download ordinance documents from a website using the COMPASS crawler.
The COMPASS crawler is much simpler than the Crawl4AI crawler, but it is designed to access some links that Crawl4AI cannot (such as those behind a button interface).
- Parameters:
website (str) – URL of the jurisdiction website to search.
heuristic (callable()) – Callable taking an elm.web.document.BaseDocument and returning True when the document should be kept.
keyword_points (dict) – Dictionary of keyword points to use for scoring links. Keys are keywords, values are points to assign to links containing the keyword. If a link contains multiple keywords, the points are summed.
file_loader_kwargs (dict, optional) – Dictionary of keyword-argument pairs used to initialize elm.web.file_loader.AsyncWebFileLoader. If found, the "pw_launch_kwargs" key in these will also be used to initialize the elm.web.search.google.PlaywrightGoogleLinkSearch used for the Google URL search. By default, None.
already_visited (set of str, optional) – URLs that have already been crawled and should be skipped. By default, None.
num_link_scores_to_check_per_page (int, default 4) – Number of top-scoring links to visit per page. By default, 4.
max_urls (int, default 100) – Maximum number of URLs to check from the website before terminating the search. By default, 100.
crawl_semaphore (asyncio.Semaphore, optional) – Semaphore instance that can be used to limit the number of website crawls happening concurrently. If None, no limits are applied. By default, None.
pb_jurisdiction_name (str, optional) – Jurisdiction name used to update the progress bar, if one is being used. By default, None.
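The scoring rule described for keyword_points, combined with num_link_scores_to_check_per_page, can be sketched as follows. This is a minimal illustration and not the COMPASS implementation: score_link and top_links are hypothetical helpers, the keyword weights and URLs are made up, and both the case-insensitive substring matching and the dropping of zero-score links are assumptions.

```python
def score_link(link_text, keyword_points):
    """Sum the points of every keyword found in the link (hypothetical helper)."""
    text = link_text.casefold()
    return sum(
        points
        for keyword, points in keyword_points.items()
        if keyword.casefold() in text
    )


def top_links(links, keyword_points, num_to_check=4):
    """Return up to `num_to_check` links, highest score first.

    Links that match no keyword are dropped (an assumption of this sketch).
    """
    scored = [(score_link(link, keyword_points), link) for link in links]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in scored[:num_to_check]]


# Illustrative weights only; real values depend on the technology of interest.
keyword_points = {"ordinance": 10, "zoning": 5, "wind": 3}
links = [
    "https://example.gov/zoning-ordinance.pdf",
    "https://example.gov/contact",
    "https://example.gov/wind-energy",
]
best = top_links(links, keyword_points, num_to_check=2)
```

Here the first link contains two keywords, so its points are summed, while the contact page matches nothing and is never visited.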
- Returns:
out_docs (list) – List of BaseDocument instances containing potential ordinance information, or an empty list if no ordinance document was found.
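To illustrate how a heuristic callable determines what ends up in out_docs, here is a small self-contained sketch. It does not use the ELM classes: FakeDocument is a stand-in that assumes only a text attribute, and mentions_setback and keep_matching are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a document object; only a `text` attribute is assumed."""

    text: str


def mentions_setback(doc):
    """Example heuristic: keep documents whose text mentions setbacks."""
    return "setback" in doc.text.casefold()


def keep_matching(docs, heuristic):
    """Return the documents accepted by the heuristic (empty list if none)."""
    return [doc for doc in docs if heuristic(doc)]


docs = [
    FakeDocument("Wind turbine setback requirements are listed below."),
    FakeDocument("Office hours and contact information."),
]
out_docs = keep_matching(docs, mentions_setback)
```

A heuristic like this should be cheap and permissive, since it only decides whether a document is worth keeping for later inspection.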
Notes
Requires the TempFileCache service to be running.
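The effect of crawl_semaphore can be shown with a self-contained sketch in which one asyncio.Semaphore is shared across several crawls. The fake_crawl coroutine below is a placeholder for the real download function, and the URLs and the limit of two concurrent crawls are assumptions made for the example.

```python
import asyncio

# Tracks how many placeholder crawls run at the same time.
active = {"now": 0, "peak": 0}


async def fake_crawl(website, crawl_semaphore=None):
    """Placeholder crawl that honors an optional semaphore."""
    if crawl_semaphore is not None:
        await crawl_semaphore.acquire()
    try:
        active["now"] += 1
        active["peak"] = max(active["peak"], active["now"])
        await asyncio.sleep(0.01)  # stands in for network I/O
        active["now"] -= 1
        return website
    finally:
        if crawl_semaphore is not None:
            crawl_semaphore.release()


async def crawl_all(websites, max_concurrent=2):
    """Crawl every website, allowing at most `max_concurrent` at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [fake_crawl(site, semaphore) for site in websites]
    return await asyncio.gather(*tasks)


websites = [f"https://example.gov/county-{i}" for i in range(4)]
results = asyncio.run(crawl_all(websites))
```

Passing the same semaphore to every call is what enforces the limit; with crawl_semaphore left as None, all four placeholder crawls would run at once.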