compass.pipeline.data_classes.BaseRequest

class BaseRequest(out_dir, tech, jurisdiction_fp, *, model='gpt-4o-mini', llm_costs=None, num_urls_to_check_per_jurisdiction=5, max_num_concurrent_browsers=10, max_num_concurrent_website_searches=10, max_num_concurrent_jurisdictions=25, url_ignore_substrings=None, url_keep_substrings=None, known_local_docs=None, known_doc_urls=None, file_loader_kwargs=None, search_engines=None, simple_se_result_sort=True, pytesseract_exe_fp=None, td_kwargs=None, tpe_kwargs=None, ppe_kwargs=None, log_dir=None, clean_dir=None, ordinance_file_dir=None, jurisdiction_dbs_dir=None, perform_se_search=True, perform_website_search=True, make_paths_relative=False, log_level='INFO', keep_async_logs=False, collection_manifest_fp=None)

Bases: object

Parameter Object base class for pipeline requests

Parameters:
  • out_dir (path-like) – Path to the output directory. If it does not exist, it will be created. This directory will contain the saved collection manifest, downloaded ordinance documents, parsed document text, usage metadata, and default subdirectories for logs and intermediate outputs (unless otherwise specified).

  • tech (str) – Label indicating which technology type is being processed. Must be one of the keys of PLUGIN_REGISTRY.

  • jurisdiction_fp (path-like) – Path to a CSV file specifying the jurisdictions to process. The CSV must contain at least two columns: “County” and “State”, which specify the county and state names, respectively. To process a subdivision within a county, you must also include “Subdivision” and “Jurisdiction Type” columns. “Subdivision” should be the name of the subdivision, and “Jurisdiction Type” should be a string identifying the type of subdivision (e.g., “City”, “Township”, etc.).

  • model (str or list of dict, default "gpt-4o-mini") –

    LLM model(s) to use for scraping and parsing ordinance documents. If a string is provided, it is assumed to be the name of the default model (e.g., “gpt-4o”), and environment variables are used for authentication.

    If a list is provided, it should contain dictionaries of arguments that can initialize instances of OpenAIConfig. Each dictionary can specify the model name, client type, and initialization arguments.

    Each dictionary must also include a tasks key, which maps to a string or list of strings indicating the tasks that instance should handle. Exactly one of the instances must include “default” as a task, which will be used when no specific task is matched. For example:

    "model": [
        {
            "model": "gpt-4o-mini",
            "llm_call_kwargs": {
                "temperature": 0,
                "timeout": 300,
            },
            "client_kwargs": {
                "api_key": "<your_api_key>",
                "api_version": "<your_api_version>",
                "azure_endpoint": "<your_azure_endpoint>",
            },
            "tasks": ["default", "date_extraction"],
        },
        {
            "model": "gpt-4o",
            "client_type": "openai",
            "tasks": ["ordinance_text_extraction"],
        }
    ]
    

    Important

    You will need to ensure that the model name used here matches your deployment if you are using Azure OpenAI. For example, if you deployed the GPT-4o-mini model under the name "gpt-4o-mini-2025-04-11", you would want to set "model": "gpt-4o-mini-2025-04-11".

    By default, "gpt-4o-mini".

  • llm_costs (dict, optional) –

    Dictionary mapping model names to their token costs, used to track the estimated total cost of LLM usage during the run. The structure should be:

    {"model_name": {"prompt": float, "response": float}}
    

    Costs are specified in dollars per million tokens. For example:

    "llm_costs": {"my_gpt": {"prompt": 1.5, "response": 3}}
    

    registers a model named “my_gpt” with a cost of $1.5 per million input (prompt) tokens and $3 per million output (response) tokens for the current processing run.

    Note

    The displayed total cost does not track cached tokens, so treat it like an estimate. Your final API costs may vary.

    If set to None, no custom model costs are recorded, and cost tracking may be unavailable in the progress bar. By default, None.

  • num_urls_to_check_per_jurisdiction (int, default 5) – Number of unique Google search result URLs to check for each jurisdiction when attempting to locate ordinance documents. By default, 5.

  • max_num_concurrent_browsers (int, default 10) – Maximum number of browser instances to launch concurrently for retrieving information from the web. Increasing this value too much may lead to timeouts or performance issues on machines with limited resources. By default, 10.

  • max_num_concurrent_website_searches (int, default 10) – Maximum number of website searches allowed to run simultaneously. Increasing this value can speed up searches, but may lead to timeouts or performance issues on machines with limited resources. By default, 10.

  • max_num_concurrent_jurisdictions (int, default 25) – Maximum number of jurisdictions to process concurrently. Limiting this can help manage memory usage when dealing with a large number of documents. By default, 25.

  • url_ignore_substrings (list of str, optional) –

    A list of substrings that, if found in any URL, will cause the URL to be excluded from consideration. This can be used to specify particular websites or entire domains to ignore. For example:

    url_ignore_substrings = [
        "wikipedia",
        "nlr.gov",
        "www.co.delaware.in.us/documents/1649699794_0382.pdf",
    ]
    

    The above configuration would ignore all Wikipedia articles, all websites on the NLR domain, and the specific file located at www.co.delaware.in.us/documents/1649699794_0382.pdf. The blacklisted domains from NatLabRockies/COMPASS are always added to this input, so you will need to whitelist (via url_keep_substrings) any domains in that list that you want to allow. By default, None.

  • url_keep_substrings (list of str, optional) –

    A list of substrings that, if found in any URL, will cause the URL to be kept (regardless of the default blacklist or the url_ignore_substrings input) in search results. For example:

    url_keep_substrings = [
        "my_ordinance_collection.edu",
    ]
    

    The above configuration would keep all URL results from “my_ordinance_collection.edu” even though .edu URLs are blacklisted by default. By default, None.

  • known_local_docs (dict or path-like, optional) – A dictionary where keys are the jurisdiction codes (as strings) and values are lists of dictionaries containing information about each local document. Each document dictionary should contain at least the key "source_fp" pointing to the full local document path. Additional keys are copied onto the loaded document as attributes. This input can also be a path to a JSON file containing the same mapping. By default, None.

  • known_doc_urls (dict or path-like, optional) – A dictionary where keys are the jurisdiction codes (as strings) and values are lists of dictionaries containing information about each known URL to check. Each document dictionary should contain at least the key "source" representing the known document URL. Additional keys are copied onto the loaded document as attributes. This input can also be a path to a JSON file containing the same mapping. By default, None.

  • file_loader_kwargs (dict, optional) – Dictionary of keyword argument pairs to initialize elm.web.file_loader.AsyncWebFileLoader. If found, the "pw_launch_kwargs" key in these will also be used to initialize the Playwright-backed Google search used for search engine retrieval. By default, None.

  • search_engines (list, optional) – A list of dictionaries describing the search engine classes and keyword arguments to use for search engine retrieval. If None, the default search engine configurations and fallback order are used. By default, None.

  • simple_se_result_sort (bool, default True) – Flag indicating whether to use a simple top-n sort from the first search engine that gives results (True) or to apply a holistic link sorting based on all results from all search engines (False). By default, True.

  • pytesseract_exe_fp (path-like, optional) – Path to the pytesseract executable. If specified, OCR will be used to extract text from scanned PDFs using Google’s Tesseract. By default, None.

  • td_kwargs (dict, optional) – Additional keyword arguments to pass to tempfile.TemporaryDirectory. The temporary directory is used to store documents which have not yet been confirmed to contain relevant information. By default, None.

  • tpe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ThreadPoolExecutor, used for I/O-bound tasks such as logging and file writes. By default, None.

  • ppe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ProcessPoolExecutor, used for CPU-bound tasks such as PDF loading and parsing. By default, None.

  • log_dir (path-like, optional) – Path to the directory for storing log files. If not provided, a logs subdirectory will be created inside out_dir. By default, None.

  • clean_dir (path-like, optional) – Path to the directory for storing cleaned ordinance text output. If not provided, a cleaned_text subdirectory will be created inside out_dir. By default, None.

  • ordinance_file_dir (path-like, optional) – Path to the directory where downloaded ordinance files (PDFs or HTML) for each jurisdiction are stored. If not provided, an ordinance_files subdirectory will be created inside out_dir. By default, None.

  • jurisdiction_dbs_dir (path-like, optional) – Path to the directory where parsed ordinance database files are stored for each jurisdiction. If not provided, a jurisdiction_dbs subdirectory will be created inside out_dir. By default, None.

  • perform_se_search (bool, default True) – Option to perform a search engine-based search for ordinance documents. This is the standard way to collect ordinance documents, and it is recommended to leave this set to True unless you are re-processing local documents. If True, the search engine approach is used to locate ordinance documents before falling back to a website crawl-based search (if that has been selected). By default, True.

  • perform_website_search (bool, default True) – Option to fall back to a jurisdiction website crawl-based search for ordinance documents if the search engine approach fails to recover any relevant documents. By default, True.

  • make_paths_relative (bool, default False) – Option to make all file paths in the saved collection manifest relative to the output directory. This can be helpful for sharing the manifest or for ensuring that it can be loaded correctly on a different machine. If False, absolute paths are used in the manifest. By default, False.

  • log_level (str, default "INFO") – Logging level for ordinance scraping and parsing (e.g., “TRACE”, “DEBUG”, “INFO”, “WARNING”, or “ERROR”). By default, "INFO".

  • keep_async_logs (bool, default False) – Option to store the full asynchronous log record to a file. This is only useful if you intend to monitor overall processing progress from a file instead of from the terminal. If True, all of the unordered records are written to an “all.log” file in the log_dir directory. By default, False.

  • collection_manifest_fp (path-like, optional) – Path to the JSON collection manifest created by the document collection step. The manifest must contain the persisted document information needed to reload each collected document for extraction. Only needed if running in extraction mode with a separate collection step. By default, None.
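
The llm_costs convention described above (dollars per million tokens, split into prompt and response rates) implies a simple calculation. The following is a minimal sketch of that arithmetic only; the function name estimate_llm_cost and the usage dictionary layout are hypothetical and are not part of the COMPASS API. As noted for llm_costs, cached tokens are not accounted for, so the result is an estimate.

```python
TOKENS_PER_MILLION = 1_000_000


def estimate_llm_cost(llm_costs, usage):
    """Estimate the total cost in dollars for a run.

    llm_costs uses the structure documented for the llm_costs parameter:
        {"model_name": {"prompt": float, "response": float}}
    usage is a hypothetical token tally with the same shape:
        {"model_name": {"prompt": int, "response": int}}
    """
    total = 0.0
    for model_name, tokens in usage.items():
        rates = llm_costs.get(model_name)
        if rates is None:
            # No cost registered for this model, so it cannot be priced.
            continue
        total += tokens.get("prompt", 0) / TOKENS_PER_MILLION * rates["prompt"]
        total += (
            tokens.get("response", 0) / TOKENS_PER_MILLION * rates["response"]
        )
    return total


# Rates taken from the llm_costs example; token counts are made up.
llm_costs = {"my_gpt": {"prompt": 1.5, "response": 3}}
usage = {"my_gpt": {"prompt": 2_000_000, "response": 500_000}}

estimated = estimate_llm_cost(llm_costs, usage)
print(f"Estimated cost: ${estimated:.2f}")
```

With these made-up token counts, the estimate is 2 x $1.5 for the prompt tokens plus 0.5 x $3 for the response tokens.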

Methods

Attributes

MODE

COMPASSRunMode associated with this request type

models

Mapping of LLM task to OpenAIConfig for this request

MODE = None

COMPASSRunMode associated with this request type

property models

Mapping of LLM task to OpenAIConfig for this request

Type:

dict
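
To illustrate the task-to-configuration mapping that models exposes, here is a minimal sketch of how the list form of the model parameter could be turned into such a mapping. It is not the library's implementation: plain dictionaries stand in for OpenAIConfig instances, and the helper names build_task_mapping and config_for_task are hypothetical. The only rules taken from the documentation are that tasks may be a string or a list of strings, that exactly one entry must include "default", and that the default is used when no specific task matches.

```python
def build_task_mapping(model):
    """Map each task name to a config dict (stand-in for OpenAIConfig)."""
    if isinstance(model, str):
        # A bare string is treated as the name of the default model.
        return {"default": {"model": model}}

    mapping = {}
    default_count = 0
    for entry in model:
        tasks = entry["tasks"]
        if isinstance(tasks, str):
            tasks = [tasks]
        if "default" in tasks:
            default_count += 1
        # Everything except "tasks" is treated as configuration arguments.
        config = {key: val for key, val in entry.items() if key != "tasks"}
        for task in tasks:
            # Assumption: if a task is listed twice, the later entry wins.
            mapping[task] = config

    if default_count != 1:
        msg = "Exactly one model entry must include 'default' as a task"
        raise ValueError(msg)
    return mapping


def config_for_task(mapping, task):
    """Return the config for a task, falling back to the default."""
    return mapping.get(task, mapping["default"])


model = [
    {
        "model": "gpt-4o-mini",
        "llm_call_kwargs": {"temperature": 0, "timeout": 300},
        "tasks": ["default", "date_extraction"],
    },
    {
        "model": "gpt-4o",
        "client_type": "openai",
        "tasks": ["ordinance_text_extraction"],
    },
]

task_to_config = build_task_mapping(model)
print(config_for_task(task_to_config, "ordinance_text_extraction")["model"])
print(config_for_task(task_to_config, "some_unlisted_task")["model"])
```

The second lookup uses a task name that is not configured, so it falls back to the entry that lists "default".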