compass.plugin.base.BaseExtractionPlugin

class BaseExtractionPlugin(jurisdiction, model_configs, usage_tracker=None)

Bases: ABC

Base class for COMPASS extraction plugins

This class offers the greatest extraction flexibility, but implementers must define most of the functionality themselves.

Parameters:
  • jurisdiction (Jurisdiction) – Jurisdiction for which extraction is being performed.

  • model_configs (dict) – Dictionary where keys are LLMTasks and values are LLMConfig instances to be used for those tasks.

  • usage_tracker (UsageTracker, optional) – Usage tracker instance that can be used to record the LLM call cost. By default, None.

Methods

filter_docs(extraction_context[, ...])

Filter down candidate documents before parsing

get_heuristic()

Get a BaseHeuristic instance with a check() method

get_query_templates()

Get a list of search engine query templates for extraction

get_website_keywords()

Get a dict of website search keyword scores

parse_docs_for_structured_data(...)

Parse documents to extract structured data/information

record_usage()

Persist usage tracking data when a tracker is available

save_structured_data(doc_infos, out_dir)

Write combined extracted structured data to disk

Attributes

IDENTIFIER

Identifier for extraction task (e.g. "water rights").

JURISDICTION_DATA_FP

Path to jurisdiction CSV

JURISDICTION_DATA_FP = None

Path to jurisdiction CSV

If provided, this CSV will extend the known jurisdictions (by default, US states, counties, and townships). This CSV must have the following columns:

  • State: The state in which the jurisdiction is located (e.g. “Texas”)

  • County: The county in which the jurisdiction is located (e.g. “Travis”). This can be left blank if the jurisdiction is not associated with a county.

  • Subdivision: The name of the subdivision of the county in which the jurisdiction is located. Use this input for jurisdictions that do not map to counties/townships (e.g. water conservation districts, resource management plan areas, etc.). This can be left blank if the jurisdiction does not have the notion of a “subdivision”.

  • Jurisdiction Type: The type of jurisdiction (e.g. “county”, “township”, “city”, “special district”, “RMP”, etc.).

  • FIPS: The code to be used for the jurisdiction, if applicable (e.g. “48453” for Travis County, Texas, “22” for the Culberson County Groundwater Conservation District, etc.). This can be left blank if the jurisdiction does not have an applicable code.

  • Website: The official website for the jurisdiction, if applicable (e.g. “https://www.traviscountytx.gov/”). This can be left blank if the jurisdiction does not have an official website or if the website is not known.

Type:

path-like
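
As a rough illustration of the column layout described above, the following sketch writes a one-row jurisdiction CSV with Python's standard csv module. The helper write_jurisdiction_csv and any output file name are hypothetical and not part of COMPASS; the row values are the examples given in the column descriptions.

```python
import csv
from pathlib import Path

# Column names required for the jurisdiction CSV, as listed above
COLUMNS = [
    "State",
    "County",
    "Subdivision",
    "Jurisdiction Type",
    "FIPS",
    "Website",
]


def write_jurisdiction_csv(path, rows):
    """Write jurisdiction rows (dicts keyed by COLUMNS) to a CSV file."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Example row; Subdivision is left blank because it does not apply here
rows = [
    {
        "State": "Texas",
        "County": "Travis",
        "Subdivision": "",
        "Jurisdiction Type": "county",
        "FIPS": "48453",
        "Website": "https://www.traviscountytx.gov/",
    }
]
```

Calling write_jurisdiction_csv("jurisdictions.csv", rows) would produce a file whose path a plugin subclass could assign to JURISDICTION_DATA_FP.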

abstract property IDENTIFIER

Identifier for extraction task (e.g. “water rights”)

Type:

str

abstractmethod async get_query_templates()

Get a list of search engine query templates for extraction

Query templates can contain the placeholder {jurisdiction} which will be replaced with the full jurisdiction name during the search engine query.
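
A minimal sketch of how the {jurisdiction} placeholder might be filled in. The template strings and the fill_templates helper are hypothetical, and the use of str.format is an assumption; COMPASS may perform the substitution differently.

```python
# Hypothetical templates; only the {jurisdiction} placeholder comes from the docs
QUERY_TEMPLATES = [
    "{jurisdiction} water rights ordinance",
    "filetype:pdf {jurisdiction} groundwater rules",
]


def fill_templates(templates, jurisdiction_name):
    """Substitute the full jurisdiction name into each query template."""
    return [t.format(jurisdiction=jurisdiction_name) for t in templates]


queries = fill_templates(QUERY_TEMPLATES, "Travis County, Texas")
print(queries[0])  # Travis County, Texas water rights ordinance
```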

abstractmethod async get_website_keywords()

Get a dict of website search keyword scores

The dictionary maps keywords to scores that indicate which links should be prioritized when scraping a website for a document.
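
To make the idea of keyword scores concrete, here is a small sketch that ranks links by summing the scores of the keywords found in each URL. The keyword values, the score_link helper, and the rule that larger scores mean higher priority are all assumptions made for illustration; the actual prioritization logic lives in COMPASS and may differ.

```python
# Hypothetical keyword scores; real values depend on the extraction task
WEBSITE_KEYWORDS = {"ordinance": 10, "zoning": 5, "pdf": 2}


def score_link(url, keyword_scores):
    """Sum the scores of all keywords found in the lowercased URL."""
    url = url.lower()
    return sum(score for kw, score in keyword_scores.items() if kw in url)


links = [
    "https://example.com/contact",
    "https://example.com/docs/zoning-ordinance.pdf",
    "https://example.com/zoning",
]

# Highest-scoring links first (assumes larger scores mean higher priority)
ranked = sorted(
    links, key=lambda u: score_link(u, WEBSITE_KEYWORDS), reverse=True
)
```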

abstractmethod async get_heuristic()

Get a BaseHeuristic instance with a check() method

The check() method should accept a string of text and return True if the text passes the heuristic check and False otherwise.
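
The check() contract described above can be sketched with a simple keyword test. KeywordHeuristic is a hypothetical stand-in that does not subclass BaseHeuristic, so that the example stays self-contained; a real plugin would return a BaseHeuristic instance, whose interface beyond check() is not covered here.

```python
class KeywordHeuristic:
    """Stand-in heuristic: passes text that mentions any given keyword."""

    def __init__(self, keywords):
        self.keywords = [kw.lower() for kw in keywords]

    def check(self, text):
        """Return True if the text passes the heuristic, False otherwise."""
        text = text.lower()
        return any(kw in text for kw in self.keywords)


heuristic = KeywordHeuristic(["water right", "groundwater"])
passes = heuristic.check("Groundwater permits are required.")
fails = heuristic.check("Parking rules for downtown.")
```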

abstractmethod async filter_docs(extraction_context, need_jurisdiction_verification=True)

Filter down candidate documents before parsing

Parameters:
  • extraction_context (ExtractionContext) – Context containing candidate documents to be filtered. Set the .documents attribute of this object to be the iterable of documents that should be kept for parsing.

  • need_jurisdiction_verification (bool, optional) – Whether to verify that documents pertain to the correct jurisdiction. By default, True.

Returns:

ExtractionContext – Context with filtered down documents.
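
A minimal sketch of the filtering contract, written as a standalone coroutine rather than a method. It uses types.SimpleNamespace as a stand-in for ExtractionContext and plain strings as stand-ins for documents; both are assumptions made to keep the example self-contained, and the keyword test is purely illustrative.

```python
import asyncio
from types import SimpleNamespace


async def filter_docs(extraction_context, need_jurisdiction_verification=True):
    """Keep only documents that mention a task keyword."""
    # The flag mirrors the documented signature; this sketch does not use it
    kept = [
        doc
        for doc in extraction_context.documents
        if "groundwater" in doc.lower()
    ]
    # Per the docs, the kept documents are assigned back to .documents
    extraction_context.documents = kept
    return extraction_context


# Stand-in context with two string "documents"
context = SimpleNamespace(
    documents=["Groundwater rules for 2024", "Parking map"]
)
filtered = asyncio.run(filter_docs(context))
```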

abstractmethod async parse_docs_for_structured_data(extraction_context)

Parse documents to extract structured data/information

Parameters:

extraction_context (ExtractionContext) – Context containing candidate documents to parse.

Returns:

ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.
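
The return contract (a context with data in .attrs, or None) might look like the sketch below. The "key: value" line parsing, the "structured_data" key name, and the SimpleNamespace stand-in for ExtractionContext are all hypothetical; a real plugin would typically run LLM-based or rule-based extraction here.

```python
import asyncio
from types import SimpleNamespace


async def parse_docs_for_structured_data(extraction_context):
    """Store extracted rows in .attrs, or return None if nothing was found."""
    rows = []
    for doc in extraction_context.documents:
        # Hypothetical "extraction": keep "key: value" lines as records
        for line in doc.splitlines():
            key, sep, value = line.partition(":")
            if sep and value.strip():
                rows.append({"feature": key.strip(), "value": value.strip()})
    if not rows:
        return None
    extraction_context.attrs["structured_data"] = rows  # hypothetical key
    return extraction_context


context = SimpleNamespace(documents=["setback: 500 ft\nno data here"], attrs={})
parsed = asyncio.run(parse_docs_for_structured_data(context))

empty_context = SimpleNamespace(documents=["n/a"], attrs={})
empty = asyncio.run(parse_docs_for_structured_data(empty_context))
```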

abstractmethod classmethod save_structured_data(doc_infos, out_dir)

Write combined extracted structured data to disk

Parameters:
  • doc_infos (list of dict) –

    List of dictionaries containing the following keys:

    • “jurisdiction”: An initialized Jurisdiction object representing the jurisdiction that was extracted.

    • “ord_db_fp”: A path to the extracted structured data stored on disk, or None if no data was extracted.

  • out_dir (path-like) – Path to the output directory for the data.

Returns:

int – Number of jurisdictions for which data was successfully extracted.
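
The sketch below illustrates the counting behavior described above: entries whose "ord_db_fp" is None are skipped, and the number of remaining jurisdictions is returned. It is written as a plain function rather than a classmethod, and it only writes an index file named summary.csv (a hypothetical name) instead of combining the extracted data itself, so treat it as an outline of the contract rather than a working implementation.

```python
import csv
from pathlib import Path


def save_structured_data(doc_infos, out_dir):
    """Write one index row per jurisdiction with data; return the count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Skip jurisdictions for which no data was extracted
    extracted = [
        info for info in doc_infos if info.get("ord_db_fp") is not None
    ]

    with (out_dir / "summary.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["jurisdiction", "ord_db_fp"])
        for info in extracted:
            # str() stands in for however a Jurisdiction object is serialized
            writer.writerow([str(info["jurisdiction"]), str(info["ord_db_fp"])])

    return len(extracted)
```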

async record_usage()

Persist usage tracking data when a tracker is available