compass.extraction.water.plugin.TexasWaterRightsExtractor

class TexasWaterRightsExtractor(jurisdiction, model_configs, usage_tracker=None)

Bases: BaseExtractionPlugin

COMPASS Texas water rights extraction plugin

Parameters:
  • jurisdiction (Jurisdiction) – Jurisdiction for which extraction is being performed.

  • model_configs (dict) – Dictionary where keys are LLMTasks and values are LLMConfig instances to be used for those tasks.

  • usage_tracker (UsageTracker, optional) – Usage tracker instance that can be used to record the LLM call cost. By default, None.

Methods

filter_docs(extraction_context[, ...])

Filter down candidate documents before parsing

get_heuristic()

Get a BaseHeuristic instance with a check() method

get_query_templates()

Get a list of search engine query templates for extraction

get_website_keywords()

Get a dict of website search keyword scores

parse_docs_for_structured_data(...)

Parse documents to extract structured data/information

record_usage()

Persist usage tracking data when a tracker is available

save_structured_data(doc_infos, out_dir)

Write extracted water rights data to disk

Attributes

IDENTIFIER

Identifier for extraction task

JURISDICTION_DATA_FP

Path to Texas GCW names

IDENTIFIER = 'tx water rights'

Identifier for extraction task

Type:

str

JURISDICTION_DATA_FP = PosixPath('/home/runner/work/COMPASS/COMPASS/compass/data/tx_water_districts.csv')

Path to Texas GCW names

Type:

path-like

async get_query_templates()

Get a list of search engine query templates for extraction

Query templates can contain the placeholder {jurisdiction} which will be replaced with the full jurisdiction name during the search engine query.
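The placeholder substitution described above can be sketched with plain string formatting. This is only an illustration: the template strings, the helper name fill_templates, and the jurisdiction name are hypothetical, and the plugin's actual templates are not shown on this page.

```python
# Hypothetical templates; the plugin's real templates are not shown here.
QUERY_TEMPLATES = [
    "{jurisdiction} groundwater rules",
    "{jurisdiction} water well permit requirements",
]


def fill_templates(templates, jurisdiction_name):
    """Substitute the full jurisdiction name into each template."""
    return [t.format(jurisdiction=jurisdiction_name) for t in templates]


# "Example Water District" is a made-up jurisdiction name.
queries = fill_templates(QUERY_TEMPLATES, "Example Water District")
for query in queries:
    print(query)
```

Each resulting string is what would be sent to the search engine in place of the template.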

async get_website_keywords()

Get a dict of website search keyword scores

Dictionary mapping keywords to scores that indicate links which should be prioritized when performing a website scrape for a document.
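One way such a keyword-to-score mapping might be used to rank links is sketched below. The keywords, scores, and the score_link helper are assumptions made for illustration; how COMPASS actually combines the scores during a website scrape is not documented here.

```python
# Hypothetical keyword scores; the values the plugin returns may differ.
KEYWORD_SCORES = {"rules": 10, "permit": 5, "well": 3}


def score_link(link_text, keyword_scores):
    """Sum the scores of every keyword that appears in the link text."""
    text = link_text.lower()
    return sum(score for kw, score in keyword_scores.items() if kw in text)


links = ["District Rules and Bylaws", "Contact Us", "Well Permit Application"]

# Higher-scoring links come first, so they would be visited first.
ranked = sorted(
    links, key=lambda link: score_link(link, KEYWORD_SCORES), reverse=True
)
print(ranked)
```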

async get_heuristic()

Get a BaseHeuristic instance with a check() method

The check() method should accept a string of text and return True if the text passes the heuristic check and False otherwise.
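The check() contract above can be illustrated with a small stand-in class. KeywordHeuristic is hypothetical and does not subclass the real BaseHeuristic, whose constructor and internals are not shown on this page; only the "string in, bool out" behaviour is taken from the description.

```python
class KeywordHeuristic:
    """Hypothetical stand-in that mimics the check() contract."""

    def __init__(self, keywords):
        # Normalise keywords once so that check() is case-insensitive.
        self.keywords = [kw.lower() for kw in keywords]

    def check(self, text):
        """Return True if any keyword appears in the text, else False."""
        text = text.lower()
        return any(kw in text for kw in self.keywords)


heuristic = KeywordHeuristic(["groundwater", "water well"])
print(heuristic.check("Rules for groundwater production"))
print(heuristic.check("Zoning ordinance for signage"))
```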

async filter_docs(extraction_context, need_jurisdiction_verification=True)

Filter down candidate documents before parsing

Parameters:
  • extraction_context (ExtractionContext) – Context containing candidate documents to be filtered. Set the .documents attribute of this object to be the iterable of documents that should be kept for parsing.

  • need_jurisdiction_verification (bool, optional) – Whether to verify that documents pertain to the correct jurisdiction. By default, True.

Returns:

ExtractionContext – Context with filtered down documents.
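The pattern of reassigning .documents and returning the same context can be sketched as follows. A SimpleNamespace stands in for ExtractionContext, plain strings stand in for documents, and the keep predicate is a hypothetical parameter; the plugin's real filtering logic (including jurisdiction verification) is not reproduced here.

```python
import asyncio
from types import SimpleNamespace


async def filter_docs_sketch(extraction_context, keep):
    """Keep only the documents for which keep(doc) is true."""
    extraction_context.documents = [
        doc for doc in extraction_context.documents if keep(doc)
    ]
    return extraction_context


# Stand-in context with plain strings as "documents".
context = SimpleNamespace(
    documents=[
        "District rules on groundwater wells",
        "Annual picnic announcement",
    ]
)

filtered = asyncio.run(
    filter_docs_sketch(context, lambda doc: "groundwater" in doc.lower())
)
print(filtered.documents)
```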

async parse_docs_for_structured_data(extraction_context)

Parse documents to extract structured data/information

Parameters:

extraction_context (ExtractionContext) – Context containing candidate documents to parse.

Returns:

ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.
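Because the return value may be None, callers need to handle both outcomes. The sketch below mimics that contract with a stand-in context object; the toy "extraction" rule and the "structured_data" key are assumptions for illustration, not the keys the plugin actually writes to .attrs.

```python
import asyncio
from types import SimpleNamespace


async def parse_sketch(extraction_context):
    """Return the context with data in .attrs, or None if nothing was found."""
    # Toy "extraction": keep documents that mention a permit.
    rows = [
        doc for doc in extraction_context.documents if "permit" in doc.lower()
    ]
    if not rows:
        return None
    # "structured_data" is a hypothetical key chosen for this sketch.
    extraction_context.attrs["structured_data"] = rows
    return extraction_context


with_data = SimpleNamespace(documents=["Well permit fee schedule"], attrs={})
without_data = SimpleNamespace(documents=["Board meeting minutes"], attrs={})

result = asyncio.run(parse_sketch(with_data))
empty = asyncio.run(parse_sketch(without_data))

for outcome in (result, empty):
    if outcome is None:
        print("no data extracted")
    else:
        print(outcome.attrs["structured_data"])
```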

classmethod save_structured_data(doc_infos, out_dir)

Write extracted water rights data to disk

Parameters:
  • doc_infos (list of dict) –

    List of dictionaries containing the following keys:

    • "jurisdiction": An initialized Jurisdiction object representing the jurisdiction that was extracted.

    • "ord_db_fp": A path to the extracted structured data stored on disk, or None if no data was extracted.

  • out_dir (path-like) – Path to the output directory for the data.

Returns:

int – Number of unique water rights districts for which information was found and written.
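The shape of doc_infos and the meaning of the returned count can be sketched as below. SimpleNamespace objects stand in for Jurisdiction instances, the full_name attribute and the file paths are hypothetical, and nothing is written to disk; the sketch only shows how entries with a None "ord_db_fp" would be excluded from a count of unique districts.

```python
from types import SimpleNamespace


def count_districts_with_data(doc_infos):
    """Count unique jurisdictions whose "ord_db_fp" is not None."""
    names = {
        info["jurisdiction"].full_name  # full_name is a hypothetical attribute
        for info in doc_infos
        if info.get("ord_db_fp") is not None
    }
    return len(names)


district_a = SimpleNamespace(full_name="Example District A")
district_b = SimpleNamespace(full_name="Example District B")

doc_infos = [
    {"jurisdiction": district_a, "ord_db_fp": "a_part1.csv"},
    {"jurisdiction": district_a, "ord_db_fp": "a_part2.csv"},
    {"jurisdiction": district_b, "ord_db_fp": None},
]

num_districts = count_districts_with_data(doc_infos)
print(num_districts)
```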

async record_usage()

Persist usage tracking data when a tracker is available