compass.extraction.wind.plugin.COMPASSWindExtractor#

class COMPASSWindExtractor(jurisdiction, model_configs, usage_tracker=None)#

Bases: OrdinanceExtractionPlugin

COMPASS wind extraction plugin

Parameters:

jurisdiction (Jurisdiction) – Jurisdiction for which extraction is being performed.
model_configs (dict) – Dictionary where keys are LLMTasks and values are LLMConfig instances to be used for those tasks.
usage_tracker (UsageTracker, optional) – Usage tracker instance that can be used to record the LLM call cost. By default, None.

Methods

`extract_ordinances_from_text`(doc, ...)	Extract structured data from input text
`extract_relevant_text`(doc, extractor_class, ...)	Condense text for extraction task
`filter_docs`(extraction_context[, ...])	Filter down candidate documents before parsing
`get_heuristic`()	Get a BaseHeuristic instance with a check() method
`get_query_templates`()	Get a list of search engine query templates for extraction
`get_structured_data_row_count`(data_df)	Get the number of data rows extracted from a document
`get_website_keywords`()	Get a dict of website search keyword scores
`parse_docs_for_structured_data`(...)	Parse documents to extract structured data/information
`parse_for_structured_data`(source)	Extract all possible structured data from a document
`parse_multi_doc_context_for_structured_data`(...)	Parse all documents to extract structured data/information
`parse_single_doc_for_structured_data`(...)	Parse documents one at a time to extract structured data
`post_filter_docs_hook`(extraction_context)	Post-process documents after running them through the filter
`pre_filter_docs_hook`(extraction_context)	Pre-process documents before running them through the filter
`record_usage`()	Persist usage tracking data when a tracker is available
`save_structured_data`(doc_infos, out_dir)	Write extracted structured data to disk

Attributes

`ALLOW_MULTI_DOC_EXTRACTION`	Whether to allow extraction over multiple documents
`IDENTIFIER`	Identifier for extraction task
`JURISDICTION_DATA_FP`	Path to jurisdiction CSV
`PARSERS`	Classes for parsing structured ordinance data from text
`QUERY_TEMPLATES`	List of search engine query templates for extraction
`TEXT_COLLECTORS`	Classes for collecting wind ordinance text chunks from docs
`TEXT_EXTRACTORS`	Classes for extracting cleaned ordinance text from collected text
`WEBSITE_KEYWORDS`	Dictionary of website search keyword scores
`consumer_producer_pairs`	Pairs of (consumer, producer) for IN/OUT validation
`producers`	All classes that produce attributes on the doc

IDENTIFIER = 'wind'#

Identifier for extraction task

Type:: str

QUERY_TEMPLATES = ['filetype:pdf {jurisdiction} wind energy conversion system ordinances', 'wind energy conversion system ordinances {jurisdiction}', '{jurisdiction} wind WECS ordinance', 'Where can I find the legal text for commercial wind energy conversion system zoning ordinances in {jurisdiction}?', 'What is the specific legal information regarding zoning ordinances for commercial wind energy conversion systems in {jurisdiction}?']#

List of search engine query templates for extraction

Type:: list

WEBSITE_KEYWORDS = {'area': 60, 'code': 60, 'department': 1, 'energy': 3, 'environment': 3, 'government': 180, 'land': 3, 'land development': 15, 'land%20development': 15, 'land+development': 15, 'municipal': 1, 'ordinance': 5760, 'pdf': 92160, 'plan': 360, 'planning': 720, 'renewable': 3, 'renewable energy': 1440, 'renewable%20energy': 1440, 'renewable+energy': 1440, 'wecs': 46080, 'wind': 23040, 'zoning': 11520}#

Dictionary of website search keyword scores

Keywords indicate links which should be prioritized when performing a website scrape for a wind ordinance document.

Type:: dict
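
As a rough illustration of how such scores could rank links, the sketch below sums the scores of every keyword found in a URL and sorts candidate links by that total. The `score_link` helper, the example URLs, and the summing rule are all assumptions made for illustration; the actual scraper may weigh keywords differently.

```python
# Subset of the WEBSITE_KEYWORDS mapping documented for this plugin.
KEYWORD_SCORES = {
    "ordinance": 5760,
    "pdf": 92160,
    "wecs": 46080,
    "wind": 23040,
    "zoning": 11520,
    "planning": 720,
}


def score_link(url, keyword_scores=KEYWORD_SCORES):
    """Hypothetical helper: sum the scores of all keywords found in the URL."""
    lowered = url.lower()
    return sum(score for kw, score in keyword_scores.items() if kw in lowered)


# Made-up candidate links, not taken from any real website.
links = [
    "https://example.gov/planning/contact",
    "https://example.gov/zoning/wind-ordinance.pdf",
    "https://example.gov/news",
]

# Highest-scoring links come first.
ranked = sorted(links, key=score_link, reverse=True)
```

Under this assumed rule, a link that mentions several high-value keywords (such as "wind", "zoning", and "pdf") would be visited well before a generic page.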

ALLOW_MULTI_DOC_EXTRACTION = False#

Whether to allow extraction over multiple documents

Type:: bool

HEURISTIC#

BaseHeuristic: Class with a check() method

alias of WindHeuristic

JURISDICTION_DATA_FP = None#

Path to jurisdiction CSV

If provided, this CSV will extend the known jurisdictions (by default, US states, counties, and townships). This CSV must have the following columns:

State: The state in which the jurisdiction is located (e.g. “Texas”)

County: The county in which the jurisdiction is located (e.g. “Travis”). This can be left blank if the jurisdiction is not associated with a county.

Subdivision: The name of the subdivision of the county in which the jurisdiction is located. Use this input for jurisdictions that do not map to counties/townships (e.g. water conservation districts, resource management plan areas, etc.). This can be left blank if the jurisdiction does not have the notion of a “subdivision”.

Jurisdiction Type: The type of jurisdiction (e.g. “county”, “township”, “city”, “special district”, “RMP”, etc.).

FIPS: The code to be used for the jurisdiction, if applicable (e.g. “48453” for Travis County, Texas, “22” for the Culberson County Groundwater Conservation District, etc.). This can be left blank if the jurisdiction does not have an applicable code.

Website: The official website for the jurisdiction, if applicable (e.g. “https://www.traviscountytx.gov/”). This can be left blank if the jurisdiction does not have an official website or if the website is not known.

Type:: path-like
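
The column layout described above can be sketched with the standard `csv` module. The single row reuses the Travis County example values from the column descriptions; the variable names are illustrative only.

```python
import csv
import io

# Required columns, in the order listed in the description above.
COLUMNS = [
    "State",
    "County",
    "Subdivision",
    "Jurisdiction Type",
    "FIPS",
    "Website",
]

rows = [
    {
        "State": "Texas",
        "County": "Travis",
        "Subdivision": "",  # left blank: no subdivision for a county
        "Jurisdiction Type": "county",
        "FIPS": "48453",
        "Website": "https://www.traviscountytx.gov/",
    },
]

# Build the CSV in memory; write csv_text to a file to use it as input.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```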

property consumer_producer_pairs#

Pairs of (consumer, producer) for IN/OUT validation

Type:: list

async extract_ordinances_from_text(doc, parser_class, model_config)#

Extract structured data from input text

The extracted structured data will be stored in the .attrs dictionary of the input document under the parser_class.OUT_LABEL key.

Parameters:

doc (elm.web.document.BaseDocument) – Document containing text to extract structured data from.
parser_class (BaseParser) – Class to use for structured data extraction.
model_config (LLMConfig) – Configuration for the LLM model to use for structured data extraction.

async extract_relevant_text(doc, extractor_class, model_config)#

Condense text for extraction task

This method takes a text extractor and applies it to the collected document chunks to get a concise version of the text that can be used for structured data extraction.

The extracted text will be stored in the .attrs dictionary of the input document under the extractor_class.OUT_LABEL key.

Parameters:

doc (elm.web.document.BaseDocument) – Document containing text chunks to condense.
extractor_class (BaseTextExtractor) – Class to use for text extraction.
model_config (LLMConfig) – Configuration for the LLM model to use for text extraction.
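
The pattern of storing results in `.attrs` under the class's `OUT_LABEL` key can be sketched as follows. `FakeDocument`, `FakeTextExtractor`, its `extract` method, and `store_relevant_text` are hypothetical stand-ins, not part of the library; a real extractor would call an LLM rather than filter lines.

```python
import asyncio


class FakeDocument:
    """Stand-in for a document; only the text and .attrs dict matter here."""

    def __init__(self, text):
        self.text = text
        self.attrs = {}


class FakeTextExtractor:
    """Hypothetical extractor; OUT_LABEL names the key its output is stored under."""

    OUT_LABEL = "cleaned_text"

    async def extract(self, text):
        # Placeholder for an LLM call: keep only lines that mention "wind".
        kept = [line for line in text.splitlines() if "wind" in line.lower()]
        return "\n".join(kept)


async def store_relevant_text(doc, extractor_class):
    """Run the extractor and store its output under extractor_class.OUT_LABEL."""
    extractor = extractor_class()
    doc.attrs[extractor_class.OUT_LABEL] = await extractor.extract(doc.text)
    return doc


doc = asyncio.run(
    store_relevant_text(
        FakeDocument("Wind setbacks: 500 ft\nParking rules"),
        FakeTextExtractor,
    )
)
```

Keying the output by `OUT_LABEL` lets several extractors write to the same document without overwriting one another, provided their labels differ.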

async filter_docs(extraction_context, need_jurisdiction_verification=True)#

Filter down candidate documents before parsing

Parameters:

extraction_context (ExtractionContext) – Context containing candidate documents to be filtered.
need_jurisdiction_verification (bool, optional) – Whether to verify that documents pertain to the correct jurisdiction. By default, True.

Returns:

Iterable of elm.web.document.BaseDocument – Filtered documents, or None if no documents remain.

async get_heuristic()#

Get a BaseHeuristic instance with a check() method

The check() method should accept a string of text and return True if the text passes the heuristic check and False otherwise.
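
A minimal object satisfying this `check()` contract might look like the sketch below. `KeywordHeuristic` and its keyword list are hypothetical and are not the library's `WindHeuristic`; they only show the expected shape of the interface.

```python
class KeywordHeuristic:
    """Hypothetical heuristic: passes text that mentions any keyword."""

    def __init__(self, keywords):
        # Normalize once so that check() is case-insensitive.
        self.keywords = [kw.lower() for kw in keywords]

    def check(self, text):
        """Return True if the text passes the heuristic, False otherwise."""
        lowered = text.lower()
        return any(kw in lowered for kw in self.keywords)


heuristic = KeywordHeuristic(["wind energy", "wecs", "wind turbine"])
```

A cheap check of this kind is typically used to discard clearly irrelevant text before spending LLM calls on it.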

async get_query_templates()#

Get a list of search engine query templates for extraction

Query templates can contain the placeholder {jurisdiction} which will be replaced with the full jurisdiction name during the search engine query.
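
The placeholder substitution can be illustrated with plain `str.format`. Whether the library performs the replacement this way, and the exact form of the full jurisdiction name, are assumptions here.

```python
# Two of the templates listed under QUERY_TEMPLATES.
templates = [
    "filetype:pdf {jurisdiction} wind energy conversion system ordinances",
    "{jurisdiction} wind WECS ordinance",
]

# Assumed format of a full jurisdiction name.
jurisdiction_name = "Travis County, Texas"

# Fill the {jurisdiction} placeholder in each template.
queries = [t.format(jurisdiction=jurisdiction_name) for t in templates]
```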

classmethod get_structured_data_row_count(data_df)#

Get the number of data rows extracted from a document

Parameters:: data_df (pandas.DataFrame or None) – DataFrame to check for extracted structured data.
Returns:: int – Number of data rows extracted from the document.

async get_website_keywords()#

Get a dict of website search keyword scores

Dictionary mapping keywords to scores that indicate links which should be prioritized when performing a website scrape for a document.

async parse_docs_for_structured_data(extraction_context)#

Parse documents to extract structured data/information

Parameters:: extraction_context (ExtractionContext) – Context containing candidate documents to parse.
Returns:: ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.

async parse_for_structured_data(source)#

Extract all possible structured data from a document

This method is called from the default implementation of parse_single_doc_for_structured_data() for each document that passed filtering. If you override parse_single_doc_for_structured_data(), you can ignore this method.

Parameters:: source (elm.web.document.BaseDocument or ExtractionContext) – Source to extract structured data from. Must have an .attrs attribute that contains text from which data should be extracted.
Returns:: pandas.DataFrame or None – DataFrame containing extracted structured data, or None if no structured data were extracted.

async parse_multi_doc_context_for_structured_data(extraction_context)#

Parse all documents to extract structured data/information

Parameters:: extraction_context (ExtractionContext) – Context containing candidate documents to parse. The text from all documents will be concatenated to create the context for the extraction.
Returns:: ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.

async parse_single_doc_for_structured_data(extraction_context)#

Parse documents one at a time to extract structured data

The first document to return some extracted data will be marked as the source and will be returned from this method.

Parameters:: extraction_context (ExtractionContext) – Context containing candidate documents to parse.
Returns:: ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.
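
The "first document with data wins" behavior described above can be sketched with stand-in objects. The `fake_parse` and `first_doc_with_data` functions and the dictionary-based documents are hypothetical; the real method works on an ExtractionContext and may differ in detail.

```python
import asyncio


async def fake_parse(doc):
    """Hypothetical parser: returns rows for docs mentioning 'setback', else None."""
    if "setback" in doc["text"]:
        return [{"feature": "setback", "source": doc["name"]}]
    return None


async def first_doc_with_data(docs, parse):
    """Parse docs in order and stop at the first one that yields data."""
    for doc in docs:
        data = await parse(doc)
        if data:  # skip None and empty results
            return doc, data
    return None


docs = [
    {"name": "agenda.pdf", "text": "meeting agenda"},
    {"name": "ordinance.pdf", "text": "wind turbine setback of 500 ft"},
    {"name": "amendment.pdf", "text": "revised setback distances"},
]

result = asyncio.run(first_doc_with_data(docs, fake_parse))
```

In this sketch the third document is never parsed, because the second one already produced data.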

async post_filter_docs_hook(extraction_context)#

Post-process documents after running them through the filter

Parameters:: extraction_context (ExtractionContext) – Context with documents that passed the filtering step.
Returns:: ExtractionContext – Context with documents to be passed onto the parsing step.

async pre_filter_docs_hook(extraction_context)#

Pre-process documents before running them through the filter

Parameters:: extraction_context (ExtractionContext) – Context with downloaded documents to process.
Returns:: ExtractionContext – Context with documents to be passed onto the filtering step.

property producers#

All classes that produce attributes on the doc

Type:: list

async record_usage()#

Persist usage tracking data when a tracker is available

classmethod save_structured_data(doc_infos, out_dir)#

Write extracted structured data to disk

Parameters:

doc_infos (list of dict) – List of dictionaries containing the following keys:
- “jurisdiction”: An initialized Jurisdiction object representing the jurisdiction that was extracted.
- “ord_db_fp”: A path to the extracted structured data stored on disk, or None if no data was extracted.
out_dir (path-like) – Path to the output directory for the data.

Returns:

int – Number of unique jurisdictions that information was found/written for.

TEXT_COLLECTORS = [<class 'compass.extraction.wind.ordinance.WindOrdinanceTextCollector'>, <class 'compass.extraction.wind.ordinance.WindPermittedUseDistrictsTextCollector'>]#

Classes for collecting wind ordinance text chunks from docs

TEXT_EXTRACTORS = [<class 'compass.extraction.wind.ordinance.WindOrdinanceTextExtractor'>, <class 'compass.extraction.wind.ordinance.WindPermittedUseDistrictsTextExtractor'>]#

Classes for extracting cleaned ordinance text from collected text

PARSERS = [<class 'compass.extraction.wind.parse.StructuredWindOrdinanceParser'>, <class 'compass.extraction.wind.parse.StructuredWindPermittedUseDistrictsParser'>]#

Classes for parsing structured ordinance data from text