compass.plugin.interface.FilteredExtractionPlugin
- class FilteredExtractionPlugin(jurisdiction, model_configs, usage_tracker=None)
Bases: BaseExtractionPlugin
Base class for COMPASS extraction plugins
This class provides the standard COMPASS document filtering and text collection pipeline, allowing implementers to focus primarily on the structured data extraction step. Filtering and text collection are configured by subclassing the BaseTextCollector class and setting the TEXT_COLLECTORS property to a list of the desired text collectors.
Plugins can hook into various stages of the extraction pipeline to modify behavior, add custom processing, or integrate with external systems.
Subclasses should implement the desired hooks and override methods as needed.
- Parameters:
jurisdiction (Jurisdiction) – Jurisdiction for which extraction is being performed.
model_configs (dict) – Dictionary where keys are LLMTasks and values are LLMConfig instances to be used for those tasks.
usage_tracker (UsageTracker, optional) – Usage tracker instance that can be used to record the LLM call cost. By default, None.
Methods
extract_relevant_text(doc, extractor_class, ...) – Condense text for extraction task
filter_docs(extraction_context[, ...]) – Filter down candidate documents before parsing
get_heuristic() – Get a BaseHeuristic instance with a check() method
get_query_templates() – Get a list of search engine query templates for extraction
get_website_keywords() – Get a dict of website search keyword scores
parse_docs_for_structured_data(extraction_context) – Parse documents to extract structured data/information
post_filter_docs_hook(extraction_context) – Post-process documents after running them through the filter
pre_filter_docs_hook(extraction_context) – Pre-process documents before running them through the filter
record_usage() – Persist usage tracking data when a tracker is available
save_structured_data(doc_infos, out_dir) – Write extracted water rights data to disk
Attributes
HEURISTIC – Class with a check() method
Identifier for extraction task (e.g. "water rights").
JURISDICTION_DATA_FP – Path to jurisdiction CSV
QUERY_TEMPLATES – List of search engine query templates for extraction
TEXT_COLLECTORS – Classes to collect text
WEBSITE_KEYWORDS – List of keywords
- abstract property QUERY_TEMPLATES
List of search engine query templates for extraction
Query templates can contain the placeholder {jurisdiction}, which will be replaced with the full jurisdiction name during the search engine query.
- Type:
- abstract property WEBSITE_KEYWORDS
List of keywords
List of keywords that indicate links which should be prioritized when performing a website scrape for a document.
- Type:
- abstract property TEXT_COLLECTORS
Classes to collect text
Should be an iterable of one or more classes to collect text for the extraction task.
- Type:
- abstract property HEURISTIC
Class with a check() method
The check() method should accept a string of text and return True if the text passes the heuristic check and False otherwise.
- Type:
- classmethod save_structured_data(doc_infos, out_dir)
Write extracted water rights data to disk
- Parameters:
doc_infos – List of dictionaries containing the following keys:
"jurisdiction": An initialized Jurisdiction object representing the jurisdiction that was extracted.
"ord_db_fp": A path to the extracted structured data stored on disk, or None if no data was extracted.
out_dir (path-like) – Path to the output directory for the data.
- Returns:
int – Number of unique jurisdictions that information was found/written for.
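To make the shape of doc_infos and the return value concrete, here is a minimal sketch. It is not the COMPASS implementation: plain strings stand in for initialized Jurisdiction objects, the file paths are made up, and the `count_jurisdictions_with_data` helper is hypothetical. It only illustrates counting unique jurisdictions whose "ord_db_fp" is not None.

```python
# Hypothetical doc_infos; strings stand in for Jurisdiction objects.
doc_infos = [
    {"jurisdiction": "Travis County, Texas", "ord_db_fp": "out/travis_a.csv"},
    {"jurisdiction": "Travis County, Texas", "ord_db_fp": "out/travis_b.csv"},
    {"jurisdiction": "Example Township", "ord_db_fp": None},
]


def count_jurisdictions_with_data(infos):
    """Count unique jurisdictions that have extracted data on disk."""
    # Entries whose "ord_db_fp" is None had no data extracted.
    return len(
        {info["jurisdiction"] for info in infos if info["ord_db_fp"] is not None}
    )


num_found = count_jurisdictions_with_data(doc_infos)
print(num_found)
```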
- async pre_filter_docs_hook(extraction_context)
Pre-process documents before running them through the filter
- Parameters:
extraction_context (ExtractionContext) – Context with downloaded documents to process.
- Returns:
ExtractionContext – Context with documents to be passed on to the filtering step.
- async post_filter_docs_hook(extraction_context)
Post-process documents after running them through the filter
- Parameters:
extraction_context (ExtractionContext) – Context with documents that passed the filtering step.
- Returns:
ExtractionContext – Context with documents to be passed on to the parsing step.
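The two hooks bracket the filtering step. The sketch below shows that ordering with stand-in classes only: `FakeContext`, its `documents` attribute, `SketchPlugin`, and `run_filter_stage` are all hypothetical, and the way the filter output is placed back on the context is an assumption rather than the COMPASS pipeline's actual wiring.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeContext:
    """Hypothetical stand-in for ExtractionContext (attribute name assumed)."""

    documents: list = field(default_factory=list)


class SketchPlugin:
    """Stand-in plugin; it does not subclass the real COMPASS base class."""

    async def pre_filter_docs_hook(self, extraction_context):
        # Example pre-processing: drop empty documents before filtering.
        extraction_context.documents = [
            doc for doc in extraction_context.documents if doc.strip()
        ]
        return extraction_context

    async def filter_docs(self, extraction_context):
        # Placeholder filter: keep documents mentioning "water".
        kept = [d for d in extraction_context.documents if "water" in d.lower()]
        return kept or None  # None when no documents remain

    async def post_filter_docs_hook(self, extraction_context):
        # Example post-processing: normalize whitespace before parsing.
        extraction_context.documents = [
            " ".join(doc.split()) for doc in extraction_context.documents
        ]
        return extraction_context


async def run_filter_stage(plugin, extraction_context):
    """Run pre-hook, filter, then post-hook (assumed ordering and wiring)."""
    extraction_context = await plugin.pre_filter_docs_hook(extraction_context)
    docs = await plugin.filter_docs(extraction_context)
    if docs is None:
        return None
    extraction_context.documents = list(docs)
    return await plugin.post_filter_docs_hook(extraction_context)


context = FakeContext(documents=["", "Water   rights  ordinance", "Zoning map"])
result = asyncio.run(run_filter_stage(SketchPlugin(), context))
print(result.documents)
```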
- async extract_relevant_text(doc, extractor_class, model_config)
Condense text for extraction task
This method takes a text extractor and applies it to the collected document chunks to get a concise version of the text that can be used for structured data extraction.
The extracted text will be stored in the .attrs dictionary of the input document under the extractor_class.OUT_LABEL key.
- Parameters:
doc (elm.web.document.BaseDocument) – Document containing text chunks to condense.
extractor_class (BaseTextExtractor) – Class to use for text extraction.
model_config (LLMConfig) – Configuration for the LLM model to use for text extraction.
- async get_query_templates()
Get a list of search engine query templates for extraction
Query templates can contain the placeholder {jurisdiction}, which will be replaced with the full jurisdiction name during the search engine query.
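As a minimal sketch of how the {jurisdiction} placeholder might be filled, the snippet below uses plain `str.format`. The template strings and the `fill_templates` helper are hypothetical illustrations; the substitution COMPASS performs internally may differ.

```python
# Hypothetical templates; only the {jurisdiction} placeholder comes from the docs.
QUERY_TEMPLATES = [
    "{jurisdiction} water rights ordinance",
    "filetype:pdf {jurisdiction} groundwater rules",
]


def fill_templates(templates, jurisdiction_name):
    """Substitute the full jurisdiction name into each template."""
    return [tmpl.format(jurisdiction=jurisdiction_name) for tmpl in templates]


queries = fill_templates(QUERY_TEMPLATES, "Travis County, Texas")
for query in queries:
    print(query)
```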
- async get_website_keywords()
Get a dict of website search keyword scores
Dictionary mapping keywords to scores that indicate links which should be prioritized when performing a website scrape for a document.
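One way to picture the keyword-to-score mapping is as a simple additive ranking of link text. The sketch below is an assumption about how such scores could be used; the keywords, scores, and the `score_link` helper are hypothetical and do not reflect the actual COMPASS crawler logic.

```python
# Hypothetical keyword scores: a higher score marks a more promising link.
WEBSITE_KEYWORDS = {"ordinance": 10, "water": 5, "pdf": 2}


def score_link(link_text, keyword_scores):
    """Sum the scores of all keywords that appear in the link text."""
    text = link_text.lower()
    return sum(score for kw, score in keyword_scores.items() if kw in text)


links = ["Contact us", "Water Ordinance (PDF)", "Water department"]
ranked = sorted(
    links, key=lambda link: score_link(link, WEBSITE_KEYWORDS), reverse=True
)
print(ranked)
```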
- async get_heuristic()
Get a BaseHeuristic instance with a check() method
The check() method should accept a string of text and return True if the text passes the heuristic check and False otherwise.
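The check() contract described above can be illustrated with a small keyword-based class. This is only a sketch: `KeywordHeuristic` and its keyword list are hypothetical, and it deliberately does not subclass the real BaseHeuristic, whose required interface may include more than check().

```python
class KeywordHeuristic:
    """Hypothetical heuristic: passes text that mentions any keyword."""

    KEYWORDS = ("water right", "groundwater", "well permit")

    def check(self, text):
        """Return True if the text passes the heuristic, False otherwise."""
        lowered = text.lower()
        return any(keyword in lowered for keyword in self.KEYWORDS)


heuristic = KeywordHeuristic()
passes = heuristic.check("Groundwater well spacing requirements")
fails = heuristic.check("Parking regulations for downtown")
print(passes, fails)
```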
- async filter_docs(extraction_context, need_jurisdiction_verification=True)
Filter down candidate documents before parsing
- Parameters:
extraction_context (ExtractionContext) – Context containing candidate documents to be filtered.
need_jurisdiction_verification (bool, optional) – Whether to verify that documents pertain to the correct jurisdiction. By default, True.
- Returns:
Iterable of elm.web.document.BaseDocument – Filtered documents or None if no documents remain.
- JURISDICTION_DATA_FP = None
Path to jurisdiction CSV
If provided, this CSV will extend the known jurisdictions (by default, US states, counties, and townships). This CSV must have the following columns:
State: The state in which the jurisdiction is located (e.g. “Texas”)
County: The county in which the jurisdiction is located (e.g. “Travis”). This can be left blank if the jurisdiction is not associated with a county.
Subdivision: The name of the subdivision of the county in which the jurisdiction is located. Use this input for jurisdictions that do not map to counties/townships (e.g. water conservation districts, resource management plan areas, etc.). This can be left blank if the jurisdiction does not have the notion of a “subdivision”.
Jurisdiction Type: The type of jurisdiction (e.g. “county”, “township”, “city”, “special district”, “RMP”, etc.).
FIPS: The code to be used for the jurisdiction, if applicable (e.g. “48453” for Travis County, Texas, “22” for the Culberson County Groundwater Conservation District, etc.). This can be left blank if the jurisdiction does not have an applicable code.
Website: The official website for the jurisdiction, if applicable (e.g. “https://www.traviscountytx.gov/”). This can be left blank if the jurisdiction does not have an official website or if the website is not known.
- Type:
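The CSV columns described above can be sketched with the standard `csv` module. The first row reuses the Travis County example values from the column descriptions. In the second row only the FIPS value "22" and the district name come from the text; the county, state, and jurisdiction type shown for it are illustrative assumptions. The data is written to an in-memory buffer, so no file path is implied.

```python
import csv
import io

# Column names required for the jurisdiction CSV, as listed above.
COLUMNS = ["State", "County", "Subdivision", "Jurisdiction Type", "FIPS", "Website"]

rows = [
    {
        "State": "Texas",
        "County": "Travis",
        "Subdivision": "",  # blank: no subdivision for a county
        "Jurisdiction Type": "county",
        "FIPS": "48453",
        "Website": "https://www.traviscountytx.gov/",
    },
    {
        "State": "Texas",  # assumed for illustration
        "County": "Culberson",  # assumed for illustration
        "Subdivision": "Culberson County Groundwater Conservation District",
        "Jurisdiction Type": "special district",  # assumed for illustration
        "FIPS": "22",
        "Website": "",  # blank: website not known
    },
]

# Write the rows to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)

# Read the data back to confirm the header and values round-trip.
buffer.seek(0)
parsed = list(csv.DictReader(buffer))
print(buffer.getvalue())
```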
- abstractmethod async parse_docs_for_structured_data(extraction_context)
Parse documents to extract structured data/information
- Parameters:
extraction_context (ExtractionContext) – Context containing candidate documents to parse.
- Returns:
ExtractionContext or None – Context with extracted data/information stored in the .attrs dictionary, or None if no data was extracted.
- async record_usage()
Persist usage tracking data when a tracker is available