compass.plugin.one_shot.base.create_schema_based_one_shot_extraction_plugin

create_schema_based_one_shot_extraction_plugin(config, tech)

Create a one-shot extraction plugin based on a configuration.

Parameters:

config (dict or path-like) –
One-shot configuration dictionary. If not a dictionary, should be a path to a file containing the configuration (supported formats: JSON, JSON5, YAML, TOML). See the wind ordinance schema for an example. The configuration must include the following keys:
- schema: A dictionary representing the schema of the output. Can also be a path to a file that contains the schema (supported formats: JSON, JSON5, YAML, TOML). See the wind ordinance schema for an example.
The configuration can also include the following optional keys:
- data_type_short_desc: Short description of the type of data being extracted with this plugin, phrased like "wind energy ordinance", "solar energy ordinance", or "water rights". This is used to enhance the prompts for the structured data extraction.
- query_templates: A list of search engine query templates for document retrieval. Templates should include {jurisdiction} as a placeholder for the jurisdiction that is being processed. If not provided, the LLM will be used to generate search engine queries based on the schema input.
- website_keywords: A dictionary mapping keywords to scores for filtering websites during document retrieval. If not provided, the LLM will be used to generate website keywords based on the schema input.
- heuristic_keywords: A dictionary containing the keyword lists used by the heuristic document filter. The dictionary must include not_tech_words, good_tech_keywords, good_tech_acronyms, and good_tech_phrases keys. Alternatively, this input can simply be True, in which case the LLM will be used to generate heuristic keyword lists based on the schema input. If False, None, or not provided, a NoOp heuristic that always returns True will be used (not recommended if doing website crawling).
- collection_prompts: A list of prompts to use for collecting relevant text from documents. Alternatively, this input can simply be True, in which case the LLM will be used to generate the collection prompts. If False, None, or not provided, the entire document text will be used for extraction (no text collection).
- text_extraction_prompts: A list of prompts to use for consolidating and extracting relevant text from the documents. Alternatively, this input can simply be True, in which case the LLM will be used to generate the text extraction prompts. If False, None, or not provided, the entire document text will be used for extraction (no text consolidation).
- cache_llm_generated_content: Boolean flag indicating whether to cache generated query templates and website keywords for future use. By default, True. Caching is recommended because generating query templates and website keywords can be costly. If you are iterating on the configuration and want schema changes reflected immediately in the generated query templates and website keywords, set this flag to False until you have finalized the schema.
- extraction_system_prompt: Custom system prompt to use for the structured data extraction step. If not provided, a default prompt will be used that instructs the LLM to extract structured data from the given document(s). Provide a custom system prompt to give the LLM more specific instructions for this step.
- doc_selection_method: String defining the multi-doc selection option. Specifically, if multiple documents pass the filter, this method determines how the documents are submitted to the extraction context. Allowed options are:
  - "single doc": Use the first document that returns some extracted data as the source document for the extraction context.
  - "multi doc context": Submit text from multiple documents to the extraction context simultaneously.
  - "multi doc all": Each document is extracted separately and the results are concatenated. This may give duplicated feature results if the same feature is mentioned in multiple documents.
  - "multi doc mixed": Each document is extracted separately and the results are merged together at the end. In this approach, each feature is reported at most once.
  By default, "single doc".
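The difference between "multi doc all" and "multi doc mixed" can be illustrated with a toy sketch. This is not the library's implementation: the row layout (the feature and value keys), the helper names, and the rule that the first document to report a feature wins are all assumptions made for illustration, since the text above only says that results are concatenated or merged.

```python
# Hypothetical per-document extraction results. The "feature"/"value" keys
# are illustrative only and are not taken from the COMPASS output format.
doc_results = [
    [
        {"feature": "setback", "value": "500 ft"},
        {"feature": "noise", "value": "50 dBA"},
    ],
    [
        {"feature": "setback", "value": "1.1x tip height"},
    ],
]


def combine_all(results_per_doc):
    """Concatenate rows from every document; duplicate features can appear."""
    return [row for rows in results_per_doc for row in rows]


def combine_mixed(results_per_doc):
    """Merge rows so that each feature is reported at most once.

    Assumption: the first document that reports a feature wins. The real
    merge rule is not described in the documentation above.
    """
    merged = {}
    for rows in results_per_doc:
        for row in rows:
            merged.setdefault(row["feature"], row)
    return list(merged.values())


all_rows = combine_all(doc_results)      # "setback" appears twice
mixed_rows = combine_mixed(doc_results)  # "setback" appears once
```

In this sketch, the concatenated result keeps both setback rows, while the merged result keeps one row per feature.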
tech (str) – Technology identifier to use for the plugin (e.g., “wind”, “solar”). Must differ from the identifiers of all existing plugins.
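A configuration dictionary using the keys described above might look like the following sketch. The schema contents, the keyword scores, and the check_config helper are hypothetical and serve only to show the shape of the input. The real schema should follow the wind ordinance schema referenced above. Filling the {jurisdiction} placeholder with str.format is likewise an assumption about how templates are expanded.

```python
# Keys that a dictionary-valued "heuristic_keywords" must contain (per the docs).
HEURISTIC_KEYS = {
    "not_tech_words",
    "good_tech_keywords",
    "good_tech_acronyms",
    "good_tech_phrases",
}
DOC_SELECTION_METHODS = {
    "single doc",
    "multi doc context",
    "multi doc all",
    "multi doc mixed",
}


def check_config(config):
    """Hypothetical helper: return a list of problems found in a config dict."""
    problems = []
    if "schema" not in config:
        problems.append("missing required key: schema")
    for template in config.get("query_templates") or []:
        if "{jurisdiction}" not in template:
            problems.append(f"query template lacks {{jurisdiction}}: {template!r}")
    heuristic = config.get("heuristic_keywords")
    if isinstance(heuristic, dict):
        missing = sorted(HEURISTIC_KEYS - set(heuristic))
        if missing:
            problems.append(f"heuristic_keywords is missing: {missing}")
    method = config.get("doc_selection_method", "single doc")
    if method not in DOC_SELECTION_METHODS:
        problems.append(f"unknown doc_selection_method: {method!r}")
    return problems


config = {
    # Placeholder schema; the actual schema format is an assumption here.
    "schema": {"type": "object", "properties": {"setback": {"type": "string"}}},
    "data_type_short_desc": "wind energy ordinance",
    "query_templates": ["{jurisdiction} wind energy ordinance"],
    "website_keywords": {"ordinance": 10, "zoning": 5},  # illustrative scores
    "heuristic_keywords": True,       # let the LLM generate the keyword lists
    "collection_prompts": True,       # let the LLM generate collection prompts
    "text_extraction_prompts": True,  # let the LLM generate extraction prompts
    "cache_llm_generated_content": False,  # useful while iterating on the schema
    "doc_selection_method": "single doc",
}

problems = check_config(config)
# Assumed expansion of the placeholder for one (made-up) jurisdiction.
example_query = config["query_templates"][0].format(
    jurisdiction="Example County, Colorado"
)
```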

Returns:

callable – A SchemaBasedExtractionPlugin subclass configured according to the input configuration.
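A minimal usage sketch follows, assuming the COMPASS package is installed and importable under the module path shown in the page title. The write_config and build_plugin helpers, the file name, the placeholder schema, and the "my_custom_tech" identifier are hypothetical. The import is deferred into build_plugin so that the rest of the sketch loads without the package.

```python
import json
import tempfile
from pathlib import Path


def write_config(config, directory):
    """Hypothetical helper: save a config dict as JSON and return the file path."""
    path = Path(directory) / "one_shot_config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def build_plugin(config_or_path, tech):
    """Create the plugin class from a config dict or a path to a config file."""
    # Deferred import: requires the COMPASS package to be installed.
    from compass.plugin.one_shot.base import (
        create_schema_based_one_shot_extraction_plugin,
    )

    return create_schema_based_one_shot_extraction_plugin(config_or_path, tech)


# Placeholder schema contents; the real format should follow the documented example.
minimal_config = {"schema": {"type": "object", "properties": {}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = write_config(minimal_config, tmp_dir)
    round_tripped = json.loads(config_path.read_text())
    # With COMPASS installed, either form of the first argument could be passed:
    # plugin_cls = build_plugin(minimal_config, "my_custom_tech")
    # plugin_cls = build_plugin(config_path, "my_custom_tech")
```

Per the Returns description, the result is a SchemaBasedExtractionPlugin subclass (a class), not an instance.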