compass.plugin.one_shot.base.create_schema_based_one_shot_extraction_plugin

create_schema_based_one_shot_extraction_plugin(config, tech)

Create a one-shot extraction plugin based on a configuration.

Parameters:
  • config (dict or path-like) –

    One-shot configuration dictionary. If not a dictionary, should be a path to a file containing the configuration (supported formats: JSON, JSON5, YAML, TOML). See the wind ordinance schema for an example. The configuration must include the following keys:

    • schema: A dictionary representing the schema of the output. Can also be a path to a file that contains the schema (supported formats: JSON, JSON5, YAML, TOML). See the wind ordinance schema for an example.

    The configuration can also include the following optional keys:

    • data_type_short_desc: Short description of the type of data being extracted with this plugin, for example wind energy ordinance, solar energy ordinance, or water rights. This is used to enhance the prompts for the structured data extraction.

    • query_templates: A list of search engine query templates for document retrieval. Templates should include {jurisdiction} as a placeholder for the jurisdiction that is being processed. If not provided, the LLM will be used to generate search engine queries based on the schema input.

    • website_keywords: A dictionary mapping keywords to scores for filtering websites during document retrieval. If not provided, the LLM will be used to generate website keywords based on the schema input.

    • heuristic_keywords: A dictionary containing the keyword lists used by the heuristic document filter. The dictionary must include not_tech_words, good_tech_keywords, good_tech_acronyms, and good_tech_phrases keys. Alternatively, this input can simply be True, in which case the LLM will be used to generate heuristic keyword lists based on the schema input. If False, None, or not provided, a NoOp heuristic that always returns True will be used (not recommended if doing website crawling).

    • collection_prompts: A list of prompts to use for collecting relevant text from documents. Alternatively, this input can simply be True, in which case the LLM will be used to generate the collection prompts. If False, None, or not provided, the entire document text will be used for extraction (no text collection).

    • text_extraction_prompts: A list of prompts to use for consolidating and extracting relevant text from the documents. Alternatively, this input can simply be True, in which case the LLM will be used to generate the text extraction prompts. If False, None, or not provided, the entire document text will be used for extraction (no text consolidation).

    • cache_llm_generated_content: Boolean flag indicating whether to cache generated query templates and website keywords for future use. By default, True. Caching is recommended because generating query templates and website keywords can be costly. If you are iterating on the configuration and want schema changes to be reflected immediately in the generated query templates and website keywords, set this flag to False until you have finalized the schema.

    • extraction_system_prompt: Custom system prompt to use for the structured data extraction step. If not provided, a default prompt will be used that instructs the LLM to extract structured data from the given document(s). Provide a custom system prompt to give the LLM more specific instructions for this step.

    • allow_multi_doc_extraction: Boolean flag indicating whether to allow multiple documents to be used for the extraction context simultaneously. By default, False, which means the first document that returns some extracted data will be marked as the source.

  • tech (str) – Technology identifier to use for the plugin (e.g., “wind”, “solar”). Must differ from the identifiers of all existing plugins.
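
As a rough illustration of the keys described above, here is a minimal sketch of a configuration dictionary together with a small sanity check. The schema shape, keyword scores, query wording, and the check_config helper are all hypothetical and not part of COMPASS; str.format is used only to show what the {jurisdiction} placeholder looks like once filled, and the library may substitute it differently.

```python
# Hypothetical example configuration. The schema shape, keyword scores, and
# query wording are illustrative assumptions, not values from COMPASS.
REQUIRED_HEURISTIC_KEYS = {
    "not_tech_words",
    "good_tech_keywords",
    "good_tech_acronyms",
    "good_tech_phrases",
}

config = {
    # Required key; the schema layout here is purely illustrative.
    "schema": {
        "type": "object",
        "properties": {"setback_distance": {"type": "number"}},
    },
    "data_type_short_desc": "wind energy ordinance",
    "query_templates": [
        "{jurisdiction} wind energy ordinance",
        "{jurisdiction} wind turbine zoning regulations",
    ],
    "website_keywords": {"ordinance": 10, "zoning": 5},
    "heuristic_keywords": True,        # let the LLM generate keyword lists
    "collection_prompts": True,        # let the LLM generate these prompts
    "text_extraction_prompts": None,   # use the entire document text
    "cache_llm_generated_content": False,  # handy while iterating on schema
    "allow_multi_doc_extraction": False,
}


def check_config(config):
    """Return a list of problems found in a one-shot config (hypothetical)."""
    problems = []
    if "schema" not in config:
        problems.append("missing required key: schema")
    for template in config.get("query_templates") or []:
        if "{jurisdiction}" not in template:
            problems.append(
                f"query template lacks {{jurisdiction}}: {template!r}"
            )
    keywords = config.get("heuristic_keywords")
    if isinstance(keywords, dict):
        missing = REQUIRED_HEURISTIC_KEYS - keywords.keys()
        if missing:
            problems.append(f"heuristic_keywords missing: {sorted(missing)}")
    return problems


problems = check_config(config)

# Fill the placeholder for one jurisdiction (illustration only).
example_query = config["query_templates"][0].format(
    jurisdiction="Example County"
)

# With COMPASS installed, the plugin would then be created as documented
# above ("my_wind" is a made-up identifier chosen to avoid clashes):
# from compass.plugin.one_shot.base import (
#     create_schema_based_one_shot_extraction_plugin,
# )
# create_schema_based_one_shot_extraction_plugin(config, tech="my_wind")
```

Running a check like this before creating the plugin can catch a missing schema key or a template without the {jurisdiction} placeholder early, before any LLM calls are made.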