One-Shot Extraction

This example shows how to author a one-shot extraction schema and run it through COMPASS. The one-shot plugin uses your schema to extract structured data in a single LLM call.

Prerequisites

Be sure to go over the COMPASS Execution Basics to understand how to set up a run environment and model run configuration. Once your one-shot schema is in place, you will run the data extraction pipeline in the same way as described in that example.

Create Your Schema

To start off, you will need to create a one-shot JSON schema that describes the extraction output shape and embeds the extraction logic in schema field descriptions. The easiest way to do this is by copying wind_schema.json and adjusting it for your domain.

At a minimum, the schema must return an object with an outputs array, where each item is an extraction record with the required fields shown below:

{
    "type": "object",
    "required": ["outputs"],
    "properties": {
        "outputs": {
            "type": "array",
            "items": {
                "type": "object",
                "required": [
                    "feature", "value", "units", "section", "summary"
                ],
                "additionalProperties": false,
                "properties": {
                    ... // define each of the required fields from above
                }
            }
        }
    }
}

The main field here is feature, which is the ID of the extracted feature (e.g., a setback distance or a maximum allowed height). The other fields (value, units, section, and summary) are important for keeping the output consistent across various extractions and allowing a central database to keep track of the scraped data.

You will need to customize the schema for your particular use case by defining the allowed feature IDs (typically in an enum), as well as the rest of the required fields, and by encoding the extraction logic in the field descriptions. The schema is the core of the one-shot plugin, and the quality of the schema will directly impact the quality of the extracted data, so it’s worth spending some time to get it right!
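As a minimal sketch of what a filled-in properties block might look like, the Python snippet below builds the schema as a plain dictionary and serializes it to JSON. The feature IDs, the field types, and the description texts are hypothetical placeholders (they are not taken from wind_schema.json), and check_record is an illustrative helper, not part of COMPASS.

```python
import json

# Hypothetical feature IDs; replace with the features for your domain.
FEATURE_IDS = ["structure_setback", "max_height"]

# Schema for a single extraction record. Types and descriptions are
# assumptions; the descriptions are where the extraction logic would go.
record_schema = {
    "type": "object",
    "required": ["feature", "value", "units", "section", "summary"],
    "additionalProperties": False,
    "properties": {
        "feature": {
            "type": "string",
            "enum": FEATURE_IDS,
            "description": "ID of the extracted feature.",
        },
        "value": {
            "type": ["number", "string", "null"],
            "description": "Extracted value; null if the text gives none.",
        },
        "units": {
            "type": ["string", "null"],
            "description": "Units of the value, e.g. 'feet'.",
        },
        "section": {
            "type": ["string", "null"],
            "description": "Document section the value came from.",
        },
        "summary": {
            "type": "string",
            "description": "Short summary of the requirement.",
        },
    },
}

schema = {
    "type": "object",
    "required": ["outputs"],
    "properties": {"outputs": {"type": "array", "items": record_schema}},
}


def check_record(record):
    """Return a list of problems found in one extraction record."""
    problems = []
    missing = set(record_schema["required"]) - set(record)
    extra = set(record) - set(record_schema["properties"])
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if record.get("feature") not in FEATURE_IDS:
        problems.append(f"unknown feature: {record.get('feature')!r}")
    return problems


# JSON text that could be saved as your schema file.
schema_text = json.dumps(schema, indent=4)
```

A quick structural check like check_record can be handy while iterating on the schema, since it flags records that omit a required field or use a feature ID outside the enum.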

Once the schema for the outputs array is finalized, you can add additional keys starting with a $ to encode instructions, examples, and edge case handling logic that the model can refer to when parsing the text. These extra keys are not required, and they are ignored for the purposes of creating the structure of the outputs themselves, but they often provide crucial context that improves accuracy.

For example, the wind extraction schema contains a $definitions key with detailed instructions on how to interpret setback multipliers and how to choose the most restrictive value when multiple setback distances are given in the text. This is reminiscent of the “decision logic” that you would normally encode in a decision tree for a traditional plugin, but here the logic is embedded in the schema itself and interpreted by the model at extraction time. This approach allows you to encode complex edge case handling logic without having to write any code, and it also allows you to easily update the logic by simply editing the schema.

The schema also includes a $examples key with example extractions that the model can refer to when deciding how to parse the text. You can be as detailed as you want in these instructions, and you can experiment with different outputs to tune the model’s understanding of the task and the desired output format.

The same schema includes a $instructions key with general instructions for the model to follow when parsing the text. This is a good place to reinforce the importance of following the schema and to provide any additional context that might be helpful for the model to know when performing the extraction.
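To make the split between structure and guidance concrete, here is a small Python sketch that attaches $-prefixed keys to a schema dictionary and then separates them again. The key names come from the text above, but the value formats (lists of strings, a dictionary of definitions, example records) and all of the wording are assumptions for illustration; check wind_schema.json for the layout actually used.

```python
import json

# Structural part of the schema (trimmed for brevity).
schema = {
    "type": "object",
    "required": ["outputs"],
    "properties": {
        "outputs": {"type": "array", "items": {"type": "object"}},
    },
}

# Hypothetical guidance keys; format and wording are illustrative only.
schema["$instructions"] = [
    "Follow the schema exactly and return JSON only.",
    "Only report requirements that are explicitly stated in the text.",
]
schema["$definitions"] = {
    "most_restrictive": (
        "If several setback distances are given for the same feature, "
        "report the most restrictive one."
    ),
}
schema["$examples"] = [
    {
        "text": "Towers shall be set back 500 feet from any dwelling.",
        "outputs": [
            {
                "feature": "structure_setback",
                "value": 500,
                "units": "feet",
                "section": None,
                "summary": "500 ft setback from dwellings.",
            }
        ],
    }
]


def guidance_keys(full_schema):
    """Return the sorted names of the '$'-prefixed guidance keys."""
    return sorted(k for k in full_schema if k.startswith("$"))


def structural_view(full_schema):
    """Return a copy of the schema without the '$'-prefixed keys."""
    return {k: v for k, v in full_schema.items() if not k.startswith("$")}


print(guidance_keys(schema))
print(json.dumps(structural_view(schema), indent=4))
```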

Finally, the schema contains an optional $qualitative_features key, which lists a subset of the features defined in the schema. This list tells COMPASS to treat these features as “qualitative”: they are expected to carry only a textual summary in the summary field, so their value and units fields are ignored and dropped from the final output. This key is not shown to the LLM, so it does not influence the LLM response.
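One way to picture the effect of this key is the following Python sketch. It is not the COMPASS implementation; the function name, the feature IDs, and the choice to remove the keys outright (rather than, say, blanking them) are assumptions made for illustration.

```python
# Hypothetical feature IDs listed under $qualitative_features.
QUALITATIVE_FEATURES = {"prohibitions"}


def drop_quantitative_fields(records, qualitative_features):
    """Remove 'value' and 'units' from records of qualitative features.

    Returns new dictionaries; the input records are left untouched.
    """
    cleaned = []
    for record in records:
        record = dict(record)  # shallow copy so the input is not mutated
        if record.get("feature") in qualitative_features:
            record.pop("value", None)
            record.pop("units", None)
        cleaned.append(record)
    return cleaned


records = [
    {
        "feature": "prohibitions",
        "value": None,
        "units": None,
        "section": "3.1",
        "summary": "Wind turbines are prohibited in residential zones.",
    },
    {
        "feature": "max_height",
        "value": 500,
        "units": "feet",
        "section": "4.2",
        "summary": "Maximum total height of 500 feet.",
    },
]

cleaned = drop_quantitative_fields(records, QUALITATIVE_FEATURES)
```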

You can add or remove as many of these extra keys as you want, and you can experiment with different ways of encoding the instructions and examples to see what works best for your particular use case. The main thing to keep in mind is that the core structure of the output must be defined by the outputs array in the schema, and any additional context or instructions should be provided through these extra keys.

Note

You can compare the one-shot wind schema to the existing decision trees in the wind energy plugin to get a feel for the translation of decision tree logic to schema descriptions.

Build a Plugin Config

Once you have defined your schema, the hard work is done! The next step is to build a one-shot plugin config that tells COMPASS how to use the schema and how to retrieve and filter documents. As with all configs in COMPASS, you may define your plugin configuration via JSON, JSON5, YAML, or TOML.

At a minimum, you must supply a schema key (either a dictionary containing the full schema or a path to a schema file):

{
    "schema": "./wind_schema.json"
}
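Because the schema key accepts either form, code that consumes a plugin config has to handle both a dictionary and a path. The Python sketch below shows one way to do that. The load_schema helper is hypothetical (it is not a COMPASS function), it assumes relative paths are resolved against the directory of the config file, and it reads plain JSON only, so a schema file containing comments would need a JSON5 parser instead.

```python
import json
from pathlib import Path


def load_schema(schema, config_dir="."):
    """Return the schema as a dict, given a dict or a path to a JSON file."""
    if isinstance(schema, dict):
        # Schema was embedded directly in the plugin config.
        return schema
    path = Path(schema)
    if not path.is_absolute():
        # Assumption: relative paths are relative to the config file.
        path = Path(config_dir) / path
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, load_schema(config["schema"], config_dir=Path(config_file).parent) would cover both the inline and the file-based variants, where config and config_file are placeholders for your parsed plugin config and its location.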

If you want a little bit more control over the extraction pipeline, you may specify several additional keys that let you customize query templates, website filters, and text extraction prompts:

{
    // Always required for one-shot schema extraction plugins
    "schema": "./wind_schema.json",

    // The default value for ``cache_llm_generated_content`` is
    // ``true``, but we include it here anyway for completeness
    // and to demonstrate that it can be set to ``false`` if desired.
    "cache_llm_generated_content": true,

    // By setting this option to ``true``, we indicate that we would
    // like a keyword-based heuristic to be applied, but would like
    // to use the LLM to generate heuristic keywords based on the
    // extraction schema (instead of providing custom heuristic
    // keywords).
    "heuristic_keywords": true,

    // By setting this option to ``true``, we indicate that we would
    // like a text collection (filter) step, but would like to simply
    // use the schema to guide the filtering (instead of providing
    // custom prompts).
    "collection_prompts": true,
}

The key options are listed below:

  • data_type_short_desc: Short label used in prompts (e.g., wind energy ordinance).

  • query_templates: Search queries with a {jurisdiction} placeholder.

  • website_keywords: Keyword weights for document search prioritization.

  • heuristic_keywords: Mapping of good and bad keywords for heuristic text checks, or true to auto-generate.

  • collection_prompts: Prompt list for chunk filtering, or true to auto-generate.

  • text_extraction_prompts: Prompt list for text consolidation, or true to auto-generate.

  • cache_llm_generated_content: Cache LLM-generated query templates and keywords. Defaults to true.

  • extraction_system_prompt: Optional system prompt override for extraction.

See this documentation for further details.
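To illustrate the {jurisdiction} placeholder used by query_templates, the Python sketch below fills a couple of templates for one jurisdiction. The fill_templates helper and the jurisdiction name are hypothetical, and the use of str.format is an assumption about how the placeholder could be substituted, not a description of COMPASS internals.

```python
QUERY_TEMPLATES = [
    "filetype:pdf {jurisdiction} wind energy conversion system ordinances",
    "{jurisdiction} wind WECS ordinance",
]


def fill_templates(templates, jurisdiction):
    """Substitute the jurisdiction name into every query template."""
    missing = [t for t in templates if "{jurisdiction}" not in t]
    if missing:
        # A template without the placeholder would yield the same
        # query for every jurisdiction, which is almost never intended.
        raise ValueError(f"templates without a placeholder: {missing}")
    return [t.format(jurisdiction=jurisdiction) for t in templates]


queries = fill_templates(QUERY_TEMPLATES, "Example County, Colorado")
for query in queries:
    print(query)
```

Validating the placeholder up front is a cheap way to catch a template that was pasted in without it.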

If you want full control over all of the options above, you can specify them directly in the config. You can also specify custom prompts for the collection and text extraction steps, which gives you even more control over the pipeline and allows you to further tune the model. Here is an example of a fully-specified config (in YAML for easier readability):

schema: ./wind_schema.json

data_type_short_desc: wind energy ordinance

query_templates:
  - "filetype:pdf {jurisdiction} wind energy conversion system ordinances"
  - "wind energy conversion system ordinances {jurisdiction}"
  - "{jurisdiction} wind WECS ordinance"
  - "Where can I find the legal text for commercial wind energy conversion system zoning ordinances in {jurisdiction}?"
  - "What is the specific legal information regarding zoning ordinances for commercial wind energy conversion systems in {jurisdiction}?"

website_keywords:
  pdf: 92160
  wecs: 46080
  wind: 23040
  zoning: 11520
  ordinance: 5760
  renewable energy: 1440
  planning: 720
  plan: 360
  government: 180
  code: 60
  area: 60
  land development: 15
  land: 3
  environment: 3
  energy: 3
  renewable: 3
  municipal: 1
  department: 1

heuristic_keywords:
  good_tech_keywords:
    - "wind"
    - "setback"
  good_tech_acronyms:
    - "wecs"
    - "wes"
    - "lwet"
    - "uwet"
    - "wef"
  good_tech_phrases:
    - "wind energy conversion"
    - "wind turbine"
    - "wind tower"
    - "wind farm"
    - "wind energy system"
    - "wind energy farm"
    - "utility wind energy system"
  not_tech_words:
    - "micro wecs"
    - "small wecs"
    - "mini wecs"
    - "private wecs"
    - "personal wecs"
    - "pwecs"
    - "rewind"
    - "small wind"
    - "micro wind"
    - "mini wind"
    - "private wind"
    - "personal wind"
    - "swecs"
    - "windbreak"
    - "windiest"
    - "winds"
    - "windshield"
    - "window"
    - "windy"
    - "wind attribute"
    - "wind blow"
    - "wind break"
    - "wind current"
    - "wind damage"
    - "wind data"
    - "wind direction"
    - "wind draft"
    - "wind erosion"
    - "wind energy resource atlas"
    - "wind load"
    - "wind movement"
    - "wind orient"
    - "wind resource"
    - "wind runway"
    - "prevailing wind"
    - "downwind"

collection_prompts:
  - key: contains_ord_info
    label: contains ordinance info
    prompt: |-
      You extract structured data from text. Return your answer in JSON
      format (not markdown). Your JSON file must include exactly two keys.
      The first key is 'wind_reqs', which is a string that summarizes all
      zoning, siting, setback, system design, and operational
      requirements/restrictions that are explicitly enacted in the text
      for a wind energy system (or wind turbine/tower) for a given
      jurisdiction. Note that wind energy bans are an important restriction
      to track. Include any **closely related provisions** if they clearly
      pertain to the **development, operation, modification, or removal** of
      wind energy systems (or wind turbines/towers). All restrictions should
      be enforceable - ignore any text that only provides a legal definition
      of the regulation. If the text does not specify any concrete zoning,
      siting, setback, system design, or operational requirements/restrictions
      for a wind energy system, set this key to `null`. The last key is '{key}',
      which is a boolean that is set to True if the text excerpt explicitly
      details zoning, siting, setback, system design, or operational
      requirements/restrictions for a wind energy system (or wind turbine/tower)
      and False otherwise.

  - key: x
    label: for utility-scale WECS
    prompt: |-
      You are a legal scholar that reads ordinance text and determines whether
      any of it applies to zoning, siting, setback, system design, or operational
      requirements/restrictions for **large wind energy systems**.
      Large wind energy systems (WES) may also be referred to as wind turbines,
      wind energy conversion systems (WECS), wind energy facilities (WEF),
      wind energy turbines (WET), large wind energy turbines (LWET),
      utility-scale wind energy turbines (UWET),
      commercial wind energy conversion systems (CWECS),
      alternate energy systems (AES), commercial energy production systems (CEPCS),
      or similar. Your client is a commercial wind developer that does not care
      about ordinances related to private, residential, micro, small,
      or medium sized wind energy systems. Ignore any text related to such systems.
      Return your answer as a dictionary in JSON format (not markdown).
      Your JSON file must include exactly two keys. The first key is 'summary'
      which contains a string that lists all of the types of wind energy systems
      the text applies to (if any). The second key is '{key}', which is a boolean
      that is set to True if any part of the text excerpt details zoning, siting,
      setback, system design, or operational requirements/restrictions for the
      **large wind energy conversion systems** (or similar) that the client is
      interested in and False otherwise.

text_extraction_prompts:
  - key: wind_energy_systems_text
    out_fn: "{jurisdiction} Wind Ordinance.txt"
    prompt: |-
      # CONTEXT #
      We want to reduce the provided excerpt to only contain information about
      **wind energy systems**. The extracted text will be used for structured
      data extraction, so it must be both **comprehensive** (retaining all relevant
      details) and **focused** (excluding unrelated content), with **zero rewriting
      or paraphrasing**. Ensure that all retained information is **directly applicable
      to wind energy systems** while preserving full context and accuracy.

      # OBJECTIVE #
      Extract all text **pertaining to wind energy systems** from the provided excerpt.

      # RESPONSE #
      Follow these guidelines carefully:

      1. ## Scope of Extraction ##:
      - Include all text that pertains to **wind energy systems**.
      - Explicitly include any text related to **bans or prohibitions** on wind energy systems.
      - Explicitly include any text related to the adoption or enactment date of the ordinance (if any).

      2. ## Exclusions ##:
      - Do **not** include text that does not pertain to wind energy systems.

      3. ## Formatting & Structure ##:
      - **Preserve _all_ section titles, headers, and numberings** for reference.
      - **Maintain the original wording, formatting, and structure** to ensure accuracy.

      4. ## Output Handling ##:
      - This is a strict extraction task — act like a text filter, **not** a summarizer or writer.
      - Do not add, explain, reword, or summarize anything.
      - The output must be a **copy-paste** of the original excerpt. **Absolutely no paraphrasing or rewriting.**
      - The output must consist **only** of contiguous or discontiguous verbatim blocks copied from the input.
      - If **no relevant text** is found, return the response: 'No relevant text.'

  - key: cleaned_text_for_extraction
    out_fn: "{jurisdiction} Utility Scale Wind Ordinance.txt"
    prompt: |-
      # CONTEXT #
      We want to reduce the provided excerpt to only contain information about
      **large wind energy systems**. The extracted text will be used for
      structured data extraction, so it must be both **comprehensive**
      (retaining all relevant details) and **focused** (excluding unrelated
      content), with **zero rewriting or paraphrasing**. Ensure that all
      retained information is **directly applicable** to large wind energy
      systems while preserving full context and accuracy.

      # OBJECTIVE #
      Extract all text **pertaining to large wind energy systems** from the provided excerpt.

      # RESPONSE #
      Follow these guidelines carefully:

      1. ## Scope of Extraction ##:
      - Include all text that pertains to **large wind energy systems**, even if they are referred to by different names such as: Wind turbines, wind energy conversion systems (wecs), wind energy facilities (wef), wind energy turbines (wet), large wind energy turbines (lwet), utility-scale wind energy turbines (uwet), commercial wind energy conversion systems (cwecs), alternate energy systems (aes), commercial energy production systems (cepcs), or similar
      - Explicitly include any text related to **bans or prohibitions** on large wind energy systems.
      - Explicitly include any text related to the adoption or enactment date of the ordinance (if any).
      - **Retain all relevant technical, design, operational, safety, environmental, and infrastructure-related provisions** that apply to the topic, such as (but not limited to):
          - Compliance with legal or regulatory standards.
          - Site, structural, or design specifications.
          - Environmental impact considerations.
          - Safety and risk mitigation measures.
          - Infrastructure, implementation, operation, and maintenance details.
          - All other **closely related provisions**.

      2. ## Exclusions ##:
      - Do **not** include text that explicitly applies **only** to private, residential, micro, small, or medium sized wind energy systems.
      - Do **not** include text that does not pertain at all to wind energy systems.

      3. ## Formatting & Structure ##:
      - **Preserve _all_ section titles, headers, and numberings** for reference.
      - **Maintain the original wording, formatting, and structure** to ensure accuracy.

      4. ## Output Handling ##:
      - This is a strict extraction task — act like a text filter, **not** a summarizer or writer.
      - Do not add, explain, reword, or summarize anything.
      - The output must be a **copy-paste** of the original excerpt. **Absolutely no paraphrasing or rewriting.**
      - The output must consist **only** of contiguous or discontiguous verbatim blocks copied from the input.
      - If **no relevant text** is found, return the response: 'No relevant text.'

extraction_system_prompt: |-
  You are a legal scholar extracting structured data from wind energy ordinances.
  Follow all instructions in the schema descriptions carefully.
  Only extract requirements for large, commercial, utility-scale wind energy systems.
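The collection prompts above refer to a '{key}' placeholder and ask the model for a JSON object containing a boolean under that key. The Python sketch below shows how such a prompt could be rendered and how the response could be interpreted. Both helpers are hypothetical, the prompt text is abbreviated, and the LLM response is a hand-written stand-in; how COMPASS actually performs these steps is not shown in this example.

```python
import json


def render_prompt(prompt_config):
    """Insert the prompt's own key into its '{key}' placeholder."""
    # str.replace is used instead of str.format so that any other
    # literal braces in a prompt (e.g. JSON snippets) are left alone.
    return prompt_config["prompt"].replace("{key}", prompt_config["key"])


def passes_filter(prompt_config, llm_response_text):
    """Return True only if the response sets the key to a JSON true."""
    data = json.loads(llm_response_text)
    return data.get(prompt_config["key"]) is True


collection_prompt = {
    "key": "contains_ord_info",
    "label": "contains ordinance info",
    # Abbreviated version of the first collection prompt above.
    "prompt": (
        "Your JSON file must include exactly two keys. "
        "The last key is '{key}', which is a boolean."
    ),
}

rendered = render_prompt(collection_prompt)

# Stand-in for a model response; no LLM is called in this sketch.
fake_response = '{"wind_reqs": "500 ft setback", "contains_ord_info": true}'
keep_chunk = passes_filter(collection_prompt, fake_response)
```

Checking for an exact boolean (rather than general truthiness) avoids treating a string such as "False" as a positive answer.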

Execution

Once both the schema and plugin configuration are set up, you can run your newly created one-shot plugin alongside the standard COMPASS pipeline using the --plugin (-p) flag. The main run config still controls core pipeline settings and must include a tech value that matches your target technology.

compass process -c config.json5 \
    -p examples/one_shot_schema_extraction/plugin_config.yaml

If you are using pixi:

pixi run compass process -c config.json5 \
    -p examples/one_shot_schema_extraction/plugin_config.yaml

Add -v (or -vv) if you want log output in the terminal. See the Execution Basics example for more details on running COMPASS pipelines.