One-Shot Extraction
This example shows how to author a one-shot extraction schema and run it through COMPASS. The one-shot plugin uses your schema to extract structured data in a single LLM call.
Prerequisites
Before you begin, review the COMPASS Execution Basics example to understand how to set up a run environment and model run configuration. Once your one-shot schema is in place, you run the data extraction pipeline in the same way as described in that example.
Create Your Schema
To start off, you will need to create a one-shot JSON schema that describes the extraction output shape and embeds the extraction logic in schema field descriptions. The easiest way to do this is by copying wind_schema.json and adjusting it for your domain.
At a minimum, the schema must return an object with an outputs array, where
each item is an extraction record with the required fields shown below:
{
  "type": "object",
  "required": ["outputs"],
  "properties": {
    "outputs": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [
          "feature", "value", "units", "section", "summary"
        ],
        "additionalProperties": false,
        "properties": {
          ... // define each of the required fields from above
        }
      }
    }
  }
}
The main field here is feature, which is the ID of the extracted feature
(e.g., a setback distance or a maximum allowed height). The other fields
(value, units, section, and summary) are important for keeping
the output consistent across various extractions and allowing a central database
to keep track of the scraped data.
You will need to customize the schema for your particular use case by defining the allowed feature IDs (typically in an enum), as well as the rest of the required fields, and by encoding the extraction logic in the field descriptions. The schema is the core of the one-shot plugin, and the quality of the schema will directly impact the quality of the extracted data, so it’s worth spending some time to get it right!
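As a rough illustration of the structure described above, the following Python sketch builds a minimal schema with an enum of feature IDs and checks one sample record against the required fields. The feature IDs (structures_setback, max_height), the field types and descriptions, and the missing_fields helper are hypothetical and are not taken from wind_schema.json. COMPASS handles the schema in its own way, so treat this only as a starting point for drafting.

```python
import json

# Hypothetical feature IDs; replace these with the features for your domain.
FEATURE_IDS = ["structures_setback", "max_height"]
REQUIRED_FIELDS = ["feature", "value", "units", "section", "summary"]

# Field types and descriptions below are illustrative assumptions.
schema = {
    "type": "object",
    "required": ["outputs"],
    "properties": {
        "outputs": {
            "type": "array",
            "items": {
                "type": "object",
                "required": REQUIRED_FIELDS,
                "additionalProperties": False,
                "properties": {
                    "feature": {
                        "type": "string",
                        "enum": FEATURE_IDS,
                        "description": "ID of the extracted feature.",
                    },
                    "value": {
                        "type": ["number", "string", "null"],
                        "description": "Extracted value, or null if none is stated.",
                    },
                    "units": {
                        "type": ["string", "null"],
                        "description": "Units of the value, e.g. 'feet'.",
                    },
                    "section": {
                        "type": ["string", "null"],
                        "description": "Section of the text the value came from.",
                    },
                    "summary": {
                        "type": ["string", "null"],
                        "description": "Short summary of the requirement.",
                    },
                },
            },
        }
    },
}


def missing_fields(record):
    """Return the required fields that are absent from one output record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


sample = {"feature": "max_height", "value": 500, "units": "feet",
          "section": "Sec. 4.2", "summary": "Towers may not exceed 500 feet."}
print(missing_fields(sample))  # an empty list means every required field is present
print(json.dumps(schema, indent=2))  # paste into a .json file as a starting point
```

In a real schema, the descriptions would carry the extraction logic for each field rather than the one-line placeholders used here.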
Once the schema for the outputs array is finalized, you can add extra
keys whose names start with a $ to encode instructions, examples, and edge case
handling logic that the model can refer to when parsing the text. These extra
keys are not required, and they play no part in defining the structure of the
outputs themselves, but they often provide crucial context that improves
accuracy.
For example, the
wind extraction schema
contains a $definitions key with detailed instructions on how to interpret
setback multipliers and how to choose the most restrictive value when multiple
setback distances are given in the text. This is reminiscent of the “decision logic”
that you would normally encode in a decision tree for a traditional plugin,
but here the logic is embedded in the schema itself and interpreted by the model
at extraction time. This approach allows you to encode complex edge case handling
logic without having to write any code, and it also allows you to easily update
the logic by simply editing the schema.
The schema also includes a $examples key with example extractions that the model
can refer to when deciding how to parse the text. You can be as detailed as you want
in these instructions, and you can experiment with different outputs to tune the
model’s understanding of the task and the desired output format.
The same schema includes a $instructions key with general instructions
for the model to follow when parsing the text. This is a good place to reinforce the
importance of following the schema and to provide any additional context that might be
helpful for the model to know when performing the extraction.
Finally, the schema contains an optional $qualitative_features key, which
lists a subset of the features defined in the schema. This list tells COMPASS
to categorize these features as “qualitative”, meaning that they are expected to
contain only a textual summary in the summary field of the output. The
value and units fields for these features are ignored and dropped from the
final output. This key is not shown to the LLM, so it does not influence the
LLM response.
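The effect of $qualitative_features on the final output can be pictured with a short Python sketch. This is not the COMPASS implementation: the drop_qualitative_values function and the feature IDs used in the example are hypothetical, and the sketch assumes that "dropped" means the value and units keys are removed from the record.

```python
def drop_qualitative_values(outputs, qualitative_features):
    """Return copies of the records with value/units removed for qualitative features."""
    qualitative = set(qualitative_features)
    cleaned = []
    for record in outputs:
        record = dict(record)  # shallow copy so the caller's records are not mutated
        if record.get("feature") in qualitative:
            record.pop("value", None)
            record.pop("units", None)
        cleaned.append(record)
    return cleaned


# Hypothetical records; "prohibitions" stands in for a qualitative feature.
records = [
    {"feature": "max_height", "value": 500, "units": "feet",
     "section": "4.2", "summary": "Height limit."},
    {"feature": "prohibitions", "value": None, "units": None,
     "section": "2.1", "summary": "Wind systems are banned in zone R-1."},
]
for row in drop_qualitative_values(records, ["prohibitions"]):
    print(row)
```

Quantitative records pass through unchanged, while qualitative records keep only feature, section, and summary.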
You can add or remove as many of these extra keys as you want, and you can experiment with
different ways of encoding the instructions and examples to see what works best for your
particular use case. The main thing to keep in mind is that the core structure of the
output must be defined by the outputs array in the schema, and any additional context
or instructions should be provided through these extra keys.
Note
You can compare the one-shot wind schema to the existing decision trees in the wind energy plugin to get a feel for the translation of decision tree logic to schema descriptions.
Build a Plugin Config
Once you have defined your schema, the hard work is done! The next step is to build a one-shot plugin config that tells COMPASS how to use the schema and how to retrieve and filter documents. As with all configs in COMPASS, you may define your plugin configuration via JSON, JSON5, YAML, or TOML.
At a minimum, you must supply a schema key (either a dictionary containing the
full schema or a path to a schema file):
{
  "schema": "./wind_schema.json"
}
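Because the schema key accepts either an inline dictionary or a file path, any tooling you write around the config has to handle both forms. The Python sketch below shows one way to do that. The resolve_schema helper is hypothetical, and resolving a relative path against the directory of the plugin config is an assumption, not documented COMPASS behavior.

```python
import json
from pathlib import Path


def resolve_schema(schema, config_dir="."):
    """Return the schema as a dict, loading it from disk if a path was given."""
    if isinstance(schema, dict):
        return schema  # the schema was embedded directly in the config
    # Assumption: relative paths are interpreted relative to config_dir.
    path = Path(config_dir) / schema
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


inline = resolve_schema({"type": "object", "required": ["outputs"]})
print(inline["required"])
```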
If you want a little bit more control over the extraction pipeline, you may specify several additional keys that let you customize query templates, website filters, and text extraction prompts:
{
  // Always required for one-shot schema extraction plugins
  "schema": "./wind_schema.json",
  // The default value for ``cache_llm_generated_content`` is
  // ``true``, but we include it here anyway for completeness
  // and to demonstrate that it can be set to ``false`` if desired.
  "cache_llm_generated_content": true,
  // By setting this option to ``true``, we indicate that we would
  // like a keyword-based heuristic to be applied, but would like
  // to use the LLM to generate heuristic keywords based on the
  // extraction schema (instead of providing custom heuristic
  // keywords).
  "heuristic_keywords": true,
  // By setting this option to ``true``, we indicate that we would
  // like a text collection (filter) step, but would like to simply
  // use the schema to guide the filtering (instead of providing
  // custom prompts).
  "collection_prompts": true,
}
The key options are listed below:
- data_type_short_desc: Short label used in prompts (e.g., wind energy ordinance).
- query_templates: Search queries with a {jurisdiction} placeholder.
- website_keywords: Keyword weights for document search prioritization.
- heuristic_keywords: Mapping of good and bad keywords for heuristic text checks, or true to have the LLM generate them from the schema.
- collection_prompts: Prompt list for chunk filtering, or true to auto-generate.
- text_extraction_prompts: Prompt list for text consolidation, or true to auto-generate.
- cache_llm_generated_content: Cache LLM-generated query templates and keywords. By default, true.
- extraction_system_prompt: Optional system prompt override for extraction.
See this documentation for further details.
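To make the {jurisdiction} placeholder concrete, here is a small Python sketch that expands two of the wind query templates for one jurisdiction. The build_queries helper and the example jurisdiction name are hypothetical, and the sketch assumes the placeholder is filled in with ordinary string formatting; COMPASS may perform the substitution differently.

```python
QUERY_TEMPLATES = [
    "filetype:pdf {jurisdiction} wind energy conversion system ordinances",
    "{jurisdiction} wind WECS ordinance",
]


def build_queries(templates, jurisdiction):
    """Fill the {jurisdiction} placeholder in each query template."""
    return [template.format(jurisdiction=jurisdiction) for template in templates]


# The jurisdiction name is purely illustrative.
for query in build_queries(QUERY_TEMPLATES, "Example County, Indiana"):
    print(query)
```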
If you want full control over all of the options above, you can specify them directly in the config. You can also specify custom prompts for the collection and text extraction steps, which gives you even more control over the pipeline and allows you to further tune the model. Here is an example of a fully-specified config (in YAML for easier readability):
schema: ./wind_schema.json
data_type_short_desc: wind energy ordinance
query_templates:
  - "filetype:pdf {jurisdiction} wind energy conversion system ordinances"
  - "wind energy conversion system ordinances {jurisdiction}"
  - "{jurisdiction} wind WECS ordinance"
  - "Where can I find the legal text for commercial wind energy conversion system zoning ordinances in {jurisdiction}?"
  - "What is the specific legal information regarding zoning ordinances for commercial wind energy conversion systems in {jurisdiction}?"
website_keywords:
  pdf: 92160
  wecs: 46080
  wind: 23040
  zoning: 11520
  ordinance: 5760
  renewable energy: 1440
  planning: 720
  plan: 360
  government: 180
  code: 60
  area: 60
  land development: 15
  land: 3
  environment: 3
  energy: 3
  renewable: 3
  municipal: 1
  department: 1
heuristic_keywords:
  good_tech_keywords:
    - "wind"
    - "setback"
  good_tech_acronyms:
    - "wecs"
    - "wes"
    - "lwet"
    - "uwet"
    - "wef"
  good_tech_phrases:
    - "wind energy conversion"
    - "wind turbine"
    - "wind tower"
    - "wind farm"
    - "wind energy system"
    - "wind energy farm"
    - "utility wind energy system"
  not_tech_words:
    - "micro wecs"
    - "small wecs"
    - "mini wecs"
    - "private wecs"
    - "personal wecs"
    - "pwecs"
    - "rewind"
    - "small wind"
    - "micro wind"
    - "mini wind"
    - "private wind"
    - "personal wind"
    - "swecs"
    - "windbreak"
    - "windiest"
    - "winds"
    - "windshield"
    - "window"
    - "windy"
    - "wind attribute"
    - "wind blow"
    - "wind break"
    - "wind current"
    - "wind damage"
    - "wind data"
    - "wind direction"
    - "wind draft"
    - "wind erosion"
    - "wind energy resource atlas"
    - "wind load"
    - "wind movement"
    - "wind orient"
    - "wind resource"
    - "wind runway"
    - "prevailing wind"
    - "downwind"
collection_prompts:
  - key: contains_ord_info
    label: contains ordinance info
    prompt: |-
      You extract structured data from text. Return your answer in JSON
      format (not markdown). Your JSON file must include exactly two keys.
      The first key is 'wind_reqs', which is a string that summarizes all
      zoning, siting, setback, system design, and operational
      requirements/restrictions that are explicitly enacted in the text
      for a wind energy system (or wind turbine/tower) for a given
      jurisdiction. Note that wind energy bans are an important restriction
      to track. Include any **closely related provisions** if they clearly
      pertain to the **development, operation, modification, or removal** of
      wind energy systems (or wind turbines/towers). All restrictions should
      be enforceable - ignore any text that only provides a legal definition
      of the regulation. If the text does not specify any concrete zoning,
      siting, setback, system design, or operational requirements/restrictions
      for a wind energy system, set this key to `null`. The last key is '{key}',
      which is a boolean that is set to True if the text excerpt explicitly
      details zoning, siting, setback, system design, or operational
      requirements/restrictions for a wind energy system (or wind turbine/tower)
      and False otherwise.
  - key: x
    label: for utility-scale WECS
    prompt: |-
      You are a legal scholar that reads ordinance text and determines whether
      any of it applies to zoning, siting, setback, system design, or operational
      requirements/restrictions for **large wind energy systems**.
      Large wind energy systems (WES) may also be referred to as wind turbines,
      wind energy conversion systems (WECS), wind energy facilities (WEF),
      wind energy turbines (WET), large wind energy turbines (LWET),
      utility-scale wind energy turbines (UWET),
      commercial wind energy conversion systems (CWECS),
      alternate energy systems (AES), commercial energy production systems (CEPCS),
      or similar. Your client is a commercial wind developer that does not care
      about ordinances related to private, residential, micro, small,
      or medium sized wind energy systems. Ignore any text related to such systems.
      Return your answer as a dictionary in JSON format (not markdown).
      Your JSON file must include exactly two keys. The first key is 'summary'
      which contains a string that lists all of the types of wind energy systems
      the text applies to (if any). The second key is '{key}', which is a boolean
      that is set to True if any part of the text excerpt details zoning, siting,
      setback, system design, or operational requirements/restrictions for the
      **large wind energy conversion systems** (or similar) that the client is
      interested in and False otherwise.
text_extraction_prompts:
  - key: wind_energy_systems_text
    out_fn: "{jurisdiction} Wind Ordinance.txt"
    prompt: |-
      # CONTEXT #
      We want to reduce the provided excerpt to only contain information about
      **wind energy systems**. The extracted text will be used for structured
      data extraction, so it must be both **comprehensive** (retaining all relevant
      details) and **focused** (excluding unrelated content), with **zero rewriting
      or paraphrasing**. Ensure that all retained information is **directly applicable
      to wind energy systems** while preserving full context and accuracy.
      # OBJECTIVE #
      Extract all text **pertaining to wind energy systems** from the provided excerpt.
      # RESPONSE #
      Follow these guidelines carefully:
      1. ## Scope of Extraction ##:
      - Include all text that pertains to **wind energy systems**.
      - Explicitly include any text related to **bans or prohibitions** on wind energy systems.
      - Explicitly include any text related to the adoption or enactment date of the ordinance (if any).
      2. ## Exclusions ##:
      - Do **not** include text that does not pertain to wind energy systems.
      3. ## Formatting & Structure ##:
      - **Preserve _all_ section titles, headers, and numberings** for reference.
      - **Maintain the original wording, formatting, and structure** to ensure accuracy.
      4. ## Output Handling ##:
      - This is a strict extraction task — act like a text filter, **not** a summarizer or writer.
      - Do not add, explain, reword, or summarize anything.
      - The output must be a **copy-paste** of the original excerpt. **Absolutely no paraphrasing or rewriting.**
      - The output must consist **only** of contiguous or discontiguous verbatim blocks copied from the input.
      - If **no relevant text** is found, return the response: 'No relevant text.'
  - key: cleaned_text_for_extraction
    out_fn: "{jurisdiction} Utility Scale Wind Ordinance.txt"
    prompt: |-
      # CONTEXT #
      We want to reduce the provided excerpt to only contain information about
      **large wind energy systems**. The extracted text will be used for
      structured data extraction, so it must be both **comprehensive**
      (retaining all relevant details) and **focused** (excluding unrelated
      content), with **zero rewriting or paraphrasing**. Ensure that all
      retained information is **directly applicable** to large wind energy
      systems while preserving full context and accuracy.
      # OBJECTIVE #
      Extract all text **pertaining to large wind energy systems** from the provided excerpt.
      # RESPONSE #
      Follow these guidelines carefully:
      1. ## Scope of Extraction ##:
      - Include all text that pertains to **large wind energy systems**, even if they are referred to by different names such as: Wind turbines, wind energy conversion systems (wecs), wind energy facilities (wef), wind energy turbines (wet), large wind energy turbines (lwet), utility-scale wind energy turbines (uwet), commercial wind energy conversion systems (cwecs), alternate energy systems (aes), commercial energy production systems (cepcs), or similar
      - Explicitly include any text related to **bans or prohibitions** on large wind energy systems.
      - Explicitly include any text related to the adoption or enactment date of the ordinance (if any).
      - **Retain all relevant technical, design, operational, safety, environmental, and infrastructure-related provisions** that apply to the topic, such as (but not limited to):
      - Compliance with legal or regulatory standards.
      - Site, structural, or design specifications.
      - Environmental impact considerations.
      - Safety and risk mitigation measures.
      - Infrastructure, implementation, operation, and maintenance details.
      - All other **closely related provisions**.
      2. ## Exclusions ##:
      - Do **not** include text that explicitly applies **only** to private, residential, micro, small, or medium sized wind energy systems.
      - Do **not** include text that does not pertain at all to wind energy systems.
      3. ## Formatting & Structure ##:
      - **Preserve _all_ section titles, headers, and numberings** for reference.
      - **Maintain the original wording, formatting, and structure** to ensure accuracy.
      4. ## Output Handling ##:
      - This is a strict extraction task — act like a text filter, **not** a summarizer or writer.
      - Do not add, explain, reword, or summarize anything.
      - The output must be a **copy-paste** of the original excerpt. **Absolutely no paraphrasing or rewriting.**
      - The output must consist **only** of contiguous or discontiguous verbatim blocks copied from the input.
      - If **no relevant text** is found, return the response: 'No relevant text.'
extraction_system_prompt: |-
  You are a legal scholar extracting structured data from wind energy ordinances.
  Follow all instructions in the schema descriptions carefully.
  Only extract requirements for large, commercial, utility-scale wind energy systems.
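To give a feel for why the heuristic_keywords section of the config above lists both good and excluded terms, the Python sketch below runs a simple keyword check over a piece of text. It is only an illustration under stated assumptions: the passes_heuristic function is hypothetical, the keyword lists are abbreviated from the config, and the actual heuristic that COMPASS applies may differ in matching rules and thresholds.

```python
import re

GOOD_KEYWORDS = ["wind", "setback"]
GOOD_PHRASES = ["wind turbine", "wind energy system"]
NOT_TECH_WORDS = ["small wind", "windshield", "window", "winds"]


def passes_heuristic(text, good_keywords, good_phrases, not_tech_words):
    """Hypothetical check: mask excluded terms, then look for a good keyword or phrase."""
    lowered = text.lower()
    # Mask longer excluded terms first so that "windshield" is removed as a
    # whole before the shorter "winds" can match part of it.
    for term in sorted(not_tech_words, key=len, reverse=True):
        lowered = lowered.replace(term, " ")
    words = set(re.findall(r"[a-z]+", lowered))
    has_keyword = any(word in words for word in good_keywords)
    has_phrase = any(phrase in lowered for phrase in good_phrases)
    return has_keyword or has_phrase


print(passes_heuristic("Each wind turbine shall meet a setback of 500 feet.",
                       GOOD_KEYWORDS, GOOD_PHRASES, NOT_TECH_WORDS))
print(passes_heuristic("Every window shall face the street.",
                       GOOD_KEYWORDS, GOOD_PHRASES, NOT_TECH_WORDS))
```

The excluded terms keep look-alike words such as "window" and out-of-scope phrases such as "small wind" from counting as evidence that a document covers large wind energy systems.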
Execution
Once both the schema and plugin configuration are set up, you can run your newly created
one-shot plugin alongside the standard COMPASS pipeline using the --plugin flag.
The main run config still controls core pipeline settings and must include a tech
value that matches your target technology.
compass process -c config.json5 \
    -p examples/one_shot_schema_extraction/plugin_config.yaml
If you are using pixi:
pixi run compass process -c config.json5 \
    -p examples/one_shot_schema_extraction/plugin_config.yaml
Add -v (or -vv) if you want log output in the terminal.
See the Execution Basics example
for more details on running COMPASS pipelines.