Parsing Existing Docs via the CLI
If you already have documents that you want to run data extraction on, you can skip web search and run COMPASS directly against local files. This example shows the minimal CLI setup for processing local documents.
Prerequisites
Before starting, review the COMPASS Execution Basics to understand how to set up a run environment and model run configuration. This example reuses the same execution pattern, with one added input that points COMPASS to your local files.
Compile Document Info
The key to running COMPASS against local files is a mapping that tells
COMPASS where those documents are. The mapping goes from jurisdiction
codes (e.g. FIPS codes) to lists of document metadata dicts, where each
dict must contain at least a source_fp key that points to the local
file path.
For example, a minimal local document specification would look like this:
{
    "18031": [
        {
            "source_fp": "../Decatur County, Indiana.pdf"
        }
    ]
}
This mapping can be saved as a config file using any of the formats supported by COMPASS (JSON, JSON5, YAML, or TOML).
If you need to look up the jurisdiction codes to use in the mapping, you can take a look at the list of known jurisdictions in the COMPASS repository.
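If you have many files, the minimal mapping can be generated rather than written by hand. The sketch below is one way to do that with the Python standard library only; the helper names build_local_docs_mapping and save_mapping are hypothetical and are not part of COMPASS. It assumes you already know which jurisdiction code each file belongs to.

```python
import json
from pathlib import Path


def build_local_docs_mapping(docs_by_code):
    """Build a {jurisdiction code: [document dict, ...]} mapping.

    ``docs_by_code`` maps a jurisdiction code (e.g. a FIPS code) to an
    iterable of local file paths. Each path becomes a document dict that
    holds only the required ``source_fp`` key.
    """
    mapping = {}
    for code, paths in docs_by_code.items():
        # JSON object keys must be strings, so normalize the code here
        mapping[str(code)] = [{"source_fp": Path(p).as_posix()} for p in paths]
    return mapping


def save_mapping(mapping, out_fp):
    """Write the mapping as plain JSON (hypothetical helper)."""
    Path(out_fp).write_text(json.dumps(mapping, indent=2), encoding="utf-8")


# Reproduce the minimal example from this section
mapping = build_local_docs_mapping(
    {"18031": ["../Decatur County, Indiana.pdf"]}
)
print(json.dumps(mapping, indent=2))
```

Calling save_mapping(mapping, "local_docs.json") would then write the file that the run config points to.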
Since we didn’t include any additional metadata beyond the required
source_fp, COMPASS will perform all of the same document processing
steps that a document retrieved via search would go through, including
legal text validation and date extraction. To skip some or all of these
steps, you can include additional metadata fields in the document dicts
as described in the
COMPASS documentation.
Below is an example of a more fully specified document mapping that
includes multiple documents, each with additional metadata fields to
skip certain processing steps:
{
    "18031": [
        {
            "source_fp": "../Decatur County, Indiana.pdf",
            "source": "https://decaturcounty.in.gov/download/zoning-ordinance-article-13-wind-energy-conversion-system-wecs?refresh=68ffda0d84a6e1761597965&wpdmdl=6638",
            "date": [null, null, null], // [year, month, day] - Skips date extraction if given
            "check_if_legal_doc": false, // Skip legal doc check
            // Optional metadata fields - not required but can be helpful for metadata in the run output
            "checksum": "sha256:1f68616ac8c4f26ca6cacf85023f210f7a453c002ca9159eb42252470b503386",
            "from_ocr": false,
        },
    ],
    "18047": [
        {
            "source_fp": "../Franklin County, Indiana.pdf",
            "source": "https://www.franklincounty.in.gov/wp-content/uploads/2023/05/80.06.06-Commercial-and-Intermediate-Energy-Systems.pdf",
            "date": [2023, 5, null], // Same as above...
            "check_if_legal_doc": false,
            "checksum": "sha256:6ff5f90301ffba6ac4a8dd4d629201fe7f5cbffa7c5ae6fc8951e978d11be1fa",
            "from_ocr": false,
        }
    ],
}
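The fuller document dicts above can also be assembled programmatically. The following sketch uses hypothetical helpers (sha256_checksum and make_doc_entry are not COMPASS functions) and assumes that the checksum value is the SHA-256 digest of the raw file bytes prefixed with "sha256:". The example above shows that prefix format, but this page does not say how the digest is computed, so treat that part as an assumption.

```python
import hashlib
from pathlib import Path


def sha256_checksum(fp, chunk_size=1 << 20):
    """Return 'sha256:<hex digest>' for the bytes of the file at ``fp``.

    Assumption: the digest is taken over the raw file bytes.
    """
    digest = hashlib.sha256()
    with open(fp, "rb") as fh:
        # Read in chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def make_doc_entry(source_fp, source=None, date=None,
                   check_if_legal_doc=None, from_ocr=None):
    """Assemble one document dict, leaving out any field that is not given."""
    entry = {"source_fp": Path(source_fp).as_posix()}
    if source is not None:
        entry["source"] = source
    if date is not None:
        year, month, day = date  # unknown parts stay None (null in JSON)
        entry["date"] = [year, month, day]
    if check_if_legal_doc is not None:
        entry["check_if_legal_doc"] = check_if_legal_doc
    if from_ocr is not None:
        entry["from_ocr"] = from_ocr
    # The file must exist locally for the checksum to be computed
    entry["checksum"] = sha256_checksum(source_fp)
    return entry
```

For example, make_doc_entry("../Franklin County, Indiana.pdf", date=(2023, 5, None), check_if_legal_doc=False, from_ocr=False) would produce a dict shaped like the second entry above, minus the source URL.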
Updating COMPASS Run Config
Once the local document mapping is compiled, you can point COMPASS to it via the main run config. You will also need to disable search so that COMPASS doesn’t attempt to retrieve documents from the web in addition to processing your local files. The rest of the config can be set up as a typical COMPASS run config with out_dir, tech, and any other relevant settings. Below is a simple example:
{
    // Same as a typical COMPASS config
    "out_dir": "./outputs",
    "jurisdiction_fp": "./jurisdictions.csv",
    "tech": "wind",
    // NEW: Point to local docs mapping
    "known_local_docs": "./local_docs.json5",
    // NEW: Disable web search since we already have local docs
    "perform_se_search": false,
    "perform_website_search": false
}
Note
If you are not sure whether your local docs contain the relevant information to be extracted, you can leave web search enabled. COMPASS will then fall back to a web search if no structured data is extracted from the local documents.
Of course, your jurisdiction CSV should still be set up to match the jurisdictions you would like to process:
County,State
Decatur,Indiana
Franklin,Indiana
In this way, you can build up a corpus of local docs, point your config to the document mapping, and only ever process the jurisdiction(s) you are interested in.
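As the corpus grows, it can help to sanity-check the mapping before launching a run. The sketch below is a hypothetical pre-flight check, not a COMPASS feature: it takes the mapping as an already-loaded Python dict and reports entries that lack source_fp or point to a missing file. It assumes relative paths are resolved against a base_dir that you supply; this page does not state which directory COMPASS itself resolves them against.

```python
from pathlib import Path


def find_mapping_problems(mapping, base_dir="."):
    """Return a list of human-readable problems found in a local docs mapping."""
    problems = []
    for code, docs in mapping.items():
        if not isinstance(docs, list):
            problems.append(f"{code}: expected a list of document dicts")
            continue
        for i, doc in enumerate(docs):
            source_fp = doc.get("source_fp") if isinstance(doc, dict) else None
            if not source_fp:
                problems.append(f"{code}[{i}]: missing required 'source_fp'")
                continue
            # Assumption: relative paths are resolved against ``base_dir``;
            # an absolute ``source_fp`` is used as-is by pathlib's join
            if not (Path(base_dir) / source_fp).is_file():
                problems.append(f"{code}[{i}]: file not found: {source_fp}")
    return problems
```

An empty list means that every document dict has a source_fp that resolves to an existing file under the chosen base_dir.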
Running COMPASS
Once everything is configured, you can execute a model run as described in the COMPASS Execution Basics:
compass process -c config.json5
If you are using pixi:
pixi run compass process -c config.json5
Outputs are written to the directory set by out_dir, which is ./outputs in the config above.