compass.pipeline.collection.base.DocumentCollection

class DocumentCollection(workflow)

Bases: object

Workflow object that applies a fixed pipeline of steps

Parameters:

workflow (compass.pipeline.jurisdiction.SingleJurisdictionRun) – The workflow for the jurisdiction being processed. It is passed to each collection step, which may use it to access jurisdiction information and other relevant data, and to determine whether website search is enabled.

Methods

execute(*[, eager_extract, relative_to])

Run the fixed collection sequence

async execute(*, eager_extract=False, relative_to=None)

Run the fixed collection sequence

The document collection has a well-defined order:

  1. Process any/all known local documents

  2. Process any/all known document URLs

  3. Search for ordinance documents using a search engine

  4. Search for ordinance documents by crawling the jurisdiction website

Users can disable any of these steps via the workflow configuration.
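
The fixed order and per-step opt-out described above can be sketched as follows. This is a minimal illustration and not the library's implementation: `Step`, `make_step`, `run_fixed_sequence`, and the `enabled` flag are hypothetical names, and the actual workflow configuration keys that disable steps are not shown on this page.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Step:
    """Hypothetical stand-in for one collection step."""

    name: str
    run: Callable[[], Awaitable[list]]
    enabled: bool = True


async def run_fixed_sequence(steps):
    """Run enabled steps in the order given, skipping disabled ones."""
    collected = {}
    for step in steps:
        if not step.enabled:
            continue
        collected[step.name] = await step.run()
    return collected


def make_step(name, docs, enabled=True):
    """Build a step that simply returns a fixed list of documents."""

    async def _run():
        return list(docs)

    return Step(name, _run, enabled)


# The order mirrors the four documented stages; the contents are placeholders.
STEPS = [
    make_step("local_documents", ["ordinance.pdf"]),
    make_step("known_urls", ["https://example.com/ordinance"]),
    make_step("search_engine", [], enabled=False),  # disabled by the user
    make_step("website_crawl", []),
]

result = asyncio.run(run_fixed_sequence(STEPS))
```

In this sketch the order is carried by the list itself, so disabling a step never changes the relative order of the remaining ones.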

Parameters:
  • eager_extract (bool, optional) – Whether to apply extraction as soon as any documents are found. If the extraction returns any structured data, subsequent steps are skipped for that jurisdiction. By default, False.

  • relative_to (path-like, optional) – Directory to use as the root of all relative paths. By default, None.
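
One way to picture how these two options might behave is the sketch below. It is an assumption-laden illustration, not the library's code: `resolve`, `collect`, `extract`, and the step callables are hypothetical, and how the library actually resolves paths or invokes extraction is not documented on this page.

```python
import asyncio
from pathlib import Path


def resolve(path, relative_to=None):
    """Anchor a relative path at ``relative_to``; leave absolute paths alone."""
    path = Path(path)
    if relative_to is None or path.is_absolute():
        return path
    return Path(relative_to) / path


async def collect(steps, extract, *, eager_extract=False):
    """Run steps in order; with eager extraction, stop at the first hit."""
    documents = []
    for step in steps:
        found = await step()
        documents.extend(found)
        if eager_extract and found:
            structured = await extract(found)
            if structured:
                return structured  # later steps are skipped
    if eager_extract:
        return None  # nothing structured was extracted
    return {"documents": documents}


calls = []  # records which hypothetical steps actually ran


async def local_step():
    calls.append("local")
    return [resolve("docs/ordinance.pdf", relative_to="county_inputs")]


async def search_step():
    calls.append("search")
    return []


async def extract(docs):
    # Placeholder "structured data": just a count of the documents seen.
    return {"n_documents": len(docs)} if docs else {}


eager = asyncio.run(collect([local_step, search_step], extract, eager_extract=True))
```

Here the eager run returns after the first step yields structured data, so `search_step` is never awaited; a non-eager run would visit both steps and return the collection dictionary instead.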

Returns:

dict or None – If eager_extract is False, a dictionary containing collection information and metadata. If eager_extract is True, the result of the extraction workflow if any structured data was extracted, or None if no structured data was extracted from any of the collected documents.
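
Because the return type depends on eager_extract, callers will likely want to branch on None in the eager case. The snippet below shows that calling pattern against a stand-in class; `StubCollection` and its return values are hypothetical and only mimic the documented `execute` signature.

```python
import asyncio


class StubCollection:
    """Stand-in with the documented ``execute`` signature (not the real class)."""

    async def execute(self, *, eager_extract=False, relative_to=None):
        if eager_extract:
            return None  # pretend no structured data was extracted
        return {"jurisdiction": "example"}  # placeholder collection info


async def main(collection):
    # Default mode: a dictionary of collection information and metadata.
    info = await collection.execute()

    # Eager mode: extraction result, or None if nothing was extracted.
    extracted = await collection.execute(eager_extract=True)
    if extracted is None:
        status = "no structured data extracted"
    else:
        status = "extraction succeeded"
    return info, status


info, status = asyncio.run(main(StubCollection()))
```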