compass.extraction.apply.check_for_relevant_text

async check_for_relevant_text(doc, model_config, heuristic, tech, text_collectors, usage_tracker=None, min_chunks_to_process=3)

Parse a single document for relevant text (e.g. ordinances).

The results of the text parsing are stored in the document’s attrs under the respective text collector label.

Parameters:
  • doc (elm.web.document.BaseDocument) – A document instance (PDF, HTML, etc.) potentially containing ordinance information. Note that if the document’s attrs already contains the relevant text output, the corresponding text collector will not be run. To force a document to be processed by this function, remove all previously collected text from the document’s attrs.

  • model_config (compass.llm.config.LLMConfig) – Configuration describing which LLM service, splitter, and call parameters should be used for extraction.

  • heuristic (object) – Domain-specific heuristic implementing a check method to qualify text chunks for further processing.

  • tech (str) – Technology of interest (e.g. “solar” or “wind”). This is used to set up some document validation decision trees.

  • text_collectors (Iterable) – Iterable of text collector classes to run during document parsing. Each class must implement the compass.plugin.interface.BaseTextCollector interface. If the document already contains text collected by a given collector (i.e. the collector’s OUT_LABEL is found in doc.attrs), that collector will be skipped.

  • usage_tracker (UsageTracker, optional) – Optional tracker instance to monitor token usage during LLM calls. By default, None.

  • min_chunks_to_process (int, optional) – Minimum number of chunks to process before aborting because the text fails the heuristic or is deemed not to be legal text (if applicable). By default, 3.
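
The skip rule described for doc and text_collectors can be sketched as follows. This is a minimal illustration, not the library's implementation: FakeDoc, OrdinanceCollector, PermittedUseCollector, collectors_to_run, and clear_collected_text are hypothetical stand-ins, and the label strings are made up. Only the OUT_LABEL attribute and the doc.attrs lookup come from the descriptions above.

```python
class FakeDoc:
    """Hypothetical stand-in for a document; only ``attrs`` matters here."""

    def __init__(self, attrs=None):
        self.attrs = dict(attrs or {})


class OrdinanceCollector:
    """Hypothetical collector; real ones implement BaseTextCollector."""

    OUT_LABEL = "ordinance_text"  # assumed label, for illustration only


class PermittedUseCollector:
    """Second hypothetical collector with its own output label."""

    OUT_LABEL = "permitted_use_text"  # assumed label, for illustration only


def collectors_to_run(doc, text_collectors):
    """Return the collectors whose OUT_LABEL is not yet in ``doc.attrs``."""
    return [c for c in text_collectors if c.OUT_LABEL not in doc.attrs]


def clear_collected_text(doc, text_collectors):
    """Remove previously collected text so every collector runs again."""
    for collector in text_collectors:
        doc.attrs.pop(collector.OUT_LABEL, None)


collectors = [OrdinanceCollector, PermittedUseCollector]
doc = FakeDoc({"ordinance_text": "previously collected text"})

# Only PermittedUseCollector is pending, since "ordinance_text" is present.
pending = collectors_to_run(doc, collectors)

# Dropping the stored labels makes both collectors pending again.
clear_collected_text(doc, collectors)
pending_after_clear = collectors_to_run(doc, collectors)
```

Clearing the labels before calling check_for_relevant_text is the documented way to force reprocessing; the helper above is just one possible way to do that.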

Returns:

bool – True if any text was collected by any of the text collectors and False otherwise.

Notes

The function updates progress bar logging as chunks are processed.
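
One way to read min_chunks_to_process together with the note above is as a chunk loop with an early exit. The sketch below is an assumption about that behaviour rather than the actual implementation: scan_chunks, KeywordHeuristic, and the on_progress callback are hypothetical names, and the legal-text validation step is left out entirely. Only the idea of a heuristic exposing a check method and a minimum chunk count comes from the parameter descriptions.

```python
class KeywordHeuristic:
    """Hypothetical heuristic exposing a ``check`` method."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def check(self, text):
        lowered = text.lower()
        return any(k in lowered for k in self.keywords)


def scan_chunks(chunks, heuristic, min_chunks_to_process=3, on_progress=None):
    """Return chunks passing the heuristic, aborting early if none pass.

    Assumed rule: once ``min_chunks_to_process`` chunks have been seen
    without a single pass, the rest of the document is skipped.
    """
    passed = []
    for index, chunk in enumerate(chunks, start=1):
        if heuristic.check(chunk):
            passed.append(chunk)
        if on_progress is not None:
            on_progress(index, len(chunks))  # e.g. advance a progress bar
        if not passed and index >= min_chunks_to_process:
            break  # nothing relevant so far; stop processing
    return passed


heuristic = KeywordHeuristic(["setback", "turbine"])

# No chunk passes within the first three, so the fourth is never examined.
progress = []
irrelevant_first = [
    "Meeting minutes",
    "Budget report",
    "Parks schedule",
    "Turbine setback rules",
]
found = scan_chunks(
    irrelevant_first, heuristic, on_progress=lambda i, n: progress.append(i)
)

# An early pass keeps the loop running through the whole document.
relevant_first = ["Wind turbine setback of 500 ft", "Budget report"]
found_relevant = scan_chunks(relevant_first, heuristic)
```

Under this reading, a larger min_chunks_to_process trades extra processing for a lower chance of discarding a document whose relevant text appears late.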