compass.extraction.apply.check_for_relevant_text
- async check_for_relevant_text(doc, model_config, heuristic, tech, text_collectors, usage_tracker=None, min_chunks_to_process=3)
Parse a single document for relevant text (e.g. ordinances).
The results of the text parsing are stored in the document's attrs under the respective text collector label.
- Parameters:
doc (elm.web.document.BaseDocument) – A document instance (PDF, HTML, etc) potentially containing ordinance information. Note that if the document’s attrs has the relevant text output, the corresponding text collector will not be run. To force a document to be processed by this function, remove all previously collected text from the document’s attrs.
model_config (compass.llm.config.LLMConfig) – Configuration describing which LLM service, splitter, and call parameters should be used for extraction.
heuristic (object) – Domain-specific heuristic implementing a check method to qualify text chunks for further processing.
tech (str) – Technology of interest (e.g. “solar”, “wind”, etc). This is used to set up some document validation decision trees.
text_collectors (Iterable) – Iterable of text collector classes to run during document parsing. Each class must implement the compass.plugin.interface.BaseTextCollector interface. If the document already contains text collected by a given collector (i.e. the collector’s OUT_LABEL is found in doc.attrs), that collector will be skipped.
usage_tracker (UsageTracker, optional) – Optional tracker instance to monitor token usage during LLM calls. By default, None.
min_chunks_to_process (int, optional) – Minimum number of chunks to process before aborting due to text failing the heuristic or deemed not legal (if applicable). By default, 3.
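The skip rule described for doc and text_collectors can be pictured with the minimal sketch below. It does not import COMPASS: FakeDoc, OrdinanceCollector, PermitCollector, collectors_to_run, and clear_collected_text are hypothetical stand-ins, and the label strings are invented for illustration. Only the attrs mapping and the OUT_LABEL class attribute come from the parameter descriptions.

```python
class FakeDoc:
    """Hypothetical stand-in for a document exposing an ``attrs`` dict."""

    def __init__(self, attrs=None):
        self.attrs = dict(attrs or {})


class OrdinanceCollector:
    """Hypothetical collector; only ``OUT_LABEL`` mirrors the docs."""

    OUT_LABEL = "ordinance_text"  # invented label, for illustration only


class PermitCollector:
    """Second hypothetical collector with a different output label."""

    OUT_LABEL = "permit_text"  # invented label, for illustration only


def collectors_to_run(doc, text_collectors):
    """Return the collectors whose output label is not yet in ``doc.attrs``."""
    return [c for c in text_collectors if c.OUT_LABEL not in doc.attrs]


def clear_collected_text(doc, text_collectors):
    """Drop previously collected text so every collector would run again."""
    for collector in text_collectors:
        doc.attrs.pop(collector.OUT_LABEL, None)


collectors = [OrdinanceCollector, PermitCollector]
doc = FakeDoc({"ordinance_text": "Setback: 500 ft"})

# Only the collector without stored output is still pending.
pending = collectors_to_run(doc, collectors)

# Removing the stored text forces both collectors to be considered again.
clear_collected_text(doc, collectors)
pending_after_clear = collectors_to_run(doc, collectors)
```

Here pending holds only PermitCollector, while pending_after_clear holds both collectors, which corresponds to the documented way of forcing a document to be reprocessed.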
- Returns:
bool – True if any text was collected by any of the text collectors and False otherwise.
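One possible reading of how min_chunks_to_process and the boolean return value interact is sketched below. This is an assumption about the abort rule, not the library's implementation: scan_chunks and mentions_setback are hypothetical helpers, and the toy check stands in for the real heuristic and LLM validation.

```python
import asyncio


async def scan_chunks(chunks, passes_check, min_chunks_to_process=3):
    """Collect passing chunks; stop early if the first N chunks all fail."""
    collected = []
    for index, chunk in enumerate(chunks, start=1):
        if await passes_check(chunk):
            collected.append(chunk)
        # Assumed abort rule: once the minimum number of chunks has been
        # examined and nothing has qualified, give up on the document.
        if index >= min_chunks_to_process and not collected:
            break
    return collected


async def mentions_setback(chunk):
    """Toy check standing in for the heuristic and LLM validation."""
    return "setback" in chunk.lower()


async def main():
    relevant = ["Intro", "Setback of 500 ft applies", "Appendix"]
    irrelevant = ["Menu", "Contact", "Hours", "Setback trivia"]
    found = await scan_chunks(relevant, mentions_setback)
    missed = await scan_chunks(irrelevant, mentions_setback)
    # Mirror the documented return type: True only if something was collected.
    return bool(found), bool(missed)


found_any, missed_any = asyncio.run(main())
```

Under this reading, the second document is abandoned after three failing chunks, so its fourth chunk is never examined and the result is False.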
Notes
The function updates progress bar logging as chunks are processed.