compass.pipeline.data_classes.ExtractionRequest
- class ExtractionRequest(out_dir, tech, jurisdiction_fp, collection_manifest_fp, *, model='gpt-4o-mini', max_num_concurrent_jurisdictions=25, file_loader_kwargs=None, td_kwargs=None, tpe_kwargs=None, ppe_kwargs=None, log_dir=None, clean_dir=None, ordinance_file_dir=None, jurisdiction_dbs_dir=None, llm_costs=None, log_level='INFO', keep_async_logs=False)
Bases:
BaseRequestParameter Object for extraction mode
- Parameters:
out_dir (path-like) – Path to the output directory. If it does not exist, it will be created. This directory will contain the saved collection manifest, downloaded ordinance documents, parsed document text, usage metadata, and default subdirectories for logs and intermediate outputs (unless otherwise specified).
tech (str) – Label indicating which technology type is being processed. Must be one of the keys of PLUGIN_REGISTRY.
jurisdiction_fp (path-like) – Path to a CSV file specifying the jurisdictions to process. The CSV must contain at least two columns: “County” and “State”, which specify the county and state names, respectively. To process a subdivision within a county, you must also include “Subdivision” and “Jurisdiction Type” columns. “Subdivision” should be the name of the subdivision, and “Jurisdiction Type” should be a string identifying the type of subdivision (e.g., “City”, “Township”, etc.).
collection_manifest_fp (path-like) – Path to the JSON collection manifest created by the document collection step. The manifest must contain the persisted document information needed to reload each collected document for extraction.
model (str or list of dict, default "gpt-4o-mini") – LLM model(s) to use for scraping and parsing ordinance documents. If a string is provided, it is assumed to be the name of the default model (e.g., “gpt-4o”), and environment variables are used for authentication.
If a list is provided, it should contain dictionaries of arguments that can initialize instances of OpenAIConfig. Each dictionary can specify the model name, client type, and initialization arguments.
Each dictionary must also include a tasks key, which maps to a string or list of strings indicating the tasks that instance should handle. Exactly one of the instances must include “default” as a task, which will be used when no specific task is matched. For example:
"model": [
    {
        "model": "gpt-4o-mini",
        "llm_call_kwargs": {
            "temperature": 0,
            "timeout": 300,
        },
        "client_kwargs": {
            "api_key": "<your_api_key>",
            "api_version": "<your_api_version>",
            "azure_endpoint": "<your_azure_endpoint>",
        },
        "tasks": ["default", "date_extraction"],
    },
    {
        "model": "gpt-4o",
        "client_type": "openai",
        "tasks": ["ordinance_text_extraction"],
    }
]
Important
If you are using Azure OpenAI, the model name used here must match your deployment name. For example, if you deployed the GPT-4o-mini model under the name "gpt-4o-mini-2025-04-11", you would want to set "model": "gpt-4o-mini-2025-04-11".
By default, "gpt-4o-mini".
max_num_concurrent_jurisdictions (int, default 25) – Maximum number of jurisdictions to process concurrently. Limiting this can help manage memory usage when dealing with a large number of documents. By default, 25.
file_loader_kwargs (dict, optional) – Dictionary of keyword argument pairs to initialize elm.web.file_loader.AsyncWebFileLoader. If found, the "pw_launch_kwargs" key in these will also be used to initialize the Playwright-backed Google search used for search engine retrieval. By default, None.
td_kwargs (dict, optional) – Additional keyword arguments to pass to tempfile.TemporaryDirectory. The temporary directory is used to store documents that have not yet been confirmed to contain relevant information. By default, None.
tpe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ThreadPoolExecutor, used for I/O-bound tasks such as logging and file writes. By default, None.
ppe_kwargs (dict, optional) – Additional keyword arguments to pass to concurrent.futures.ProcessPoolExecutor, used for CPU-bound tasks such as PDF loading and parsing. By default, None.
log_dir (path-like, optional) – Path to the directory for storing log files. If not provided, a logs subdirectory will be created inside out_dir. By default, None.
clean_dir (path-like, optional) – Path to the directory for storing cleaned ordinance text output. If not provided, a cleaned_text subdirectory will be created inside out_dir. By default, None.
ordinance_file_dir (path-like, optional) – Path to the directory where downloaded ordinance files (PDFs or HTML) for each jurisdiction are stored. If not provided, an ordinance_files subdirectory will be created inside out_dir. By default, None.
jurisdiction_dbs_dir (path-like, optional) – Path to the directory where parsed ordinance database files are stored for each jurisdiction. If not provided, a jurisdiction_dbs subdirectory will be created inside out_dir. By default, None.
llm_costs (dict, optional) – Dictionary mapping model names to their token costs, used to track the estimated total cost of LLM usage during the run. The structure should be:
{"model_name": {"prompt": float, "response": float}}
Costs are specified in dollars per million tokens. For example:
"llm_costs": {"my_gpt": {"prompt": 1.5, "response": 3}}
registers a model named “my_gpt” with a cost of $1.50 per million input (prompt) tokens and $3 per million output (response) tokens for the current processing run.
Note
The displayed total cost does not track cached tokens, so treat it as an estimate. Your final API costs may vary.
If set to None, no custom model costs are recorded, and cost tracking may be unavailable in the progress bar. By default, None.
log_level (str, default "INFO") – Logging level for ordinance scraping and parsing (e.g., “TRACE”, “DEBUG”, “INFO”, “WARNING”, or “ERROR”). By default, "INFO".
keep_async_logs (bool, default False) – Option to store the full asynchronous log record to a file. This is only useful if you intend to monitor overall processing progress from a file instead of from the terminal. If True, all of the unordered records are written to an “all.log” file in the log_dir directory. By default, False.
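To make the llm_costs units concrete, the following is a minimal sketch of how an estimated cost could be derived from token counts, assuming rates are given in dollars per million tokens as described for llm_costs. The function name estimate_cost and the token counts are hypothetical and are not part of COMPASS; the library's own accounting may differ.

```python
def estimate_cost(llm_costs, model_name, prompt_tokens, response_tokens):
    """Estimate dollars spent for one model.

    Hypothetical helper for illustration only, not a COMPASS API.
    Rates in ``llm_costs`` are assumed to be dollars per million tokens.
    """
    rates = llm_costs.get(model_name)
    if rates is None:
        # No rates registered for this model, so no estimate is possible
        return None
    total = (
        prompt_tokens * rates["prompt"]
        + response_tokens * rates["response"]
    )
    return total / 1_000_000


# Same structure as the llm_costs example in the parameter description
llm_costs = {"my_gpt": {"prompt": 1.5, "response": 3}}

# Assumed token counts: 2M prompt tokens and 0.5M response tokens
cost = estimate_cost(
    llm_costs, "my_gpt", prompt_tokens=2_000_000, response_tokens=500_000
)
print(cost)  # 4.5
```

With these assumed counts, the prompt side contributes $3.00 and the response side $1.50. Cached tokens are ignored here, which mirrors the note that the displayed total is only an estimate.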
Methods
Attributes
COMPASSRunMode associated with this request type
Mapping of LLM task to OpenAIConfig for this request
- MODE = 'extract'
COMPASSRunMode associated with this request type
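As a rough illustration of the task mapping described for the model parameter, the sketch below checks that exactly one entry in a model list claims the "default" task and builds a task-to-entry lookup. The helper build_task_map and the task name "some_unlisted_task" are hypothetical, and the code operates on plain dictionaries; it is not how COMPASS itself constructs OpenAIConfig instances.

```python
def build_task_map(model_configs):
    """Map each task name to the config dictionary that handles it.

    Hypothetical helper for illustration only, not a COMPASS API.
    """
    task_map = {}
    default_count = 0
    for config in model_configs:
        tasks = config["tasks"]
        if isinstance(tasks, str):
            # A single task may be given as a bare string
            tasks = [tasks]
        if "default" in tasks:
            default_count += 1
        for task in tasks:
            task_map[task] = config
    if default_count != 1:
        raise ValueError(
            f"Expected exactly one 'default' entry, found {default_count}"
        )
    return task_map


# Trimmed version of the model list shown in the parameter description
models = [
    {"model": "gpt-4o-mini", "tasks": ["default", "date_extraction"]},
    {
        "model": "gpt-4o",
        "client_type": "openai",
        "tasks": ["ordinance_text_extraction"],
    },
]

task_map = build_task_map(models)

# Tasks without a dedicated entry fall back to the "default" entry
fallback = task_map.get("some_unlisted_task", task_map["default"])
print(fallback["model"])  # gpt-4o-mini
```

Failing early on a missing or duplicated "default" entry is a design choice of this sketch; it surfaces a misconfigured model list before any documents are processed.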