reVeal characterize#

Execute the characterize step from a config file.

Characterize a vector grid based on specified raster and vector datasets. Outputs a new GeoPackage containing the input grid with added attributes for the user-specified characterizations.

The general structure for calling this CLI command is given below (add --help to print help info to the terminal).

Usage

reVeal characterize [OPTIONS]

Options

-c, --config_file <config_file>#

Required Path to the characterize configuration file. Below is a sample template config:

{
    "execution_control": {
        "option": "local",
        "allocation": "[REQUIRED IF ON HPC]",
        "walltime": "[REQUIRED IF ON HPC]",
        "qos": "normal",
        "memory": null,
        "queue": null,
        "feature": null,
        "conda_env": null,
        "module": null,
        "sh_script": null,
        "keep_sh": false,
        "num_test_nodes": null,
        "max_workers": null
    },
    "log_directory": "./logs",
    "log_level": "INFO",
    "data_dir": "[REQUIRED]",
    "grid": "[REQUIRED]",
    "characterizations": "[REQUIRED]",
    "expressions": "[REQUIRED]"
}

Parameters#

execution_control : dict

Dictionary containing execution control arguments. Allowed arguments are:

option:

({‘local’, ‘kestrel’, ‘eagle’, ‘awspc’, ‘slurm’, ‘peregrine’}) Hardware run option. Determines the type of job scheduler to use as well as the base AU cost. The “slurm” option is a catchall for HPC systems that use the SLURM scheduler and should only be used if the desired hardware is not listed above. If “local”, no other HPC-specific keys are required in execution_control (they are ignored if provided).

allocation:

(str) HPC project (allocation) handle.

walltime:

(int) Node walltime request in hours.

qos:

(str, optional) Quality-of-service specifier. For Kestrel users: This should be one of {‘standby’, ‘normal’, ‘high’}. Note that ‘high’ priority doubles the AU cost. By default, "normal".

memory:

(int, optional) Node memory max limit (in GB). By default, None, which uses the scheduler’s default memory limit. For Kestrel users: If you would like to use the full node memory, leave this argument unspecified (or set to None) if you are running on standard nodes. However, if you would like to use the bigmem nodes, you must specify the full upper limit of memory you would like for your job, otherwise you will be limited to the standard node memory size (250GB).

max_workers:

([int, NoneType], optional) Maximum number of workers to use for multiprocessing when running applicable methods in parallel. By default None, will use all available workers for applicable methods. Note that this value will only be applied to characterizations where max_workers is not specified at the characterization-level configuration.

queue:

(str, optional; PBS ONLY) HPC queue to submit job to. Examples include: ‘debug’, ‘short’, ‘batch’, ‘batch-h’, ‘long’, etc. By default, None, which uses “test_queue”.

feature:

(str, optional) Additional flags for SLURM job (e.g. “-p debug”). By default, None, which does not specify any additional flags.

conda_env:

(str, optional) Name of conda environment to activate. By default, None, which does not load any environments.

module:

(str, optional) Module to load. By default, None, which does not load any modules.

sh_script:

(str, optional) Extra shell script to run before command call. By default, None, which does not run any scripts.

keep_sh:

(bool, optional) Option to keep the HPC submission script on disk. Only has effect if executing on HPC. By default, False, which purges the submission scripts after each job is submitted.

num_test_nodes:

(int, optional) Number of nodes to submit before terminating the submission process. This can be used to test a new submission configuration without submitting all nodes (i.e. only running a handful to ensure the inputs are specified correctly and the outputs look reasonable). By default, None, which submits all node jobs.

Only the option key is required for local execution. For execution on the HPC, the allocation and walltime keys are also required. All other options are populated with default values, as seen above.
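For instance, under these rules a minimal execution_control block for local execution needs only the option key:

```
"execution_control": {
    "option": "local"
}
```

whereas a minimal HPC submission on Kestrel also needs allocation and walltime (the values shown here are placeholders):

```
"execution_control": {
    "option": "kestrel",
    "allocation": "myproject",
    "walltime": 4
}
```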

log_directory : str

Path to directory where logs should be written. Path can be relative and does not have to exist on disk (it will be created if missing). By default, "./logs".

log_level : {“DEBUG”, “INFO”, “WARNING”, “ERROR”}

String representation of desired logger verbosity. Suitable options are DEBUG (most verbose), INFO (moderately verbose), WARNING (only log warnings and errors), and ERROR (only log errors). By default, "INFO".

data_dir : str

Path to parent directory containing all geospatial raster and vector datasets to be used for grid characterization.

grid : str

Path to gridded vector dataset for which characterization will be performed. Must be an existing vector polygon dataset in a format that can be opened by pyogrio. Does not strictly need to be a grid, but some functionality may not work if it is not.

characterizations : dict

Characterizations to be performed. Must be a dictionary keyed by the name of the output attribute for each characterization. Each value must be another dictionary with the following keys:

  • dset: String indicating relative path within data_dir to dataset to be characterized.

  • method: String indicating characterization method to be performed. Refer to reVeal.config.characterize.VALID_CHARACTERIZATION_METHODS.

  • attribute: Attribute to summarize. Only required for certain methods. Default is None/null.

  • weights_dset: String indicating relative path within data_dir to dataset to be used as weights. Only applies to characterization methods for rasters; ignored otherwise.

  • neighbor_order: Integer indicating the order of neighbors to include in the characterization of each grid cell. For example, neighbor_order = 1 would include first-order queen’s case neighbors. Optional, default is 0, which does not include neighbors.

  • buffer_distance: Float indicating buffer distance to apply in the characterization of each grid cell. Units are based on the CRS of the input grid dataset. For instance, a value of 500 in CRS EPSG:5070 would apply a buffer of 500m to each grid cell before characterization. Optional, default is 0, which does not apply a buffer.

  • parallel: Boolean indicating whether to run the characterization in parallel. This option only applies to methods specified as supports_parallel in reVeal.config.characterize.VALID_CHARACTERIZATION_METHODS. Default is True, which runs applicable methods in parallel and has no effect on other methods. This value should only be changed to False for small input grids, where the performance overhead of setting up parallel processing outweighs the speedup of running operations in parallel. As a general rule of thumb, as long as the number of grid cells in your grid is an order of magnitude larger than the number of cores available, using parallel=True should yield improved performance.

  • max_workers: Integer indicating the number of workers to use for parallel processing. Will only be applied to methods that support parallel processing. This input will take precedence over the top-level max_workers from the execution_control block (if any). If neither are specified, all available workers will be used for parallel processing.
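As an illustration, a characterizations dictionary with one raster and one vector characterization might look like the following. The output attribute names and dataset paths here are hypothetical, and the method placeholders must be replaced with valid names from reVeal.config.characterize.VALID_CHARACTERIZATION_METHODS:

```
"characterizations": {
    "mean_wind_speed": {
        "dset": "rasters/wind_speed.tif",
        "method": "[VALID RASTER METHOD]",
        "weights_dset": null,
        "neighbor_order": 0,
        "buffer_distance": 0,
        "parallel": true
    },
    "substation_count": {
        "dset": "vectors/substations.gpkg",
        "method": "[VALID VECTOR METHOD]",
        "buffer_distance": 500
    }
}
```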

expressions : dict

Additional expressions to be calculated. Must be a dictionary keyed by the name of the output attribute for each expression. Each value must be a string indicating the expression to be calculated. Expression strings can reference one or more attributes/keys defined in the characterizations dictionary.
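For example, assuming two characterizations named area_developable and area_total were defined in the characterizations dictionary (hypothetical names used here purely for illustration), an expression computing their ratio could be configured as:

```
"expressions": {
    "fraction_developable": "area_developable / area_total"
}
```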

Note that you may remove any keys with a null value if you do not intend to update them yourself.