mlclouds.grid_searcher.GridSearcher

class GridSearcher(output_ws, exe_fpath, data_root='/eaglefs/projects/mlclouds/data_surfrad_9/', conda_env='mlclouds', number_hidden_layers=(3,), number_hidden_nodes=(64,), dropouts=(0.01,), learning_rates=(0.001,), loss_weights_b=([0.5, 0.5],), test_fractions=(0.2,), epochs_a=(10,), epochs_b=(10,), n_batches=(16,), base_config={'epochs_a': 10, 'epochs_b': 10, 'features': ['solar_zenith_angle', 'cloud_type', 'refl_0_65um_nom', 'refl_0_65um_nom_stddev_3x3', 'refl_3_75um_nom', 'temp_3_75um_nom', 'temp_11_0um_nom', 'temp_11_0um_nom_stddev_3x3', 'cloud_probability', 'cloud_fraction', 'air_temperature', 'dew_point', 'relative_humidity', 'total_precipitable_water', 'surface_albedo'], 'hidden_layers': [{'activation': 'relu', 'dropout': 0.01, 'name': 'relu1', 'units': 64}, {'activation': 'relu', 'dropout': 0.01, 'name': 'relu2', 'units': 64}, {'activation': 'relu', 'dropout': 0.01, 'name': 'relu3', 'units': 64}], 'learning_rate': 0.001, 'loss_weights_a': [1, 0], 'loss_weights_b': [0.5, 0.5], 'metric': 'relative_mae', 'n_batch': 16, 'one_hot_categories': {'flag': ['clear', 'ice_cloud', 'water_cloud', 'bad_cloud']}, 'p_fun': 'p_fun_all_sky', 'p_kwargs': {'loss_terms': ['mae_ghi', 'mae_dni', 'mbe_ghi', 'mbe_dni']}, 'phygnn_seed': 0, 'surfrad_window_minutes': 15, 'training_prep_kwargs': {'filter_clear': False, 'nan_option': 'interp'}, 'y_labels': ['cld_opd_dcomp', 'cld_reff_dcomp']})[source]

Bases: object

Perform a grid search over provided model hyperparameters.

Parameters:
  • output_ws (str) – Filepath to folder used for config file and output files storage. Must have write access.

  • exe_fpath (str) – Filepath to ‘run_mlclouds.py’.

  • data_root (str) – Filepath to surfrad data root. Defaults to ‘/eaglefs/projects/mlclouds/data_surfrad_9/’.

  • conda_env (str) – Anaconda environment for HPC jobs. Defaults to ‘mlclouds’.

  • number_hidden_layers (list of int) – Number of fully connected, ReLU-activated hidden layers to compile; the <dropouts> rates are applied to each layer. Defaults to (3, ).

  • number_hidden_nodes (list of int) – Number of nodes in each of the <number_hidden_layers> hidden layers. Defaults to (64, ).

  • dropouts (list of float) – Dropout rates applied to each layer. Should be between 0 and 1. Defaults to (0.01, ).

  • learning_rates (list of float) – Model learning rates. Defaults to (0.001, ).

  • loss_weights_b (list of list of float) – Loss function weights applied in second round of training. First weight applies to MSE, second to physics loss function. Should sum to 1. Defaults to ([0.5, 0.5], ).

  • test_fractions (list of float) – Fraction of training samples to be withheld for testing. Should be between 0 and 1. Defaults to (0.2, ).

  • epochs_a (list of int) – Number of epochs to train without physics loss function applied. Defaults to (10, ).

  • epochs_b (list of int) – Number of epochs to train with physics loss function applied. Defaults to (10, ).

  • n_batches (list of int) – Training batch sizes. Defaults to (16, ).

  • base_config (dict) – Base configuration for the model. Defaults to:

    {"surfrad_window_minutes": 15,
     "features": ["solar_zenith_angle", "cloud_type", "refl_0_65um_nom",
                  "refl_0_65um_nom_stddev_3x3", "refl_3_75um_nom",
                  "temp_3_75um_nom", "temp_11_0um_nom",
                  "temp_11_0um_nom_stddev_3x3", "cloud_probability",
                  "cloud_fraction", "air_temperature", "dew_point",
                  "relative_humidity", "total_precipitable_water",
                  "surface_albedo"],
     "y_labels": ["cld_opd_dcomp", "cld_reff_dcomp"],
     "hidden_layers": [{"units": 64, "activation": "relu",
                        "name": "relu1", "dropout": 0.01},
                       {"units": 64, "activation": "relu",
                        "name": "relu2", "dropout": 0.01},
                       {"units": 64, "activation": "relu",
                        "name": "relu3", "dropout": 0.01}],
     "phygnn_seed": 0,
     "metric": "relative_mae",
     "learning_rate": 1e-3,
     "n_batch": 16,
     "epochs_a": 10,
     "epochs_b": 10,
     "loss_weights_a": [1, 0],
     "loss_weights_b": [0.5, 0.5],
     "p_kwargs": {"loss_terms": ["mae_ghi", "mae_dni", "mbe_ghi", "mbe_dni"]},
     "p_fun": "p_fun_all_sky",
     "training_prep_kwargs": {"filter_clear": False, "nan_option": "interp"},
     "one_hot_categories": {"flag": ["clear", "ice_cloud",
                                     "water_cloud", "bad_cloud"]}}
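Each hyperparameter argument accepts a tuple of candidate values, and one job is created per combination. The class builds its job list internally; a minimal sketch of that expansion, using a hypothetical `expand_grid` helper over a subset of the arguments, could look like:

```python
from itertools import product

# Hypothetical sketch: each constructor argument is a tuple of candidate
# values, and one job is created per element of the Cartesian product.
def expand_grid(number_hidden_layers=(3,), number_hidden_nodes=(64,),
                dropouts=(0.01, 0.1), learning_rates=(0.001, 0.0001)):
    keys = ['number_hidden_layers', 'number_hidden_nodes',
            'dropout', 'learning_rate']
    grids = [number_hidden_layers, number_hidden_nodes,
             dropouts, learning_rates]
    return [dict(zip(keys, combo)) for combo in product(*grids)]

jobs = expand_grid()
# 1 layer count x 1 node count x 2 dropouts x 2 learning rates = 4 jobs
```

Note that the grid size grows multiplicatively with every additional candidate value, so large grids can launch many HPC jobs.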

Methods

collect_results([fpath])

Collect and save training metrics for each successful job in self.jobs.

jobs_status()

Query SLURM queue for active jobs for current user.

run_grid_search([dry_run, walltime])

Start an HPC job for each job in self.jobs.

start_job(number_hidden_layers, ...[, ...])

Start a single HPC task for a single model run via run_mlclouds.py.

Attributes

output_ws

Filepath of folder to contain output files.

property output_ws

Filepath of folder to contain output files.

Returns:

output_ws (str) – Filepath of folder to contain output files.

start_job(number_hidden_layers, number_hidden_nodes, dropout, learning_rate, loss_weights_b, test_fraction, epochs_a, epochs_b, n_batch, run_id='0', walltime=1)[source]

Start a single HPC task for a single model run via run_mlclouds.py.

Parameters:
  • number_hidden_layers (int) – Number of fully connected, ReLU-activated hidden layers to compile; the <dropout> rate is applied to each layer.

  • number_hidden_nodes (int) – Number of nodes in each of the <number_hidden_layers> hidden layers.

  • dropout (float) – Dropout rate applied to each layer. Should be between 0 and 1.

  • learning_rate (float) – Model learning rate.

  • loss_weights_b (list of float) – Loss function weights applied in second round of training. First weight applies to MSE, second to physics loss function. Should sum to 1.

  • test_fraction (float) – Fraction of training samples to be withheld for testing. Should be between 0 and 1.

  • epochs_a (int) – Number of epochs to train without physics loss function applied.

  • epochs_b (int) – Number of epochs to train with physics loss function applied.

  • n_batch (int) – Training batch size.

  • run_id (str) – Run ID number. Defaults to ‘0’.

  • walltime (int) – HPC job walltime in hours. Defaults to 1 hour.
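Each job's layer hyperparameters correspond to the `hidden_layers` entry of the base configuration. As a hedged sketch (the actual config assembly lives inside the class, and `build_hidden_layers` is a hypothetical helper), the per-job layer list could be built like this:

```python
# Hypothetical sketch: fold one job's hyperparameters into the
# 'hidden_layers' structure used by base_config (same field names).
def build_hidden_layers(number_hidden_layers, number_hidden_nodes, dropout):
    return [{'units': number_hidden_nodes, 'activation': 'relu',
             'name': f'relu{i + 1}', 'dropout': dropout}
            for i in range(number_hidden_layers)]

layers = build_hidden_layers(3, 64, 0.01)
# Produces relu1, relu2, relu3 entries matching the default base_config.
```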

run_grid_search(dry_run=False, walltime=1)[source]

Start an HPC job for each job in self.jobs.

Parameters:
  • dry_run (bool) – Prepare runs without executing. Defaults to False.

  • walltime (int) – HPC job walltime in hours. Defaults to 1.

jobs_status()[source]

Query SLURM queue for active jobs for current user.

Returns:

status (str) – SLURM status for all active jobs for current user.

collect_results(fpath=None)[source]

Collect and save training metrics for each successful job in self.jobs.

Parameters:

fpath (str) – Output file path of results CSV. Pass None to skip saving.

Returns:

results (DataFrame) – Pandas DataFrame with columns: epoch, elapsed_time, training_loss, validation_loss, number_hidden_layers, number_hidden_nodes, dropout, learning_rate, loss_weights_b, test_fraction, epochs_a, epochs_b, n_batch.
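Because the returned DataFrame carries both the training metrics and the hyperparameters of each run, the best configuration can be read off directly. A minimal sketch with made-up values (only the column names come from the documentation above):

```python
import pandas as pd

# Illustrative frame using a subset of the documented columns;
# the values here are fabricated for the example.
results = pd.DataFrame({
    'number_hidden_layers': [2, 3, 3],
    'learning_rate': [0.001, 0.001, 0.0001],
    'validation_loss': [0.31, 0.24, 0.27],
})

# Pick the run with the lowest validation loss.
best = results.loc[results['validation_loss'].idxmin()]
```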