mlclouds.grid_searcher.GridSearcher

class GridSearcher(output_ws, exe_fpath, data_root='/eaglefs/projects/mlclouds/data_surfrad_9/', conda_env='mlclouds', number_hidden_layers=(3,), number_hidden_nodes=(64,), dropouts=(0.01,), learning_rates=(0.001,), loss_weights_b=([0.5, 0.5],), test_fractions=(0.2,), epochs_a=(10,), epochs_b=(10,), n_batches=(16,), base_config={'epochs_a': 10, 'epochs_b': 10, 'features': ['solar_zenith_angle', 'cloud_type', 'refl_0_65um_nom', 'refl_0_65um_nom_stddev_3x3', 'refl_3_75um_nom', 'temp_3_75um_nom', 'temp_11_0um_nom', 'temp_11_0um_nom_stddev_3x3', 'cloud_probability', 'cloud_fraction', 'air_temperature', 'dew_point', 'relative_humidity', 'total_precipitable_water', 'surface_albedo'], 'hidden_layers': [{'activation': 'relu', 'dropout': 0.01, 'name': 'relu1', 'units': 64}, {'activation': 'relu', 'dropout': 0.01, 'name': 'relu2', 'units': 64}, {'activation': 'relu', 'dropout': 0.01, 'name': 'relu3', 'units': 64}], 'learning_rate': 0.001, 'loss_weights_a': [1, 0], 'loss_weights_b': [0.5, 0.5], 'metric': 'relative_mae', 'n_batch': 16, 'one_hot_categories': {'flag': ['clear', 'ice_cloud', 'water_cloud', 'bad_cloud']}, 'p_fun': 'p_fun_all_sky', 'p_kwargs': {'loss_terms': ['mae_ghi', 'mae_dni', 'mbe_ghi', 'mbe_dni']}, 'phygnn_seed': 0, 'surfrad_window_minutes': 15, 'training_prep_kwargs': {'filter_clear': False, 'nan_option': 'interp'}, 'y_labels': ['cld_opd_dcomp', 'cld_reff_dcomp']})[source]

Bases: object

Perform a grid search over provided model hyperparameters.

Parameters:
  • output_ws (str) – Filepath to folder used for config file and output files storage. Must have write access.

  • exe_fpath (str) – Filepath to ‘run_mlclouds.py’.

  • data_root (str) – Filepath to surfrad data root. Defaults to ‘/eaglefs/projects/mlclouds/data_surfrad_9/’.

  • conda_env (str) – Anaconda environment for HPC jobs. Defaults to ‘mlclouds’.

  • number_hidden_layers (list of int) – Number of fully connected, ReLU-activated hidden layers to compile; the <dropouts> rates are applied to each layer. Defaults to (3, ).

  • number_hidden_nodes (list of int) – Number of nodes in each of the <number_hidden_layers> hidden layers. Defaults to (64, ).

  • dropouts (list of float) – Dropout rates applied to each layer. Should be between 0 and 1. Defaults to (0.01, ).

  • learning_rates (list of float) – Model learning rates. Defaults to (0.001, ).

  • loss_weights_b (list of list of float) – Loss function weights applied in second round of training. First weight applies to MSE, second to physics loss function. Should sum to 1. Defaults to ([0.5, 0.5], ).

  • test_fractions (list of float) – Fraction of training samples to be withheld for testing. Should be between 0 and 1. Defaults to (0.2, ).

  • epochs_a (list of int) – Number of epochs to train without physics loss function applied. Defaults to (10, ).

  • epochs_b (list of int) – Number of epochs to train with physics loss function applied. Defaults to (10, ).

  • n_batches (list of int) – Training batch sizes. Defaults to (16, ).

  • base_config (dict) – Base configuration for the model. Defaults to:

    {"surfrad_window_minutes": 15,
     "features": ["solar_zenith_angle", "cloud_type", "refl_0_65um_nom",
                  "refl_0_65um_nom_stddev_3x3", "refl_3_75um_nom",
                  "temp_3_75um_nom", "temp_11_0um_nom",
                  "temp_11_0um_nom_stddev_3x3", "cloud_probability",
                  "cloud_fraction", "air_temperature", "dew_point",
                  "relative_humidity", "total_precipitable_water",
                  "surface_albedo"],
     "y_labels": ["cld_opd_dcomp", "cld_reff_dcomp"],
     "hidden_layers": [{"units": 64, "activation": "relu",
                        "name": "relu1", "dropout": 0.01},
                       {"units": 64, "activation": "relu",
                        "name": "relu2", "dropout": 0.01},
                       {"units": 64, "activation": "relu",
                        "name": "relu3", "dropout": 0.01}],
     "phygnn_seed": 0,
     "metric": "relative_mae",
     "learning_rate": 1e-3,
     "n_batch": 16,
     "epochs_a": 10,
     "epochs_b": 10,
     "loss_weights_a": [1, 0],
     "loss_weights_b": [0.5, 0.5],
     "p_kwargs": {"loss_terms": ["mae_ghi", "mae_dni", "mbe_ghi", "mbe_dni"]},
     "p_fun": "p_fun_all_sky",
     "training_prep_kwargs": {"filter_clear": False, "nan_option": "interp"},
     "one_hot_categories": {"flag": ["clear", "ice_cloud",
                                     "water_cloud", "bad_cloud"]}}
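Each hyperparameter argument accepts a tuple of candidate values, and one job is created per combination. The class builds its job list internally; a minimal sketch of that expansion, using a hypothetical `expand_grid` helper over a subset of the arguments, could look like:

```python
from itertools import product

# Hypothetical sketch: each constructor argument is a tuple of candidate
# values, and one job is created per element of the Cartesian product.
def expand_grid(number_hidden_layers=(3,), number_hidden_nodes=(64,),
                dropouts=(0.01, 0.1), learning_rates=(0.001, 0.0001)):
    keys = ['number_hidden_layers', 'number_hidden_nodes',
            'dropout', 'learning_rate']
    grids = [number_hidden_layers, number_hidden_nodes,
             dropouts, learning_rates]
    return [dict(zip(keys, combo)) for combo in product(*grids)]

jobs = expand_grid()
# 1 layer count x 1 node count x 2 dropouts x 2 learning rates = 4 jobs
```

Note that the grid size grows multiplicatively with every additional candidate value, so large grids can launch many HPC jobs.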

Methods

collect_results([fpath])

Collect and save training metrics for each successful job in self.jobs.

jobs_status()

Query SLURM queue for active jobs for current user.

run_grid_search([dry_run, walltime])

Start an HPC job for each job in self.jobs.

start_job(number_hidden_layers, ...[, ...])

Start a single HPC task for a single model run via run_mlclouds.py.

Attributes

output_ws

Filepath of folder to contain output files.

property output_ws

Filepath of folder to contain output files.

Returns:

output_ws (str) – Filepath of folder to contain output files.

start_job(number_hidden_layers, number_hidden_nodes, dropout, learning_rate, loss_weights_b, test_fraction, epochs_a, epochs_b, n_batch, run_id='0', walltime=1)[source]

Start a single HPC task for a single model run via run_mlclouds.py.

Parameters:
  • number_hidden_layers (int) – Number of fully connected, ReLU-activated hidden layers to compile; the <dropout> rate is applied to each layer.

  • number_hidden_nodes (int) – Number of nodes in each of the <number_hidden_layers> hidden layers.

  • dropout (float) – Dropout rate applied to each layer. Should be between 0 and 1.

  • learning_rate (float) – Model learning rate.

  • loss_weights_b (list of float) – Loss function weights applied in second round of training. First weight applies to MSE, second to physics loss function. Should sum to 1.

  • test_fraction (float) – Fraction of training samples to be withheld for testing. Should be between 0 and 1.

  • epochs_a (int) – Number of epochs to train without physics loss function applied.

  • epochs_b (int) – Number of epochs to train with physics loss function applied.

  • n_batch (int) – Training batch size.

  • run_id (str) – Run ID number. Defaults to ‘0’.

  • walltime (int) – HPC job walltime in hours. Defaults to 1 hour.
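Each job's layer hyperparameters correspond to the `hidden_layers` entry of the base configuration. As a hedged sketch (the actual config assembly lives inside the class, and `build_hidden_layers` is a hypothetical helper), the per-job layer list could be built like this:

```python
# Hypothetical sketch: fold one job's hyperparameters into the
# 'hidden_layers' structure used by base_config (same field names).
def build_hidden_layers(number_hidden_layers, number_hidden_nodes, dropout):
    return [{'units': number_hidden_nodes, 'activation': 'relu',
             'name': f'relu{i + 1}', 'dropout': dropout}
            for i in range(number_hidden_layers)]

layers = build_hidden_layers(3, 64, 0.01)
# Produces relu1, relu2, relu3 entries matching the default base_config.
```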

run_grid_search(dry_run=False, walltime=1)[source]

Start an HPC job for each job in self.jobs.

Parameters:
  • dry_run (bool) – Prepare runs without executing. Defaults to False.

  • walltime (int) – HPC job walltime in hours. Defaults to 1.

jobs_status()[source]

Query SLURM queue for active jobs for current user.

Returns:

status (str) – SLURM status for all active jobs for current user.

collect_results(fpath=None)[source]

Collect and save training metrics for each successful job in self.jobs.

Parameters:

fpath (str) – Output file path of results CSV. Pass None to skip saving.

Returns:

results (DataFrame) – Pandas DataFrame with columns: epoch, elapsed_time, training_loss, validation_loss, number_hidden_layers, number_hidden_nodes, dropout, learning_rate, loss_weights_b, test_fraction, epochs_a, epochs_b, n_batch.
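Because the returned DataFrame carries both the training metrics and the hyperparameters of each run, the best configuration can be read off directly. A minimal sketch with made-up values (only the column names come from the documentation above):

```python
import pandas as pd

# Illustrative frame using a subset of the documented columns;
# the values here are fabricated for the example.
results = pd.DataFrame({
    'number_hidden_layers': [2, 3, 3],
    'learning_rate': [0.001, 0.001, 0.0001],
    'validation_loss': [0.31, 0.24, 0.27],
})

# Pick the run with the lowest validation loss.
best = results.loc[results['validation_loss'].idxmin()]
```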