mlclouds.data_handlers.TrainData

class TrainData(train_files, train_sites='all', config={'epochs_a': 100, 'epochs_b': 90, 'features': ['solar_zenith_angle', 'cloud_type', 'refl_0_65um_nom', 'refl_0_65um_nom_stddev_3x3', 'refl_3_75um_nom', 'temp_3_75um_nom', 'temp_11_0um_nom', 'temp_11_0um_nom_stddev_3x3', 'cloud_probability', 'cloud_fraction', 'air_temperature', 'dew_point', 'relative_humidity', 'total_precipitable_water', 'surface_albedo'], 'hidden_layers': [{'activation': 'relu', 'dropout': 0.1, 'units': 256}, {'activation': 'relu', 'dropout': 0.1, 'units': 256}, {'activation': 'relu', 'dropout': 0.1, 'units': 256}, {'activation': 'relu', 'dropout': 0.1, 'units': 256}, {'activation': 'relu', 'dropout': 0.1, 'units': 256}], 'learning_rate': 0.0005, 'loss_weights_a': [1, 0], 'loss_weights_b': [0.5, 0.5], 'metric': 'relative_mae', 'n_batch': 64, 'one_hot_categories': {'flag': ['clear', 'ice_cloud', 'water_cloud', 'bad_cloud']}, 'p_fun': 'p_fun_all_sky', 'p_kwargs': {'loss_terms': ['mae_ghi']}, 'phygnn_seed': 0, 'surfrad_window_minutes': 15, 'training_prep_kwargs': {'add_cloud_flag': True, 'filter_clear': False, 'filter_daylight': True, 'filter_sky_class': False, 'nan_option': 'interp', 'sza_lim': 89}, 'y_labels': ['cld_opd_dcomp', 'cld_reff_dcomp']}, test_fraction=None, nsrdb_files=None, cache_pattern=None)[source]

Bases: object

Load and prep training data

Parameters:
  • train_files (list | str) – File or list of files to use for training. Filenames must include the four-digit year.

  • train_sites (‘all’ | list of int) – SURFRAD gids to use for training. If ‘all’, all sites are used.

  • config (dict) – Dict of configuration options. See CONFIG for an example.

  • test_fraction (None | float) – Fraction of the full data set to reserve for testing. Should be between 0 and 1. The test set is randomly selected and dropped from the training set. If None, no test set is reserved.

  • nsrdb_files (list) – NSRDB files including irradiance data for the training sites. These are used to compute the sky class for these locations, which is then used to filter cloud type data for false positives / negatives. Each file needs to have a four-digit year and an east / west label.

  • cache_pattern (str) – File path pattern for saving training data, e.g. ./df_{}.csv. This will be used to save self.x, self.y, and self.p.
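
The test_fraction behaviour described above (a random subset is selected and dropped from the training set) can be sketched with the standard library alone. This is a minimal illustration, not the mlclouds implementation: the helper name split_test_fraction, the seed argument, and the rounding rule are all assumptions made for the example.

```python
import random


def split_test_fraction(n_rows, test_fraction, seed=0):
    """Return (train_idx, test_idx) as sorted lists of row indices.

    Hypothetical helper for illustration only; not part of mlclouds.
    """
    all_idx = list(range(n_rows))
    if test_fraction is None:
        # No test set is reserved; every row stays in the training set.
        return all_idx, []
    if not 0 <= test_fraction <= 1:
        raise ValueError("test_fraction should be between 0 and 1")

    rng = random.Random(seed)  # assumed seed, only for reproducibility here
    n_test = int(round(n_rows * test_fraction))  # assumed rounding rule
    # Randomly select the test rows, then drop them from the training rows.
    test_idx = sorted(rng.sample(all_idx, n_test))
    test_set = set(test_idx)
    train_idx = [i for i in all_idx if i not in test_set]
    return train_idx, test_idx


train_idx, test_idx = split_test_fraction(n_rows=10, test_fraction=0.2)
```

With n_rows=10 and test_fraction=0.2, two rows end up in the test set and the remaining eight stay in the training set; the two index lists never overlap.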

Methods

cache_exists(cache_pattern)

Check if cache files for df_raw and df_all_sky exist.

load_all_data(fp_pattern)

Load all df_raw / df_all_sky from csv files.

save_all_data(fp_pattern)

Save all raw / all_sky data to disk.

static cache_exists(cache_pattern)[source]

Check if cache files for df_raw and df_all_sky exist.
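
As a rough sketch of what such a check might look like, the snippet below expands a ./df_{}.csv style pattern with str.format and tests each resulting path. The helper cache_files_exist and the labels "raw" and "all_sky" used to fill the placeholder are assumptions for illustration; the names mlclouds actually substitutes are not stated on this page.

```python
import os
import tempfile


def cache_files_exist(cache_pattern, names=("raw", "all_sky")):
    """Return True only if every expanded cache file exists.

    Hypothetical stand-in for illustration; `names` is an assumption.
    """
    return all(os.path.exists(cache_pattern.format(name)) for name in names)


with tempfile.TemporaryDirectory() as tmp_dir:
    pattern = os.path.join(tmp_dir, "df_{}.csv")
    before = cache_files_exist(pattern)  # nothing has been written yet
    for name in ("raw", "all_sky"):
        # Write placeholder files so the second check can succeed.
        with open(pattern.format(name), "w") as f:
            f.write("placeholder\n")
    after = cache_files_exist(pattern)
```

The check is all-or-nothing in this sketch: a single missing file makes it return False, which is the conservative choice when deciding whether to rebuild a cache.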

save_all_data(fp_pattern)[source]

Save all raw / all_sky data to disk.

Parameters:

fp_pattern (str) – .csv filepath pattern to save data to, e.g. ./df_{}.csv

load_all_data(fp_pattern)[source]

Load all df_raw / df_all_sky from csv files.

Parameters:

fp_pattern (str) – .csv filepath pattern to load data from, e.g. ./df_{}.csv
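
To make the save / load pairing concrete, here is a small round-trip sketch that uses the same kind of filepath pattern. It swaps in the standard-library csv module and plain lists of dicts in place of the data frames the class works with, and the functions save_tables and load_tables, the "raw" / "all_sky" labels, and the row values are all made up for the example.

```python
import csv
import os
import tempfile


def save_tables(tables, fp_pattern):
    """Write each table (a list of dict rows) to fp_pattern.format(name)."""
    for name, rows in tables.items():
        with open(fp_pattern.format(name), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


def load_tables(names, fp_pattern):
    """Read every named table back from fp_pattern.format(name)."""
    out = {}
    for name in names:
        with open(fp_pattern.format(name), newline="") as f:
            out[name] = list(csv.DictReader(f))
    return out


# Placeholder rows; the values are illustrative, not real measurements.
tables = {
    "raw": [{"gid": "0", "cloud_type": "3"}],
    "all_sky": [{"gid": "0", "ghi": "512.0"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pattern = os.path.join(tmp_dir, "df_{}.csv")
    save_tables(tables, pattern)
    restored = load_tables(tables.keys(), pattern)
```

Because the csv module reads every field back as a string, the placeholder values are stored as strings so that the restored tables compare equal to the originals.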