sup3r.preprocessing.samplers.base.Sampler

class Sampler(data, sample_shape: tuple | None = None, batch_size: int = 16, feature_sets: dict | None = None, proxy_obs_kwargs: dict | None = None, mode: str = 'lazy')

Bases: Container

Basic Sampler class for iterating through batches of samples from the contained data.

Parameters:
  • data (Union[Sup3rX, Sup3rDataset]) – Object with data to sample from. This is usually the .data attribute of a Container object (e.g. Loader, Rasterizer, Deriver), as long as the spatial dimensions are not flattened.

  • sample_shape (tuple) – Size of arrays to sample from the contained data.

  • batch_size (int) – Number of samples used to build a single batch. A sample of shape (sample_shape[0], sample_shape[1], batch_size * sample_shape[2]) is first selected from the underlying dataset and then reshaped into (batch_size, *sample_shape) to get a single batch. This is more efficient than getting N = batch_size samples and then stacking them.

  • feature_sets (Optional[dict]) – Optional dictionary describing how the full set of features is split between lr_features, hr_exo_features, and hr_out_features.

    lr_features : list | tuple

    List of feature names or patt*erns to use as low-resolution model inputs. If no entry is provided then all available features from the data will be used.

    hr_out_features : list | tuple

    List of feature names or patt*erns that should be output by the generative model and available as ground truth targets. If no entry is provided then all features in lr_features will be used.

    hr_exo_features : list | tuple

    List of feature names or patt*erns that should be available as high-resolution model inputs (like topography or observations) or for bespoke loss functions. Features used as inputs are injected into the model mid-network to condition output on high-resolution information. The model configuration should have the appropriate layers to use these features, e.g. Sup3rConcat for topography injection, or Sup3rObsModel or Sup3rCrossAttention for obs injection. If no entry is provided then hr_exo_features will be empty.

    *To include sparse features as inputs or targets the features must have an “_obs” suffix.

  • proxy_obs_kwargs (dict | None) – Optional dictionary of keyword arguments to pass to the proxy observation generator. This is only used when training with proxy observations. Keys can include onshore_obs_frac, offshore_obs_frac, and perturbation_scale.

    perturbation_scale : float

    Scale of the perturbation added to the proxy observations. This specifies the multiplier of the noise sampled from (-standard deviation, standard deviation). The standard deviation is calculated per feature over each batch.

    onshore_obs_frac : float | dict

    Fraction of onshore observations to include in each batch when using proxy observations. This can be a single float or a dictionary with keys ‘spatial’ and ‘temporal’ to specify the fraction for each domain. If a dictionary is provided, the actual fraction for each batch will be sampled uniformly between the specified spatial and temporal fractions.

    offshore_obs_frac : float | dict

    Fraction of offshore observations to include in each batch when using proxy observations. This can be a single float or a dictionary with keys ‘spatial’ and ‘temporal’ to specify the fraction for each domain. If a dictionary is provided, the actual fraction for each batch will be sampled uniformly between the specified spatial and temporal fractions.

  • mode (str) – Mode for sampling data. Options are ‘lazy’ or ‘eager’. ‘eager’ mode pre-loads all data into memory as numpy arrays for faster access. ‘lazy’ mode samples directly from the underlying data object, which could be backed by dask arrays or on-disk netCDF files.
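As an illustration of the two optional dictionaries described above, a configuration might look like the following sketch. Only the dictionary keys are taken from the parameter descriptions; the feature names, the numeric values, and the commented-out constructor call are placeholders chosen for the example, not recommended settings.

```python
# Hypothetical feature names; only the dictionary keys come from the
# parameter descriptions above.
feature_sets = {
    "lr_features": ["temperature", "u_10m", "v_10m"],
    "hr_out_features": ["temperature"],
    # Sparse (observation) features carry the "_obs" suffix.
    "hr_exo_features": ["topography", "temperature_obs"],
}

# Arbitrary example values. A fraction may be a single float or a dict
# with 'spatial' and 'temporal' keys.
proxy_obs_kwargs = {
    "onshore_obs_frac": {"spatial": 0.05, "temporal": 0.2},
    "offshore_obs_frac": 0.01,
    "perturbation_scale": 0.1,
}

# Assumed usage (not run here; requires sup3r and a loaded data object):
# sampler = Sampler(data, sample_shape=(8, 8, 4), batch_size=16,
#                   feature_sets=feature_sets,
#                   proxy_obs_kwargs=proxy_obs_kwargs, mode="lazy")
```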

Methods

check_feature_consistency()

Check that the feature sets are consistent with each other and the obs features are configured correctly.

check_proxy_obs_consistency()

Check that the obs features are configured correctly for proxy observations.

get_sample_index([n_obs])

Randomly gets spatiotemporal sample index.

post_init_log([args_dict])

Log additional arguments after initialization.

preflight()

Perform shape and feature checks.

wrap(data)

Return a Sup3rDataset object or tuple of such.

Attributes

timer

data

Return underlying data.

hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.

hr_features

List of feature names or patt*erns that the model is shown at high-resolution.

hr_features_ind

Get the high-resolution feature channel indices that should be included for loss calculations.

hr_out_features

List of feature names or patt*erns that should be output by the generative model.

hr_sample_shape

Shape of the data sample to select when __next__() is called.

hr_source_features

List of feature names or patt*erns that should be available natively as high-resolution.

lr_features

List of feature names or patt*erns to use as low-resolution model inputs.

lr_features_ind

Get the low-resolution feature channel indices that should be included for training.

obs_features

List of feature names or patt*erns that should be treated as observations.

obs_features_ind

Get the source feature indices in features for each obs feature.

offshore_obs_frac

Fraction of offshore observations to include in each batch when using proxy observations.

onshore_obs_frac

Fraction of onshore observations to include in each batch when using proxy observations.

perturbation_scale

Scale of the perturbation to add to the proxy observations when using proxy observations.

sample_shape

Shape of the data sample to select when __next__() is called.

shape

Get shape of underlying data.

use_proxy_obs

Whether to use proxy observations.

property use_proxy_obs

Whether to use proxy observations. When True, proxy observation features are generated by masking the corresponding gridded ground truth data and are appended to the samples. The obs features are specified by the obs_features argument and should have a corresponding source feature in the data features that is used for sampling. For example, an obs feature named temperature_obs would be generated from the gridded ground truth feature named temperature.
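The masking idea can be sketched with plain NumPy. This is not the sup3r implementation: the array layout (lat, lon, time), the obs_frac value, and the choice to mask whole grid points across all time steps are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gridded ground truth for one feature: (lat, lon, time).
temperature = rng.normal(280.0, 5.0, size=(4, 4, 3))

# Assumption: keep roughly 25% of grid points as "observed" locations.
obs_frac = 0.25
keep = rng.uniform(size=temperature.shape[:2]) < obs_frac

# Proxy obs feature: ground truth where observed, NaN everywhere else.
temperature_obs = np.where(keep[..., None], temperature, np.nan)
```

Here temperature_obs plays the role of the obs feature and temperature is its gridded source feature.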

property onshore_obs_frac

Fraction of onshore observations to include in each batch when using proxy observations. This can be a single float or a dictionary with keys ‘spatial’ and ‘temporal’ to specify the fraction for each domain. If a dictionary is provided, the actual fraction for each batch will be sampled uniformly between the specified spatial and temporal fractions.

property offshore_obs_frac

Fraction of offshore observations to include in each batch when using proxy observations. This can be a single float or a dictionary with keys ‘spatial’ and ‘temporal’ to specify the fraction for each domain. If a dictionary is provided, the actual fraction for each batch will be sampled uniformly between the specified spatial and temporal fractions.
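One way to read the float-or-dictionary behavior for onshore_obs_frac and offshore_obs_frac is sketched below. The helper draw_obs_frac is hypothetical and not part of sup3r; it assumes the per-batch fraction is drawn uniformly between the 'spatial' and 'temporal' values, whichever is larger.

```python
import numpy as np


def draw_obs_frac(obs_frac, rng):
    """Hypothetical helper: resolve an obs fraction setting to one float."""
    if isinstance(obs_frac, dict):
        # Assumption: draw uniformly between the two configured fractions,
        # regardless of which of them is larger.
        low, high = sorted((obs_frac["spatial"], obs_frac["temporal"]))
        return float(rng.uniform(low, high))
    # A single float is used as-is for every batch.
    return float(obs_frac)


rng = np.random.default_rng(0)
onshore = draw_obs_frac({"spatial": 0.05, "temporal": 0.2}, rng)
offshore = draw_obs_frac(0.01, rng)
```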

property perturbation_scale

Scale of the perturbation to add to the proxy observations when using proxy observations. This specifies the multiplier of the noise sampled from (-standard deviation, standard deviation).
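A rough sketch of such a perturbation follows. It assumes the noise is uniform on (-standard deviation, standard deviation), that the standard deviation is taken per feature over the batch, and that the batch layout is (batch, lat, lon, time, features); none of these details are confirmed here and the actual sup3r code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: (batch, lat, lon, time, features).
batch = rng.normal(size=(2, 4, 4, 3, 2))
perturbation_scale = 0.1

# Standard deviation per feature, computed over the whole batch.
std = batch.std(axis=(0, 1, 2, 3), keepdims=True)

# Assumption: noise is uniform on (-std, std), then scaled.
noise = rng.uniform(-std, std, size=batch.shape)
perturbed = batch + perturbation_scale * noise
```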

get_sample_index(n_obs=None)

Randomly gets spatiotemporal sample index.

Returns:

sample_index (tuple) – Tuple of latitude slice, longitude slice, time slice, and features. Used to get a single observation, e.g. self.data[sample_index].

Notes

If n_obs > 1 this will get a time slice with n_obs * self.sample_shape[2] time steps, which will then be reshaped into n_obs samples each with self.sample_shape[2] time steps. This is a much more efficient way of getting batches of samples but only works if there are enough continuous time steps to sample.
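The reshaping described in this note can be illustrated with a NumPy array standing in for the underlying data. The array layout (lat, lon, time, features), the variable names, and the use of slice(None) for the feature entry are assumptions for the sketch, not the sup3r internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data array: (lat, lon, time, features).
data = rng.normal(size=(20, 20, 100, 2))
sample_shape = (4, 4, 5)
n_obs = 3

# One contiguous time slice long enough for n_obs samples.
n_time = n_obs * sample_shape[2]
lat0 = rng.integers(0, data.shape[0] - sample_shape[0] + 1)
lon0 = rng.integers(0, data.shape[1] - sample_shape[1] + 1)
t0 = rng.integers(0, data.shape[2] - n_time + 1)

lat_slice = slice(lat0, lat0 + sample_shape[0])
lon_slice = slice(lon0, lon0 + sample_shape[1])
time_slice = slice(t0, t0 + n_time)
sample_index = (lat_slice, lon_slice, time_slice, slice(None))

# Single read, then split the time axis into (n_obs, sample_shape[2]).
chunk = data[sample_index]
batch = chunk.reshape(*sample_shape[:2], n_obs, sample_shape[2], -1)
batch = np.moveaxis(batch, 2, 0)  # (n_obs, lat, lon, time, features)
```

Each batch[i] covers the i-th block of sample_shape[2] consecutive time steps within the selected time slice, which is why enough contiguous time steps are required.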

preflight()

Perform shape and feature checks.

check_proxy_obs_consistency()

Check that the obs features are configured correctly for proxy observations.

check_feature_consistency()

Check that the feature sets are consistent with each other and the obs features are configured correctly.

property sample_shape: tuple

Shape of the data sample to select when __next__() is called.

property hr_sample_shape: tuple

Shape of the data sample to select when __next__() is called. Same as sample_shape.

property lr_features

List of feature names or patt*erns to use as low-resolution model inputs. If no entry is provided then all available features from the data will be used.

property hr_source_features

List of feature names or patt*erns that should be available natively as high-resolution. For a non-dual sampler this is all features, since even features only provided to the model as low-resolution still need to be coarsened from the high-resolution data. This is in contrast to dual samplers (DualSampler), where there are separate high-resolution and low-resolution data members.

property hr_features

List of feature names or patt*erns that the model is shown at high-resolution. This does not include features that are only shown to the model after coarsening. Thus, this includes hr_out_features and hr_exo_features.

property hr_out_features

List of feature names or patt*erns that should be output by the generative model. If no entry is provided then all features in hr_features will be used.

property hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property obs_features

List of feature names or patt*erns that should be treated as observations. These features will be included in the high-res data but not the low-res data and won’t necessarily be expected to be output by the generative model. These are different from other hr_exo_features in that they are intended to be used as observation features with NaN values where observations are not available.

property hr_features_ind

Get the high-resolution feature channel indices that should be included for loss calculations. This includes hr_out_features and hr_exo_features. Any high-resolution features that are only included in the data handler to be coarsened for the low-res input are removed.

property lr_features_ind

Get the low-resolution feature channel indices that should be included for training. This includes lr_features.
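The relationship between the feature lists and the two index properties can be sketched with plain lists. The feature names are placeholders, and looking each name up in the full feature list is an assumed reading of the descriptions rather than the library's actual code.

```python
# Hypothetical full feature list for a non-dual sampler.
features = ["u_10m", "v_10m", "pressure", "topography"]

lr_features = ["u_10m", "v_10m", "pressure"]
hr_out_features = ["u_10m", "v_10m"]
hr_exo_features = ["topography"]  # exo features come last

# Features shown to the model at high resolution.
hr_features = hr_out_features + hr_exo_features

# Channel indices into the full feature list. "pressure" is only used
# after coarsening, so it does not appear in hr_features_ind.
hr_features_ind = [features.index(f) for f in hr_features]
lr_features_ind = [features.index(f) for f in lr_features]
```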

property data

Return underlying data.

Returns:

Sup3rDataset

See also

wrap()

property obs_features_ind

Get the source feature indices in features for each obs feature. Each obs feature named <feature>_obs maps to the corresponding <feature> in the features.

Returns:

list[int] – Indices into features for each obs feature source.
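The <feature>_obs to <feature> mapping can be sketched as a simple lookup. The feature names below are placeholders and the list comprehension is an illustration of the mapping, not the property's actual implementation.

```python
# Hypothetical feature lists.
features = ["temperature", "u_10m", "v_10m", "topography"]
obs_features = ["temperature_obs", "u_10m_obs"]

# Strip the "_obs" suffix and look up the source feature index.
obs_features_ind = [
    features.index(name.removesuffix("_obs")) for name in obs_features
]
```

With these inputs, temperature_obs maps to the index of temperature and u_10m_obs to the index of u_10m.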

post_init_log(args_dict=None)

Log additional arguments after initialization.

property shape

Get shape of underlying data.

wrap(data)

Return a Sup3rDataset object or a tuple of such. This is a tuple when the .data attribute belongs to a Collection object like BatchHandler. Otherwise this is a Sup3rDataset object, which wraps a 3-tuple, 2-tuple, or 1-tuple (i.e. len(data) == 3, len(data) == 2, or len(data) == 1). This is a 3-tuple when .data belongs to a container object like DualSamplerWithObs, a 2-tuple when .data belongs to a dual container object like DualSampler, and a 1-tuple otherwise.