Interpreting Q-Q Plots for Uncertainty Calibration¶

A Q-Q plot (quantile-quantile plot) is a diagnostic tool for assessing whether the uncertainty estimates from your Gaussian Process model are well-calibrated. In ALchemist, the Q-Q plot helps you determine if your model's predicted uncertainties accurately reflect the true prediction errors.


What is Uncertainty Calibration?¶

When a Gaussian Process predicts an output value, it also provides a measure of uncertainty (standard deviation). Well-calibrated uncertainty means that:

  • If the model says there's a 68% chance the true value is within ±1σ, then approximately 68% of predictions should fall within that range

  • If the model says there's a 95% chance the true value is within ±2σ, then approximately 95% should fall within that range
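
These two statements are easy to check empirically. Below is a minimal sketch with illustrative array names and made-up numbers (five points are far too few for a real check), counting how many cross-validation residuals fall inside the ±1σ and ±2σ intervals:

```python
import numpy as np

# Illustrative cross-validation results; replace with your own data.
y_true = np.array([1.02, 0.88, 1.35, 0.67, 1.10])
y_pred = np.array([1.00, 0.95, 1.25, 0.70, 1.05])
sigma  = np.array([0.10, 0.12, 0.15, 0.08, 0.09])

abs_err = np.abs(y_true - y_pred)
print(f"Coverage at ±1σ: {np.mean(abs_err <= 1 * sigma):.0%} (expect ≈68%)")
print(f"Coverage at ±2σ: {np.mean(abs_err <= 2 * sigma):.0%} (expect ≈95%)")
```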

Calibration is critical for:

  • Decision-making: Reliable uncertainties help determine when to trust predictions vs. run more experiments

  • Acquisition functions: Methods like Expected Improvement and UCB rely on accurate uncertainty estimates

  • Risk assessment: Understanding prediction confidence for safety-critical applications


Understanding the Q-Q Plot¶

What the Plot Shows¶

The Q-Q plot in ALchemist displays:

  • X-axis: Theoretical quantiles from a standard normal distribution \(\mathcal{N}(0, 1)\)

  • Y-axis: Standardized residuals (z-scores) from cross-validation predictions

  • Diagonal reference line: Perfect calibration (y = x)

  • Confidence band (for small samples, N < 100): Expected deviation range due to finite sample size

  • Diagnostic metrics: Mean(z) and Std(z) displayed on the plot
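
The plot is straightforward to reproduce outside ALchemist as well. Below is a minimal sketch (not ALchemist's internal implementation) using NumPy, SciPy, and Matplotlib, where `z` stands in for the standardized residuals defined in the next subsection; the confidence band uses the standard order-statistic construction:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
z = rng.normal(size=50)  # stand-in for your standardized residuals

n = len(z)
i = np.arange(1, n + 1)
theo = stats.norm.ppf((i - 0.5) / n)  # theoretical N(0,1) quantiles

# 95% pointwise band: the i-th order statistic of n uniform draws
# follows Beta(i, n - i + 1); push its quantiles through norm.ppf.
band_lo = stats.norm.ppf(stats.beta.ppf(0.025, i, n - i + 1))
band_hi = stats.norm.ppf(stats.beta.ppf(0.975, i, n - i + 1))

plt.fill_between(theo, band_lo, band_hi, alpha=0.2, label="95% band")
plt.plot(theo, np.sort(z), "o", label="standardized residuals")
plt.plot(theo, theo, "k--", label="y = x (perfect calibration)")
plt.xlabel("Theoretical quantiles")
plt.ylabel("Standardized residuals")
plt.title(f"Mean(z) = {z.mean():.2f}, Std(z) = {z.std(ddof=1):.2f}")
plt.legend()
plt.show()
```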

Standardized Residuals (Z-scores)¶

For each cross-validation prediction, the z-score is calculated as:

\[ z_i = \frac{y_i^{\text{true}} - y_i^{\text{pred}}}{\sigma_i} \]

Where:

  • \(y_i^{\text{true}}\) = actual experimental value

  • \(y_i^{\text{pred}}\) = model prediction

  • \(\sigma_i\) = predicted standard deviation

If uncertainties are well-calibrated, these z-scores should follow a standard normal distribution \(\mathcal{N}(0, 1)\).
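
In code, the computation is a one-liner. The helper below is a hypothetical convenience function for illustration, not part of ALchemist's API:

```python
import numpy as np

def z_scores(y_true, y_pred, sigma):
    """Standardized residuals z_i = (y_true_i - y_pred_i) / sigma_i."""
    y_true, y_pred, sigma = map(np.asarray, (y_true, y_pred, sigma))
    return (y_true - y_pred) / sigma

# Toy numbers for illustration only.
z = z_scores([1.02, 0.88, 1.35], [1.00, 0.95, 1.25], [0.10, 0.12, 0.15])
print(f"Mean(z) = {z.mean():.2f}, Std(z) = {z.std(ddof=1):.2f}")
```

Here `ddof=1` gives the sample standard deviation; at realistic sample sizes the choice barely matters.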


Interpreting the Plot¶

Perfect Calibration 🎯¶

What it looks like:

  • Points closely follow the diagonal line (y = x)

  • Mean(z) ≈ 0.0

  • Std(z) ≈ 1.0

  • Points within confidence band

What it means:

  • Model uncertainties accurately reflect prediction errors

  • Roughly 68% of predictions fall within ±1σ and 95% within ±2σ, as expected

  • Acquisition functions can trade off exploration and exploitation as intended

Example:

Mean(z) = 0.02
Std(z) = 0.98
Status: ✓ Well-calibrated


Over-Confident Predictions¶

What it looks like:

  • Points climb more steeply than the diagonal (above the line in the upper tail, below it in the lower tail)

  • Std(z) > 1.0 (e.g., 1.5, 2.0, or higher)

  • Residuals are larger than predicted uncertainties

What it means:

  • Model is too confident in its predictions

  • Actual errors are larger than the model thinks they are

  • For example, the model predicts σ = 2 when the actual error is closer to σ = 4

  • Risk of over-exploiting regions that may not be optimal

Why it happens:

  • Insufficient model complexity (kernel too simple)

  • Underestimated noise in the data

  • Not enough data for the problem complexity

  • Overfitting to training data

How to fix:

  • Try a more flexible kernel (e.g., Matern with lower ν)

  • Increase model noise parameter (if using noise column)

  • Collect more training data

  • Apply uncertainty calibration (ALchemist does this automatically; a sketch of the idea follows the example below)

Example:

Mean(z) = -0.05
Std(z) = 1.45
Status: Over-confident (model uncertainties too small)
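
For intuition about the automatic calibration mentioned above, here is a minimal sketch of one common recipe, std-scaling: multiply every predicted σ by the Std(z) measured on cross-validation residuals. The function name is illustrative, and this is not necessarily the exact procedure ALchemist uses:

```python
import numpy as np

def recalibrate_sigma(sigma, z_cv):
    """Scale predicted standard deviations by the spread of CV z-scores.

    If Std(z) > 1 (over-confident) this inflates sigma; if Std(z) < 1
    (under-confident) it shrinks sigma, so the same fix covers both cases.
    """
    scale = np.std(z_cv, ddof=1)
    return np.asarray(sigma) * scale

# Simulated z-scores with spread ≈ 1.45, i.e. an over-confident model.
z_cv = 1.45 * np.random.default_rng(0).normal(size=40)
print(recalibrate_sigma([0.1, 0.2], z_cv))  # roughly 1.45x the inputs
```

Because the same scale factor shrinks σ when Std(z) < 1, the identical fix applies to the under-confident case described next.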


Under-Confident Predictions¶

What it looks like:

  • Points rise more shallowly than the diagonal (below the line in the upper tail, above it in the lower tail)

  • Std(z) < 1.0 (e.g., 0.6, 0.7, 0.8)

  • Residuals are smaller than predicted uncertainties

What it means:

  • Model is too uncertain about its predictions

  • For example, the model predicts σ = 4 when the actual error is closer to σ = 2

  • Predictions are more accurate than the model believes

  • Risk of over-exploring, wasting experiments in regions the model already predicts well

Why it happens:

  • Model is overly conservative

  • Noise parameter set too high

  • Kernel lengthscales too large (oversmoothing)

  • Small dataset with conservative priors

How to fix:

  • Try a smoother, less flexible kernel (e.g., Matern with higher ν)

  • Reduce model noise parameter

  • Optimize kernel hyperparameters more aggressively

  • Collect more data to reduce inherent uncertainty

Example:

Mean(z) = 0.08
Std(z) = 0.72
Status: Under-confident (model uncertainties too large)


Systematic Bias 🔴¶

What it looks like:

  • Mean(z) significantly different from 0 (e.g., |Mean(z)| > 0.3)

  • Points shifted up or down from the diagonal

  • Consistent over- or under-prediction

What it means:

  • Model has systematic bias in predictions

  • Not just a calibration issue—predictions are consistently off

  • Mean(z) > 0: Model consistently under-predicts

  • Mean(z) < 0: Model consistently over-predicts

How to fix:

  • Check data quality and units

  • Try different kernel types

  • Check for data preprocessing issues

  • Ensure input/output transforms are appropriate

  • Consider adding a mean function or trend (see the sketch after the example below)

Example:

Mean(z) = 0.45
Std(z) = 1.02
Status: 🔴 Systematic bias (consistent under-prediction)
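
One simple way to act on the "mean function or trend" suggestion is to detrend before fitting, sketched below with scikit-learn as a stand-in for whatever backend you use. The two-stage setup and all names are illustrative, not ALchemist's internals:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 0.5 * X.ravel() + rng.normal(scale=0.1, size=40)  # data with a trend

# Fit the trend first, then a GP on what the trend cannot explain.
trend = LinearRegression().fit(X, y)
resid = y - trend.predict(X)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2).fit(X, resid)

# Combined prediction: trend + GP correction; sigma comes from the GP.
X_new = np.array([[5.0]])
mu_gp, sigma = gp.predict(X_new, return_std=True)
print(trend.predict(X_new) + mu_gp, sigma)
```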


Sample Size Considerations¶

Small Datasets (N < 30)¶

  • Expect more scatter around the diagonal

  • Confidence bands are wider

  • Std(z) can deviate from 1.0 more easily

  • Don't over-interpret minor deviations

Medium Datasets (30 ≤ N < 100)¶

  • Narrower confidence bands

  • More reliable calibration assessment

  • Moderate deviations indicate real issues

Large Datasets (N ≥ 100)¶

  • Tight confidence bands

  • High confidence in calibration assessment

  • Even small deviations may indicate issues
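
A quick Monte-Carlo sketch makes these sample-size effects concrete: even for a perfectly calibrated model, where z is drawn exactly from N(0, 1), the sample Std(z) wanders far from 1.0 when N is small:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a perfectly calibrated model, z ~ N(0, 1) exactly. Even then the
# *sample* Std(z) fluctuates; smaller N means wider fluctuation.
for n in (20, 50, 200):
    stds = [np.std(rng.normal(size=n), ddof=1) for _ in range(10_000)]
    lo, hi = np.percentile(stds, [2.5, 97.5])
    print(f"N = {n:3d}: 95% of Std(z) values land in [{lo:.2f}, {hi:.2f}]")
```

With N = 20, values anywhere from roughly 0.7 to 1.3 are unremarkable, which is why the thresholds below are applied only for N > 30.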


Practical Guidelines¶

When to Worry 🚨¶

Take action if you see:

  • Std(z) > 1.3 or < 0.7 (with N > 30)

  • |Mean(z)| > 0.3

  • Clear systematic pattern in deviations

  • Points consistently outside confidence band

When Not to Worry¶

Don't be concerned if:

  • Minor scatter with Std(z) between 0.9 and 1.1

  • Mean(z) between -0.1 and 0.1

  • Points within confidence band (especially for N < 30)

  • Random scatter without systematic pattern
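
These rules of thumb are mechanical enough to automate. The helper below is a hypothetical function that encodes the thresholds from the two lists above and returns a rough triage message:

```python
import numpy as np

def qq_diagnosis(z, n_min=30):
    """Rough triage of Q-Q diagnostics using the guideline thresholds."""
    z = np.asarray(z)
    mean_z, std_z = z.mean(), z.std(ddof=1)
    if len(z) < n_min:
        return f"N = {len(z)} is small: expect scatter, don't over-interpret"
    if abs(mean_z) > 0.3:
        return f"Mean(z) = {mean_z:.2f}: systematic bias, check data/model"
    if std_z > 1.3:
        return f"Std(z) = {std_z:.2f}: over-confident, increase uncertainty"
    if std_z < 0.7:
        return f"Std(z) = {std_z:.2f}: under-confident, reduce uncertainty"
    return f"Mean(z) = {mean_z:.2f}, Std(z) = {std_z:.2f}: well-calibrated"

print(qq_diagnosis(np.random.default_rng(0).normal(size=60)))
```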


Relationship to Calibration Curve¶

The Q-Q plot and Calibration Curve are complementary:

  • Q-Q Plot: Tests whether the standardized residuals follow a standard normal distribution (are the z-scores ~ N(0,1)?)

  • Calibration Curve: Tests if confidence intervals have correct coverage (do 95% intervals contain 95% of points?)

Use both together for comprehensive uncertainty assessment:

  • Q-Q plot reveals over/under-confidence and bias

  • Calibration curve quantifies coverage at specific confidence levels
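
A calibration curve is easy to sketch from the same z-scores. The function below (illustrative, not ALchemist's implementation) compares the nominal confidence level of central intervals against their empirical coverage:

```python
import numpy as np
from scipy import stats

def calibration_curve(z, levels=np.linspace(0.1, 0.99, 10)):
    """Empirical coverage of central intervals vs nominal confidence level.

    A calibrated model should have about a fraction p of |z| values
    below the half-width of the nominal central p-interval of N(0, 1).
    """
    z = np.asarray(z)
    half_widths = stats.norm.ppf(0.5 + levels / 2)  # e.g. 1.96 for p = 0.95
    empirical = np.array([(np.abs(z) <= w).mean() for w in half_widths])
    return levels, empirical

levels, cov = calibration_curve(np.random.default_rng(0).normal(size=200))
for p, c in zip(levels, cov):
    print(f"nominal {p:.0%} -> empirical {c:.0%}")
```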


Summary¶

| Observation | Mean(z) | Std(z) | Interpretation | Action |
|---|---|---|---|---|
| Points on the diagonal | ≈ 0 | ≈ 1.0 | ✓ Well-calibrated | None needed |
| Slope steeper than the diagonal | ≈ 0 | > 1.0 | Over-confident | Increase predicted uncertainty |
| Slope flatter than the diagonal | ≈ 0 | < 1.0 | Under-confident | Reduce predicted uncertainty |
| Points shifted up | > 0 | any | 🔴 Under-predicting | Check data/model |
| Points shifted down | < 0 | any | 🔴 Over-predicting | Check data/model |

Further Reading¶

For the mathematical foundations of uncertainty calibration in Gaussian Processes, see:

  • Kuleshov et al. (2018), "Accurate Uncertainties for Deep Learning Using Calibrated Regression", ICML 2018

  • Gneiting & Raftery (2007), "Strictly Proper Scoring Rules, Prediction, and Estimation", Journal of the American Statistical Association 102(477)