Interpreting Calibration Curves for Uncertainty Assessment¶
A calibration curve (also called a reliability diagram) is a diagnostic tool that helps you evaluate whether your Gaussian Process model's predicted confidence intervals have the correct coverage. In ALchemist, the calibration curve complements the Q-Q plot to provide a comprehensive view of your model's uncertainty quality.
What is Coverage Calibration?¶
When a Gaussian Process provides a prediction with uncertainty, it defines confidence intervals at various levels:
- 68% confidence interval: μ ± 1σ should contain 68% of true values
- 95% confidence interval: μ ± 1.96σ should contain 95% of true values
- 99% confidence interval: μ ± 2.58σ should contain 99% of true values
Well-calibrated coverage means that the empirical (observed) coverage matches the nominal (claimed) coverage. For example, if your model claims 95% confidence, then 95% of experimental observations should actually fall within those bounds.
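For intuition, here is a minimal sketch (not ALchemist's internal code) of how empirical coverage can be computed, assuming you have held-out observations `y_true` and predicted means `mu` and standard deviations `sigma` as NumPy arrays:

```python
import numpy as np
from scipy.stats import norm

def empirical_coverage(y_true, mu, sigma, confidence):
    """Fraction of observations falling inside the symmetric `confidence` interval."""
    z = norm.ppf(0.5 + confidence / 2.0)       # e.g. 1.96 for a 95% interval
    inside = np.abs(y_true - mu) <= z * sigma
    return inside.mean()

# For a well-calibrated model, this should return roughly 0.95:
# empirical_coverage(y_true, mu, sigma, confidence=0.95)
```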
Why Coverage Calibration Matters¶
Accurate coverage is essential for:
- Reliable decision-making: Knowing when predictions are trustworthy
- Risk management: Avoiding over-confident predictions in safety-critical applications
- Efficient exploration: Balancing exploration and exploitation in optimization
- Experimental planning: Determining when more data is needed vs. when to act on predictions
Understanding the Calibration Curve¶
Components of the Visualization¶
ALchemist's calibration curve display includes:
- Line plot: Nominal coverage (x-axis) vs. empirical coverage (y-axis)
- Perfect calibration line: Diagonal reference (y = x)
- Metrics table: Coverage values at standard confidence levels
- Color-coded status: Visual indicators for calibration quality
Reading the Plot¶
X-axis (Nominal Coverage): The confidence level claimed by the model (e.g., 0.68 = 68% confidence)
Y-axis (Empirical Coverage): The actual fraction of observations that fall within the predicted intervals
Perfect calibration: Points lie on the diagonal line (y = x)
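The curve itself can be reproduced with a few lines of NumPy and Matplotlib. This is an illustrative sketch rather than ALchemist's plotting code, and it assumes the same `y_true`, `mu`, and `sigma` arrays as above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def calibration_curve(y_true, mu, sigma, levels=np.linspace(0.05, 0.99, 20)):
    """Empirical coverage at each nominal confidence level."""
    z = norm.ppf(0.5 + levels / 2.0)                     # interval half-widths in std devs
    inside = np.abs(y_true - mu)[:, None] <= z * sigma[:, None]
    return levels, inside.mean(axis=0)

# levels, empirical = calibration_curve(y_true, mu, sigma)
# plt.plot(levels, empirical, marker="o", label="Model")
# plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
# plt.xlabel("Nominal coverage"); plt.ylabel("Empirical coverage"); plt.legend(); plt.show()
```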
Interpreting Coverage Patterns¶
Well-Calibrated Model¶
What it looks like:
- Points closely follow the diagonal line
- Empirical ≈ Nominal at all confidence levels
- Status indicators show "Good" (green)
Metrics example:
| Confidence | Nominal | Empirical | Status |
|---|---|---|---|
| 68% | 0.68 | 0.67 | ✓ Good |
| 95% | 0.95 | 0.94 | ✓ Good |
| 99% | 0.99 | 0.98 | ✓ Good |
| 99.7% | 0.997 | 0.995 | ✓ Good |
What it means:
- Model uncertainties accurately reflect prediction errors
- Confidence intervals have correct coverage
- Safe to trust model predictions and uncertainties
- Acquisition functions will make well-informed decisions
Over-Confident Model¶
What it looks like:
- Curve lies below the diagonal line
- Empirical coverage < Nominal coverage
- Status indicators show "Under-conf" (orange/red)
Metrics example:
| Confidence | Nominal | Empirical | Status |
|---|---|---|---|
| 68% | 0.68 | 0.55 | Under-conf |
| 95% | 0.95 | 0.82 | Under-conf |
| 99% | 0.99 | 0.91 | Under-conf |
| 99.7% | 0.997 | 0.95 | Under-conf |
What it means:
- Model is too confident in its predictions
- Claimed "95% confidence" intervals capture only 82% of observations
- Prediction intervals are too narrow
- Risk of missing optimal regions by over-exploiting
Why it happens:
- Model underestimates noise in the data
- Kernel is too restrictive (overfit to training data)
- Insufficient data for the problem's complexity
- Lengthscales too small (overfitting local variations)
How to fix:

1. Increase noise parameter: If using a noise column, increase the values
2. Regularization: Add an explicit noise term to the model
3. Change kernel: Try a more flexible kernel (Matern ν=1.5 instead of ν=2.5); see the sketch below
4. Collect more data: Especially in high-variance regions
5. Apply calibration: ALchemist automatically applies calibration corrections
6. Check data quality: Look for outliers or measurement errors
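As an illustration of fixes 2 and 3, here is how an explicit noise term and a rougher Matern kernel could be added with scikit-learn. This is a hedged sketch using placeholder variable names, not ALchemist's own API:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# WhiteKernel adds a learned noise term; Matern nu=1.5 is rougher than nu=2.5.
# Both changes tend to widen intervals that were too narrow.
kernel = Matern(nu=1.5) + WhiteKernel(noise_level=1e-2, noise_level_bounds=(1e-6, 1e1))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gp.fit(X_train, y_train)
# mu, sigma = gp.predict(X_test, return_std=True)
```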
Under-Confident Model¶
What it looks like:
- Curve lies above the diagonal line
- Empirical coverage > Nominal coverage
- Status indicators show "Over-conf" (blue)
Metrics example:
| Confidence | Nominal | Empirical | Status |
|---|---|---|---|
| 68% | 0.68 | 0.78 | Over-conf |
| 95% | 0.95 | 0.99 | Over-conf |
| 99% | 0.99 | 1.00 | Over-conf |
| 99.7% | 0.997 | 1.00 | Over-conf |
What it means:
- Model is too cautious in its predictions
- Claimed "95% confidence" intervals actually capture 99% of observations
- Prediction intervals are too wide
- Risk of over-exploring, wasting experiments on regions that add little information
Why it happens:
- Model overestimates noise in the data
- Kernel is too flexible (underfit)
- Lengthscales too large (over-smoothing)
- Prior distributions too broad
How to fix:

1. Decrease noise parameter: Reduce explicit noise values
2. Tighter kernel: Try a less flexible kernel (Matern ν=2.5 or RBF)
3. Optimize hyperparameters: Ensure lengthscales are optimized, not fixed too large; see the sketch below
4. Check preprocessing: Ensure data is properly scaled
5. More aggressive optimization: Increase training iterations
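Conversely, a sketch of fixes 2 and 3 with scikit-learn (again illustrative placeholders, not ALchemist's API): a smoother kernel, a small jitter instead of a large explicit noise term, and optimizer restarts to avoid lengthscales stuck at overly large values:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2)),
    alpha=1e-6,                # small jitter only, no large explicit noise term
    n_restarts_optimizer=10,   # re-optimize lengthscales from several starting points
    normalize_y=True,
)
# gp.fit(X_train, y_train)
```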
Note: Under-confidence is generally less problematic than over-confidence, but wastes experimental resources.
Mixed Calibration Issues 🔄¶
What it looks like:
- Curve crosses the diagonal line
- Some confidence levels over-confident, others under-confident
- Inconsistent status indicators
Metrics example:
| Confidence | Nominal | Empirical | Status |
|---|---|---|---|
| 68% | 0.68 | 0.62 | Under-conf |
| 95% | 0.95 | 0.96 | ✓ Good |
| 99% | 0.99 | 1.00 | Over-conf |
| 99.7% | 0.997 | 1.00 | Over-conf |
What it means:
- Uncertainty estimates have a non-normal distribution
- May indicate model misspecification
- Different behavior in the tails vs. the center of the distribution
How to fix:

1. Check data distribution: Look for outliers or bimodality
2. Transform outputs: Consider a log or Box-Cox transformation; see the sketch below
3. Different kernel: Experiment with alternative kernel families
4. Stratified sampling: Ensure training data covers the full range
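A short sketch of fix 2 using SciPy's Box-Cox utilities (an illustration, not part of ALchemist): the transform is fitted on the training responses and its λ is kept so predictions can be mapped back to the original units.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=50)  # skewed, strictly positive responses

y_t, lam = boxcox(y)            # fit the Box-Cox transform; keep lambda for inversion
# ...train the GP on y_t instead of y...
y_back = inv_boxcox(y_t, lam)   # map transformed values (or predictions) back
assert np.allclose(y_back, y)
```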
Sample Size Considerations¶
Small Datasets (N < 30) 🔍¶
- High variability in empirical coverage estimates
- Coverage metrics are less reliable
- ±10% deviation from nominal is common
- Focus on overall trends rather than exact values
Interpretation guidance:
- Empirical coverage of 0.85 for nominal 0.95 is acceptable with N=20
- The same coverage would be concerning with N=100
Medium Datasets (30 ≤ N ≤ 100)¶
- Moderate reliability in coverage estimates
- ±5% deviations become significant
- Clear patterns indicate real issues
Large Datasets (N > 100) 🎯¶
- High confidence in calibration assessment
- Even small deviations (±3%) may indicate issues
- Coverage metrics are highly reliable
ALchemist displays a warning for N < 30:
Note: Small sample size (N=25). Coverage estimates may be unreliable.
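These guidelines follow from simple binomial statistics: if each observation independently lands inside or outside the interval, the standard error of the empirical coverage estimate is roughly sqrt(p(1 − p)/N). A quick back-of-the-envelope check (a sketch under that independence assumption):

```python
import numpy as np

def coverage_std_error(nominal, n):
    """Approximate binomial standard error of an empirical coverage estimate."""
    return np.sqrt(nominal * (1.0 - nominal) / n)

for n in (20, 100):
    se = coverage_std_error(0.95, n)
    print(f"N={n:>3}: std. error ~{se:.3f}, so ~±{2 * se:.0%} at two standard errors")
# N= 20: std. error ~0.049, so ~±10% at two standard errors
# N=100: std. error ~0.022, so ~±4% at two standard errors
```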
Calibration Status Indicators¶
ALchemist uses color-coded status indicators for quick assessment:
| Status | Color | Criterion | Interpretation |
|---|---|---|---|
| ✓ Good | Green | \|Empirical - Nominal\| < 0.05 | Well-calibrated |
| Under-conf | Orange | Empirical < Nominal - 0.05 | Too confident (narrow intervals) |
| Over-conf | Blue | Empirical > Nominal + 0.05 | Too cautious (wide intervals) |
These thresholds are adjustable based on sample size and application requirements.
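The mapping from a coverage pair to a status label can be written as a small helper. The function below is a hypothetical sketch that mirrors the thresholds in the table above, not ALchemist's actual code:

```python
def calibration_status(nominal, empirical, tol=0.05):
    """Classify a (nominal, empirical) coverage pair using a symmetric tolerance."""
    if empirical < nominal - tol:
        return "Under-conf"   # intervals too narrow -> over-confident model
    if empirical > nominal + tol:
        return "Over-conf"    # intervals too wide -> under-confident model
    return "Good"

# calibration_status(0.95, 0.82)  -> "Under-conf"
# calibration_status(0.95, 0.94)  -> "Good"
```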
Using Calibration with Q-Q Plots¶
The calibration curve and Q-Q plot provide complementary information:
Q-Q Plot Strengths:¶
- Tests the normality assumption
- Detects bias (Mean(z) ≠ 0)
- Shows over/under-confidence via Std(z)
- Supports visual pattern recognition
Calibration Curve Strengths:¶
- Quantifies coverage at specific confidence levels
- Easier to interpret numerically
- Less sensitive to distributional assumptions
- Links directly to decision-making thresholds
Combined Analysis:¶
- Both good: Model is well-calibrated and uncertainties are reliable
- Q-Q bad, calibration good: Non-normal errors but coverage is correct (acceptable for many applications)
- Q-Q good, calibration bad: Distribution is normal but variance is miscalibrated (a systematic scaling issue)
- Both bad: Significant model misspecification (investigate data and model choices)
Practical Guidelines¶
When to Trust Your Model¶
Proceed with confidence if:
- All coverage metrics are within ±5% of nominal (for N > 30)
- Status indicators show "Good" or mild "Over-conf"
- The calibration curve closely follows the diagonal
- The Q-Q plot also shows good calibration
When to Improve Calibration¶
Take corrective action if:
- Multiple "Under-conf" indicators (model too confident)
- Large deviations (>10%) from nominal coverage
- A consistent pattern across all confidence levels
- Sample size is adequate (N > 30) for reliable assessment
When to Collect More Data¶
Consider more experiments if:
- Sample size is small (N < 30) with unclear patterns
- High variance in coverage estimates
- Model appears underfit (high RMSE, low R²)
- Coverage is acceptable but prediction accuracy is poor
Calibration in Active Learning Context¶
During Bayesian optimization, calibration affects:
Acquisition Function Performance:¶
- Expected Improvement: Relies on accurate σ for the exploration/exploitation balance (see the sketch after this list)
- UCB: Uses σ directly in the acquisition formula
- Probability of Improvement: Needs correct uncertainty quantification
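The textbook forms of these acquisition functions make the role of σ explicit. The sketch below shows the standard expressions, not necessarily the exact ones ALchemist evaluates:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI for maximization; a miscalibrated sigma distorts both terms."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: an over-confident (too small) sigma shrinks the exploration bonus."""
    return mu + kappa * sigma
```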
Optimization Strategy:¶
- Over-confident model: May converge prematurely to local optima (too much exploitation)
- Under-confident model: May waste experiments exploring known regions (too much exploration)
- Well-calibrated model: Optimal balance, efficient convergence
Stopping Criteria:¶
- Calibrated uncertainties help determine when optimization has converged
- Under-confident models may never reach stopping criteria
- Over-confident models may stop too early
ALchemist's Automatic Calibration¶
ALchemist implements automatic uncertainty calibration:
- Cross-validation: Computes z-scores from CV predictions
- Calibration factor: Calculates s = Std(z) from the residuals
- Scaling: Multiplies the predicted σ by the calibration factor
- Application: Automatically applied to future predictions
Effect:
- If Std(z) = 1.5 (over-confident), future σ predictions are scaled by 1.5×
- If Std(z) = 0.7 (under-confident), future σ predictions are scaled by 0.7×
- Brings the model toward better calibration without retraining (see the sketch below)
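A minimal sketch of this std-of-z procedure (the idea described above, not ALchemist's exact internals), assuming cross-validated predictions `mu_cv`, `sigma_cv` and observed values `y_cv`:

```python
import numpy as np

def calibration_factor(y_cv, mu_cv, sigma_cv):
    """s = Std(z) computed from cross-validation residuals."""
    z = (y_cv - mu_cv) / sigma_cv
    return np.std(z)

# s = calibration_factor(y_cv, mu_cv, sigma_cv)
# sigma_calibrated = s * sigma_predicted   # s > 1 widens intervals, s < 1 narrows them
```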
Toggle:
- Calibrated vs. uncalibrated results are viewable in the visualization panel
- Compare them to see the calibration impact on your specific dataset
Summary¶
| Pattern | Coverage vs. Diagonal | Issue | Primary Fix |
|---|---|---|---|
| On diagonal | Aligned | ✓ None | - |
| Below diagonal | Empirical < Nominal | Over-confident | Increase noise/uncertainty |
| Above diagonal | Empirical > Nominal | Under-confident | Reduce noise, tighter kernel |
| Crosses diagonal | Mixed | Model misspecification | Check data, try different kernel |
| High scatter | Variable | Small sample | Collect more data |
Further Reading¶
- Q-Q Plot Interpretation - Complementary diagnostic for normality
- Model Performance - Overall model quality assessment
- Calibration Curve Visualization - How to generate and use the plot in ALchemist
For theoretical background on uncertainty calibration:
- Guo et al. (2017), "On Calibration of Modern Neural Networks"
- Kuleshov et al. (2018), "Accurate Uncertainties for Deep Learning Using Calibrated Regression"
- DeGroot & Fienberg (1983), "The Comparison and Evaluation of Forecasters"