
Interpreting Calibration Curves for Uncertainty Assessment

A calibration curve (also called a reliability diagram) is a diagnostic tool that helps you evaluate whether your Gaussian Process model's predicted confidence intervals have the correct coverage. In ALchemist, the calibration curve complements the Q-Q plot to provide a comprehensive view of your model's uncertainty quality.


What is Coverage Calibration?

When a Gaussian Process provides a prediction with uncertainty, it defines confidence intervals at various levels:

  • 68% confidence interval: μ ± 1σ should contain 68% of true values

  • 95% confidence interval: μ ± 1.96σ should contain 95% of true values

  • 99% confidence interval: μ ± 2.58σ should contain 99% of true values

Well-calibrated coverage means that the empirical (observed) coverage matches the nominal (claimed) coverage. For example, if your model claims 95% confidence, then 95% of experimental observations should actually fall within those bounds.
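
In code, the empirical coverage at one confidence level is just the fraction of observations that land inside the corresponding interval. A minimal numpy sketch with illustrative placeholder arrays (not ALchemist's data structures):

import numpy as np

# Placeholder observations and model predictions (illustrative values only)
y_true = np.array([1.02, 0.88, 1.35, 0.97, 1.11])
mu     = np.array([1.00, 0.90, 1.20, 1.00, 1.10])
sigma  = np.array([0.05, 0.04, 0.10, 0.06, 0.05])

# 95% central interval: mu ± 1.96σ; coverage = fraction of points inside it
z = 1.96
inside = np.abs(y_true - mu) <= z * sigma
print(f"Empirical coverage at nominal 95%: {inside.mean():.2f}")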


Why Coverage Calibration Matters

Accurate coverage is essential for:

  • Reliable decision-making: Knowing when predictions are trustworthy

  • Risk management: Avoiding over-confident predictions in safety-critical applications

  • Efficient exploration: Balancing exploration and exploitation in optimization

  • Experimental planning: Determining when more data is needed vs. when to act on predictions


Understanding the Calibration Curve

Components of the Visualization

ALchemist's calibration curve display includes:

  1. Line plot: Nominal coverage (x-axis) vs. empirical coverage (y-axis)
  2. Perfect calibration line: Diagonal reference (y = x)
  3. Metrics table: Coverage values at standard confidence levels
  4. Color-coded status: Visual indicators for calibration quality

Reading the Plot

X-axis (Nominal Coverage): The confidence level claimed by the model (e.g., 0.68 = 68% confidence)

Y-axis (Empirical Coverage): The actual fraction of observations that fall within the predicted intervals

Perfect calibration: Points lie on the diagonal line (y = x)
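
ALchemist produces this plot for you, but the underlying computation is simple. The numpy/scipy sketch below uses synthetic, well-calibrated data so it runs on its own; in practice the inputs would be your held-out observations and the model's predictive means and standard deviations:

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Synthetic, well-calibrated predictions so the example is self-contained
rng = np.random.default_rng(0)
mu = rng.normal(size=200)                      # predictive means
sigma = np.full(200, 1.0)                      # predictive standard deviations
y_true = mu + rng.normal(scale=1.0, size=200)  # observations

def calibration_curve(y_true, mu, sigma, levels=np.linspace(0.05, 0.995, 40)):
    """Empirical vs. nominal coverage of central prediction intervals."""
    # Half-width multiplier for each nominal level, e.g. 0.95 -> 1.96
    z = norm.ppf(0.5 + levels / 2)
    inside = np.abs(y_true - mu)[:, None] <= sigma[:, None] * z[None, :]
    return levels, inside.mean(axis=0)

levels, empirical = calibration_curve(y_true, mu, sigma)
plt.plot(levels, empirical, marker="o", label="model")
plt.plot([0, 1], [0, 1], "k--", label="perfect calibration (y = x)")
plt.xlabel("Nominal coverage")
plt.ylabel("Empirical coverage")
plt.legend()
plt.show()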


Interpreting Coverage Patterns

Well-Calibrated Model

What it looks like:

  • Points closely follow the diagonal line

  • Empirical ≈ Nominal at all confidence levels

  • Status indicators show "Good" (green)

Metrics example:

Confidence   Nominal   Empirical   Status
68%          0.68      0.67        ✓ Good
95%          0.95      0.94        ✓ Good
99%          0.99      0.98        ✓ Good
99.7%        0.997     0.995       ✓ Good

What it means:

  • Model uncertainties accurately reflect prediction errors

  • Confidence intervals have correct coverage

  • Safe to trust model predictions and uncertainties

  • Acquisition functions will make optimal decisions


Over-Confident Model

What it looks like:

  • Curve below the diagonal line

  • Empirical coverage < Nominal coverage

  • Status indicators show "Under-conf" (orange/red)

Metrics example:

Confidence   Nominal   Empirical   Status
68%          0.68      0.55        Under-conf
95%          0.95      0.82        Under-conf
99%          0.99      0.91        Under-conf
99.7%        0.997     0.95        Under-conf

What it means:

  • Model is too confident in its predictions

  • Claimed "95% confidence" only captures 82% of observations

  • Prediction intervals are too narrow

  • Risk of missing optimal regions by over-exploiting

Why it happens:

  • Model underestimates noise in the data

  • Kernel is too restrictive, producing an overly confident fit to the training data

  • Insufficient data for problem complexity

  • Lengthscales too small (overfitting local variations)

How to fix:

  1. Increase noise parameter: If using a noise column, increase the values
  2. Regularization: Add an explicit noise term to the model (see the sketch below)
  3. Change kernel: Try a more flexible kernel (Matern ν=1.5 instead of ν=2.5)
  4. Collect more data: Especially in high-variance regions
  5. Apply calibration: ALchemist automatically applies calibration corrections
  6. Check data quality: Look for outliers or measurement errors
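
If you prototype GP models outside ALchemist, fixes 2 and 3 look roughly like the scikit-learn sketch below (a generic illustration, not ALchemist's internal API): an explicit WhiteKernel noise term plus a rougher Matern kernel keeps the model from attributing all variation to the signal.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel

kernel = (
    ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5)  # rougher than nu=2.5
    + WhiteKernel(noise_level=1e-2)                          # learned observation noise
)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                               n_restarts_optimizer=10)
# gpr.fit(X_train, y_train)  # X_train, y_train: your experimental data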


Under-Confident Model

What it looks like:

  • Curve above the diagonal line

  • Empirical coverage > Nominal coverage

  • Status indicators show "Over-conf" (blue)

Metrics example:

Confidence   Nominal   Empirical   Status
68%          0.68      0.78        Over-conf
95%          0.95      0.99        Over-conf
99%          0.99      1.00        Over-conf
99.7%        0.997     1.00        Over-conf

What it means:

  • Model is too cautious with its predictions

  • Claimed "95% confidence" actually captures 99% of observations

  • Prediction intervals are too wide

  • Risk of over-exploring, wasting experiments on unnecessary regions

Why it happens:

  • Model overestimates noise in the data

  • Kernel is too flexible, so predictions revert to the wide prior between data points

  • Lengthscales too large (over-smoothing)

  • Prior distributions too broad

How to fix:

  1. Decrease noise parameter: Reduce explicit noise values
  2. Tighter kernel: Try a less flexible kernel (Matern ν=2.5 or RBF)
  3. Optimize hyperparameters: Ensure lengthscales are optimized rather than fixed too large (see the sketch below)
  4. Check preprocessing: Ensure data is properly scaled
  5. More aggressive optimization: Increase training iterations
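
As a generic scikit-learn sketch of fixes 3 and 5 (again, not ALchemist's internal API), letting the marginal-likelihood optimizer restart more often and then inspecting the learned kernel confirms whether the lengthscales were actually optimized:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                               n_restarts_optimizer=20)   # more optimizer restarts
# gpr.fit(X_train, y_train)                    # X_train, y_train: your scaled data
# print(gpr.kernel_)                           # optimized hyperparameters after fitting
# print(gpr.log_marginal_likelihood_value_)    # fit quality of the chosen hyperparameters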

Note: Under-confidence is generally less problematic than over-confidence, but wastes experimental resources.


Mixed Calibration Issues 🔄

What it looks like:

  • Curve crosses the diagonal line

  • Some confidence levels over-confident, others under-confident

  • Inconsistent status indicators

Metrics example:

Confidence   Nominal   Empirical   Status
68%          0.68      0.62        Under-conf
95%          0.95      0.96        Good
99%          0.99      1.00        Over-conf
99.7%        0.997     1.00        Over-conf

What it means:

  • The standardized errors have a non-normal distribution (e.g., heavy tails or skew)

  • May indicate model misspecification

  • Different behavior in tails vs. center of distribution

How to fix:

  1. Check data distribution: Look for outliers or bimodality
  2. Transform outputs: Consider a log or Box-Cox transformation (see the sketch below)
  3. Different kernel: Experiment with alternative kernel families
  4. Stratified sampling: Ensure training data covers the full range
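
For fix 2, a log or Box-Cox transform takes one line each with numpy/scipy; the output values below are illustrative placeholders:

import numpy as np
from scipy import stats

y = np.array([0.8, 1.1, 0.9, 3.5, 0.7, 5.2, 1.0])  # skewed, strictly positive outputs

y_log = np.log(y)                  # simple log transform
y_boxcox, lam = stats.boxcox(y)    # Box-Cox transform; lam is the fitted exponent
print(f"Fitted Box-Cox lambda: {lam:.2f}")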


Sample Size Considerations

Small Datasets (N < 30) 🔍

  • High variability in empirical coverage estimates

  • Coverage metrics less reliable

  • ±10% deviation from nominal is common

  • Focus on overall trends rather than exact values

Interpretation guidance:

An empirical coverage of 0.85 at nominal 0.95 is acceptable with N=20.
The same coverage would be concerning with N=100.
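
A rough rule of thumb, not an ALchemist-specific criterion: empirical coverage is a binomial proportion, so its standard error is about sqrt(p(1 - p)/N). The short calculation below shows why a 0.10 deviation is plausible sampling noise at N=20 but a real miscalibration signal at N=100:

import numpy as np

p = 0.95
for n in (20, 100):
    se = np.sqrt(p * (1 - p) / n)
    print(f"N={n:4d}: SE = {se:.3f}, ~2*SE band = +/-{2 * se:.2f}")

# N=  20: SE = 0.049, ~2*SE band = +/-0.10  -> 0.85 is within noise
# N= 100: SE = 0.022, ~2*SE band = +/-0.04  -> 0.85 indicates real miscalibration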

Medium Datasets (30 < N < 100)

  • Moderate reliability in coverage estimates

  • ±5% deviation becoming significant

  • Clear patterns indicate real issues

Large Datasets (N > 100) 🎯

  • High confidence in calibration assessment

  • Even small deviations (±3%) may indicate issues

  • Coverage metrics are highly reliable

ALchemist displays a warning for N < 30:

Note: Small sample size (N=25). Coverage estimates may be unreliable.


Calibration Status Indicators

ALchemist uses color-coded status indicators for quick assessment:

Status       Color    Criterion                      Interpretation
✓ Good       Green    |Empirical - Nominal| < 0.05   Well-calibrated
Under-conf   Orange   Empirical < Nominal - 0.05     Too confident (narrow intervals)
Over-conf    Blue     Empirical > Nominal + 0.05     Too cautious (wide intervals)

These thresholds are adjustable based on sample size and application requirements.
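
The decision rule behind these indicators fits in a few lines. The function below is an illustrative sketch of the documented thresholds (the name calibration_status is made up, not ALchemist's source code):

def calibration_status(nominal: float, empirical: float, tol: float = 0.05) -> str:
    """Classify a coverage point using the +/-0.05 thresholds from the table above."""
    if abs(empirical - nominal) < tol:
        return "Good"
    return "Under-conf" if empirical < nominal else "Over-conf"

print(calibration_status(0.95, 0.94))  # Good
print(calibration_status(0.95, 0.82))  # Under-conf (model too confident)
print(calibration_status(0.68, 0.78))  # Over-conf (intervals too wide)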


Using Calibration with Q-Q Plots

The calibration curve and Q-Q plot provide complementary information:

Q-Q Plot Strengths:

  • Tests normality assumption

  • Detects bias (Mean(z) ≠ 0)

  • Shows over/under-confidence via Std(z)

  • Visual pattern recognition

Calibration Curve Strengths:

  • Quantifies coverage at specific confidence levels

  • Easier to interpret numerically

  • Less sensitive to distributional assumptions

  • Direct link to decision-making thresholds

Combined Analysis:

Both good: Model is well-calibrated and uncertainties are reliable

Q-Q bad, Calibration good: Non-normal errors but coverage is correct (acceptable for many applications)

Q-Q good, Calibration bad: Distribution is normal but variance is miscalibrated (systematic scaling issue)

Both bad: Significant model misspecification (investigate data and model choices)


Practical Guidelines

When to Trust Your Model

Proceed with confidence if:

  • All coverage metrics within ±5% of nominal (for N > 30)

  • Status indicators show "Good" or mild "Over-conf"

  • Calibration curve closely follows diagonal

  • Q-Q plot also shows good calibration

When to Improve Calibration

Take corrective action if:

  • Multiple "Under-conf" indicators (model too confident)

  • Large deviations (>10%) from nominal coverage

  • Consistent pattern across all confidence levels

  • Sample size is adequate (N > 30) for reliable assessment

When to Collect More Data

Consider more experiments if:

  • Sample size is small (N < 30) with unclear patterns

  • High variance in coverage estimates

  • Model appears underfit (high RMSE, low R²)

  • Coverage is acceptable but prediction accuracy is poor


Calibration in Active Learning Context

During Bayesian optimization, calibration affects:

Acquisition Function Performance:

  • Expected Improvement: Relies on accurate σ for exploration/exploitation balance

  • UCB: Directly uses σ in the acquisition formula (see the toy example after this list)

  • Probability of Improvement: Needs correct uncertainty quantification
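
As a toy, self-contained numpy illustration (made-up numbers; kappa is the standard UCB exploration weight), halving every σ, which is effectively what an over-confident model does, changes which candidate UCB ranks first:

import numpy as np

mu    = np.array([0.80, 0.60, 0.75])  # predicted means at three candidates
sigma = np.array([0.05, 0.30, 0.10])  # predicted standard deviations
kappa = 1.0                            # exploration weight

ucb_calibrated    = mu + kappa * sigma
ucb_overconfident = mu + kappa * (0.5 * sigma)   # intervals too narrow

print(np.argmax(ucb_calibrated))     # 1: favours the uncertain candidate
print(np.argmax(ucb_overconfident))  # 0: shifts toward pure exploitation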

Optimization Strategy:

  • Over-confident model: May converge prematurely to local optima (too much exploitation)

  • Under-confident model: May waste experiments exploring known regions (too much exploration)

  • Well-calibrated model: Optimal balance, efficient convergence

Stopping Criteria:

  • Calibrated uncertainties help determine when optimization has converged

  • Under-confident models may never reach stopping criteria

  • Over-confident models may stop too early


ALchemist's Automatic Calibration

ALchemist implements automatic uncertainty calibration:

  1. Cross-validation: Computes z-scores from CV predictions
  2. Calibration factor: Calculates s = Std(z) from the standardized residuals (see the sketch after this list)
  3. Scaling: Multiplies predicted σ by calibration factor
  4. Application: Automatically applied to future predictions
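
A minimal numpy sketch of this procedure, using synthetic data in place of the cross-validation results (variable names are illustrative, not ALchemist's internals):

import numpy as np

rng = np.random.default_rng(1)
mu_cv    = rng.normal(size=50)                      # CV predictive means
sigma_cv = np.full(50, 0.5)                         # model claims sigma = 0.5
y_cv     = mu_cv + rng.normal(scale=0.75, size=50)  # true noise is larger -> over-confident

z = (y_cv - mu_cv) / sigma_cv   # standardized CV residuals
s = z.std(ddof=1)               # calibration factor, roughly 1.5 by construction
print(f"Calibration factor s = Std(z) = {s:.2f}")

# Applying it: new predictive sigmas are rescaled by s
sigma_new        = np.array([0.40, 0.55, 0.30])
sigma_calibrated = s * sigma_new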

Effect:

  • If Std(z) = 1.5 (over-confident), future σ predictions are scaled by 1.5×

  • If Std(z) = 0.7 (under-confident), future σ predictions are scaled by 0.7×

  • Brings model toward better calibration without retraining

Toggle:

  • Calibrated vs. uncalibrated results viewable in visualization panel

  • Compare to see calibration impact on your specific dataset


Summary

Pattern            Coverage vs. Diagonal   Issue                    Primary Fix
On diagonal        Aligned ✓               None                     -
Below diagonal     Empirical < Nominal     Over-confident           Increase noise/uncertainty
Above diagonal     Empirical > Nominal     Under-confident          Reduce noise, tighter kernel
Crosses diagonal   Mixed                   Model misspecification   Check data, try different kernel
High scatter       Variable                Small sample             Collect more data

Further Reading

For theoretical background on uncertainty calibration:

  • Guo et al. (2017), "On Calibration of Modern Neural Networks"

  • Kuleshov et al. (2018), "Accurate Uncertainties for Deep Learning Using Calibrated Regression"

  • DeGroot & Fienberg (1983), "The Comparison and Evaluation of Forecasters"