Metrics Evolution Plot¶
The metrics evolution plot in ALchemist tracks how your Gaussian Process model's predictive performance changes as you collect more experimental data during the active learning loop. It provides a visual record of model improvement and helps you decide when to stop optimization.
What the Metrics Plot Shows¶
- X-axis: Number of observations (training data points)
- Y-axis: Performance metric value(s)
Key elements:
- Line plot(s): Evolution of selected metric(s) over data collection
- Metric options: RMSE, MAE, MAPE, R²
- Multiple metrics: Can display several metrics simultaneously
- Trends: Visual indication of convergence or degradation
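
ALchemist draws this plot for you, but the same view is easy to reproduce from exported per-iteration metrics. A minimal sketch, assuming you have logged one value per re-training (the arrays below are hypothetical numbers, roughly matching the example further down this page):

```python
import matplotlib.pyplot as plt

# Hypothetical logged values: one entry per model re-training.
n_obs = [10, 15, 20, 25, 30, 35, 40]
rmse = [15.0, 11.5, 9.5, 8.0, 6.5, 5.5, 5.0]
r2 = [0.60, 0.72, 0.82, 0.88, 0.91, 0.93, 0.94]

fig, ax_rmse = plt.subplots()
ax_rmse.plot(n_obs, rmse, "o-", color="tab:blue", label="RMSE")
ax_rmse.set_xlabel("Number of observations")
ax_rmse.set_ylabel("RMSE")

ax_r2 = ax_rmse.twinx()                      # second y-axis for the 0-1 metric
ax_r2.plot(n_obs, r2, "s--", color="tab:orange", label="R²")
ax_r2.set_ylabel("R²")

fig.legend(loc="upper center")
plt.show()
```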
Available Metrics¶
Root Mean Squared Error (RMSE)¶
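For reference, the standard definition (writing \(y_i\) for the observed values, \(\hat{y}_i\) for the model predictions, and \(n\) for the number of validation points):

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]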
Interpretation:
- Units match your response variable
- Lower is better (0 = perfect predictions)
- Penalizes large errors more than small ones
- Most common metric for regression
Typical values:
- RMSE < 5% of response range: Excellent
- RMSE 5-10% of response range: Good
- RMSE > 20% of response range: Poor or insufficient data
Mean Absolute Error (MAE)¶
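Standard definition, same notation as above:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\]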
Interpretation:
- Units match your response variable
- Lower is better (0 = perfect predictions)
- Less sensitive to outliers than RMSE
- Average magnitude of errors
Comparison to RMSE:
- MAE ≤ RMSE always (equality only when all errors have the same magnitude)
- The RMSE/MAE ratio indicates the error distribution
- Ratio ≈ 1: Uniform errors
- Ratio > 1.5: Some large errors (outliers)
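
A quick worked example of the RMSE/MAE ratio heuristic just described, using a hypothetical set of residuals:

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])   # three small errors, one outlier
mae = np.abs(errors).mean()                # 3.25
rmse = np.sqrt((errors ** 2).mean())       # ≈ 5.07
print(rmse / mae)                          # ≈ 1.56 → ratio > 1.5 flags outliers
```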
Mean Absolute Percentage Error (MAPE)¶
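Standard definition, same notation; the \(y_i\) in the denominator is what makes values near zero problematic:

\[
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]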
Interpretation:
- Percentage units (scale-independent)
- Lower is better (0% = perfect predictions)
- Useful for comparing across different response ranges
- Can be unstable if \(y_i\) is near zero
Typical values:
- MAPE < 5%: Excellent
- MAPE 5-10%: Good
- MAPE 10-20%: Acceptable
- MAPE > 20%: Poor
Warning: Undefined if any true value is exactly zero. ALchemist skips MAPE calculation in this case.
Coefficient of Determination (R²)¶
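Standard definition, same notation:

\[
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]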
Where \(\bar{y}\) is the mean of observed values.
Interpretation:
- Dimensionless (0 to 1 for good models)
- Higher is better (1 = perfect predictions)
- Proportion of variance explained by model
- Can be negative for very poor models
Typical values:
- R² > 0.95: Excellent
- R² 0.90-0.95: Good
- R² 0.80-0.90: Acceptable
- R² < 0.80: Poor or insufficient data
Note: R² from cross-validation (ALchemist default) is more reliable than training R².
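
If you want to reproduce the reported values outside the GUI, the four metrics can be computed from held-out observations and their cross-validated predictions with standard scikit-learn functions. This is a sketch of the conventional calculations, not necessarily identical to ALchemist's internal implementation:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def summarize(y_true, y_pred):
    """Compute the four metrics shown in the metrics evolution plot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    metrics = {
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }
    if np.all(y_true != 0):  # MAPE is undefined when any true value is exactly zero
        metrics["MAPE"] = 100 * mean_absolute_percentage_error(y_true, y_pred)
    return metrics
```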
When to Use the Metrics Plot¶
During Active Learning¶
Essential for:
- Monitoring model improvement as data accumulates
- Deciding when to stop optimization
- Detecting convergence or plateaus
- Identifying data quality issues
Check metrics plot:
- After each batch of experiments
- Before requesting new candidates
- When considering stopping criteria
Comparing Models¶
Use metrics plot to:
- Compare different kernels (Matern ν=1.5 vs RBF)
- Evaluate different backends (sklearn vs BoTorch)
- Test impact of transforms
- Assess hyperparameter choices
Diagnosing Issues¶
Metrics plot reveals:
- Degrading performance (possible overfitting)
- Stagnant metrics (plateau reached)
- Erratic behavior (data quality problems)
- Unexpected trends (check preprocessing)
Accessing the Metrics Plot¶
In Web Application¶
- Train a model in the GPR Panel
- Click "Show Model Visualizations"
- Select "Metrics Plot" from plot type buttons
- Choose which metrics to display (checkboxes)
In Desktop Application¶
- Train model in Model panel
- Open Visualizations dialog
- Metrics evolution available in plot options
- Customize display and export
Interpreting Common Patterns¶
Ideal Pattern: Steady Improvement¶
- Metrics improve (RMSE/MAE/MAPE decrease, R² increases) as data accumulates
- Rate of improvement slows as the model converges
- Eventually plateaus at acceptable performance
What this means: Active learning is working as expected
Action: Continue until plateau, then stop optimization
Example:
- Start: RMSE = 15, R² = 0.60 (10 samples)
- Mid: RMSE = 8, R² = 0.88 (25 samples)
- End: RMSE = 5, R² = 0.94 (40 samples) → Plateau
Warning Pattern: Degrading Performance¶
- Metrics worsen (RMSE/MAE increase, R² decreases) as data accumulates
- Performance peaks early then declines
- May indicate overfitting or data issues
What this means: Problem with data quality or model
Actions:
- Check for outliers in recent data
- Verify experimental measurements
- Inspect data preprocessing
- Try different kernel or regularization
- Check whether hyperparameter optimization is failing
Warning Pattern: No Improvement¶
- Metrics flat or erratic across all data sizes
- No clear trend of improvement
- High variance between evaluations
What this means: Model not learning from data
Actions:
- Verify variable space covers response range
- Check data is actually varying (not constant)
- Ensure preprocessing is appropriate
- Try different kernel family
- Inspect initial sampling distribution
Expected Pattern: Early Volatility¶
- Metrics fluctuate significantly with very few samples (< 15)
- Behavior stabilizes as data accumulates
- Trends become clear after 20-30 samples
What this means: Normal statistical noise with small samples
Action: Don't over-interpret early behavior, wait for more data
Deciding When to Stop¶
Convergence Criteria¶
Metrics-based:
- RMSE/MAE plateau (< 5% change over 10 samples)
- R² > target threshold (e.g., 0.90 or 0.95)
- Absolute performance acceptable for application
Practical:
- Budget exhausted (time, cost, materials)
- Acceptable optimum found (target performance reached)
- Diminishing returns (effort exceeds benefit)
Common Stopping Rules¶
Conservative:
- R² > 0.95 AND no improvement in last 15 samples
- RMSE < 2% of response range for 20 consecutive samples
- Validation metrics stable across 3 CV folds
Moderate (sketched in code after these lists):
- R² > 0.90 AND plateau for 10 samples
- RMSE < 5% of response range
- Acquisition function values < threshold
Aggressive:
- R² > 0.85 achieved
- RMSE better than baseline/literature
- Optimization objective reached
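
As an illustration, the moderate rule can be written as a small check over the logged metric history. This is a sketch with hypothetical names (`rmse_history` and `r2_history` hold one value per re-training); tune the thresholds to your application:

```python
def should_stop(rmse_history, r2_history, r2_target=0.90, window=10, tol=0.05):
    """Moderate stopping rule: R² above target AND RMSE plateaued
    (relative change below `tol` over the last `window` evaluations)."""
    if len(rmse_history) < window + 1 or len(r2_history) == 0:
        return False  # not enough history to judge a plateau yet
    old, new = rmse_history[-(window + 1)], rmse_history[-1]
    plateaued = abs(new - old) / abs(old) < tol
    return r2_history[-1] >= r2_target and plateaued
```

The conservative and aggressive rules follow the same pattern with different values of `r2_target`, `window`, and `tol`.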
Understanding Cross-Validation Metrics¶
K-Fold Cross-Validation¶
ALchemist uses k-fold cross-validation (default: 5 folds) for all datasets; a code sketch of this procedure follows the lists below.
Process:
- Split data into k groups (typically 5)
- For each group:
    - Train on the other k-1 groups
    - Predict on the held-out group
- Aggregate predictions across all folds
Advantages:
- Good balance of bias/variance
- Computationally efficient
- Standard practice in machine learning
- Reliable estimates across dataset sizes
Interpretation:
- Reflects generalization to new data
- More pessimistic than training metrics
- More realistic for decision-making
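
The procedure above can be sketched with the scikit-learn building blocks underlying the sklearn backend. This is illustrative only; ALchemist's actual pipeline (kernel settings, transforms, preprocessing) may differ:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Placeholder data standing in for your experimental design and responses.
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 3))                       # 30 experiments, 3 variables
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=30)

model = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

y_oof = cross_val_predict(model, X, y, cv=cv)       # out-of-fold predictions

print("CV RMSE:", np.sqrt(mean_squared_error(y, y_oof)))
print("CV R²:  ", r2_score(y, y_oof))
```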
Metric Selection Guidelines¶
Use RMSE when:¶
- Response units meaningful (e.g., temperature °C, yield %)
- Large errors particularly problematic
- Standard metric expected in your field
- Comparing models on same response
Use MAE when:¶
- Outliers present (MAE is more robust)
- All errors should be weighted equally
- Easier interpretation needed (average error)
- RMSE vs MAE comparison informative (error distribution)
Use MAPE when:¶
- Comparing across different response scales
- Percentage errors more interpretable
- Response values far from zero (avoid division issues)
- Scale-independent comparison needed
Use R² when:¶
- Variance explanation important
- Comparing to baseline (R² = 0)
- Standard metric in your field (common in chemistry/materials)
- Want single dimensionless metric
Display multiple metrics when:¶
- Want comprehensive view
- Different stakeholders prefer different metrics
- Checking consistency across metrics
- Diagnosing specific issues
Integration with Optimization¶
Metrics evolution informs optimization strategy:
Acquisition Function Choice¶
Poor metrics (R² < 0.75):
- Favor exploration (UCB with high κ)
- Collect diverse data first
- Consider space-filling designs
Good metrics (R² > 0.90):
- Allow exploitation (EI, PI)
- Trust model predictions
- Focus on promising regions
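
As a rough rule of thumb, the guidance above can be encoded in a tiny helper. The thresholds and return strings are illustrative, not an ALchemist API:

```python
def suggest_acquisition(cv_r2: float) -> str:
    """Heuristic: map cross-validated R² to an exploration/exploitation leaning."""
    if cv_r2 < 0.75:
        return "explore: UCB with high kappa, or space-filling candidates"
    if cv_r2 < 0.90:
        return "balance: UCB with moderate kappa"
    return "exploit: EI or PI, focus on promising regions"
```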
Batch Size Decisions¶
Rapidly improving metrics:
- Smaller batches (adapt quickly)
- Re-train frequently
- Stay responsive to learning
Plateaued metrics:
- Larger batches acceptable
- Less frequent re-training
- Efficiency over responsiveness
Stopping Criteria¶
Metrics-driven stopping:
- Set R² or RMSE threshold
- Monitor plateau duration
- Balance performance vs cost
Practical Tips¶
Displaying Multiple Metrics¶
Recommended combinations:
- Standard: RMSE + R² (accuracy and variance explained)
- Comprehensive: RMSE + MAE + R² (multiple perspectives)
- Scale-independent: MAPE + R² (for broad comparisons)
Avoid:
- Too many metrics (cluttered plot)
- Redundant pairs (RMSE + MAE without reason)
Interpreting Trends¶
- Smooth trends: Good model stability
- Erratic jumps: Check data quality or hyperparameter optimization
- Sudden drops: Possible outlier added or CV issue
- Linear improvement: Still learning, more data beneficial
Comparing Sessions¶
When comparing:
- Use same metrics across sessions
- Account for different data sizes
- Consider response scale differences
- Check cross-validation method consistency
Troubleshooting¶
If metrics aren't improving, check the parity plot for systematic bias and try different kernels. If metrics worsen, inspect recent data for outliers. High variance is expected with small datasets (n < 25). For persistent issues, see Model Performance.
Further Reading¶
- Parity Plot - Visual accuracy assessment
- Model Performance - Comprehensive model evaluation
- Q-Q Plot - Uncertainty calibration diagnostic
- Calibration Curve - Coverage verification
Key Takeaway: The metrics evolution plot is your guide to tracking model improvement during active learning. Use it to decide when your model is good enough and when to stop collecting data.