Metrics Evolution Plot¶
The metrics evolution plot in ALchemist tracks how your Gaussian Process model's predictive performance changes as you collect more experimental data during the active learning loop. It provides a visual record of model improvement and helps you decide when to stop optimization.
What the Metrics Plot Shows¶
- X-axis: Number of observations (training data points)
- Y-axis: Performance metric value(s)
Key elements:
- Line plot(s): Evolution of selected metric(s) over data collection
- Metric options: RMSE, MAE, MAPE, R²
- Multiple metrics: Can display several metrics simultaneously
- Trends: Visual indication of convergence or degradation
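
ALchemist draws this plot for you, but the same view is easy to reproduce from exported per-iteration metrics. A minimal sketch, assuming you have logged one value per re-training (the arrays below are hypothetical numbers, roughly matching the example further down this page):

```python
import matplotlib.pyplot as plt

# Hypothetical logged values: one entry per model re-training.
n_obs = [10, 15, 20, 25, 30, 35, 40]
rmse = [15.0, 11.5, 9.5, 8.0, 6.5, 5.5, 5.0]
r2 = [0.60, 0.72, 0.82, 0.88, 0.91, 0.93, 0.94]

fig, ax_rmse = plt.subplots()
ax_rmse.plot(n_obs, rmse, "o-", color="tab:blue", label="RMSE")
ax_rmse.set_xlabel("Number of observations")
ax_rmse.set_ylabel("RMSE")

ax_r2 = ax_rmse.twinx()                      # second y-axis for the 0-1 metric
ax_r2.plot(n_obs, r2, "s--", color="tab:orange", label="R²")
ax_r2.set_ylabel("R²")

fig.legend(loc="upper center")
plt.show()
```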
Available Metrics¶
Root Mean Squared Error (RMSE)¶
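For reference, the standard definition (writing \(y_i\) for the observed values, \(\hat{y}_i\) for the model predictions, and \(n\) for the number of validation points):

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\]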
Interpretation:
- Units match your response variable
- Lower is better (0 = perfect predictions)
- Penalizes large errors more than small ones
- Most common metric for regression
Typical values:
- RMSE < 5% of response range: Excellent
- RMSE 5-10% of response range: Good
- RMSE > 20% of response range: Poor or insufficient data
Mean Absolute Error (MAE)¶
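Standard definition, same notation as above:

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\]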
Interpretation:
- Units match your response variable
- Lower is better (0 = perfect predictions)
- Less sensitive to outliers than RMSE
- Average magnitude of errors
Comparison to RMSE:
- MAE ≤ RMSE always (equality only when all errors have the same magnitude)
- The RMSE/MAE ratio indicates the error distribution
- Ratio ≈ 1: Uniform errors
- Ratio > 1.5: Some large errors (outliers)
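
A quick worked example of the RMSE/MAE ratio heuristic just described, using a hypothetical set of residuals:

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])   # three small errors, one outlier
mae = np.abs(errors).mean()                # 3.25
rmse = np.sqrt((errors ** 2).mean())       # ≈ 5.07
print(rmse / mae)                          # ≈ 1.56 → ratio > 1.5 flags outliers
```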
Mean Absolute Percentage Error (MAPE)¶
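Standard definition, same notation; the \(y_i\) in the denominator is what makes values near zero problematic:

\[
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\]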
Interpretation:
- Percentage units (scale-independent)
- Lower is better (0% = perfect predictions)
- Useful for comparing across different response ranges
- Can be unstable if \(y_i\) is near zero
Typical values:
- MAPE < 5%: Excellent
- MAPE 5-10%: Good
- MAPE 10-20%: Acceptable
- MAPE > 20%: Poor
Warning: Undefined if any true value is exactly zero. ALchemist skips MAPE calculation in this case.
Coefficient of Determination (R²)¶
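Standard definition, same notation:

\[
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]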
Where \(\bar{y}\) is the mean of observed values.
Interpretation:
- Dimensionless (0 to 1 for good models)
- Higher is better (1 = perfect predictions)
- Proportion of variance explained by model
- Can be negative for very poor models
Typical values:
- R² > 0.95: Excellent
- R² 0.90-0.95: Good
- R² 0.80-0.90: Acceptable
- R² < 0.80: Poor or insufficient data
Note: R² from cross-validation (ALchemist default) is more reliable than training R².
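
If you want to reproduce the reported values outside the GUI, the four metrics can be computed from held-out observations and their cross-validated predictions with standard scikit-learn functions. This is a sketch of the conventional calculations, not necessarily identical to ALchemist's internal implementation:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

def summarize(y_true, y_pred):
    """Compute the four metrics shown in the metrics evolution plot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    metrics = {
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }
    if np.all(y_true != 0):  # MAPE is undefined when any true value is exactly zero
        metrics["MAPE"] = 100 * mean_absolute_percentage_error(y_true, y_pred)
    return metrics
```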
When to Use the Metrics Plot¶
During Active Learning¶
Essential for:
- Monitoring model improvement as data accumulates
- Deciding when to stop optimization
- Detecting convergence or plateaus
- Identifying data quality issues
Check metrics plot:
- After each batch of experiments
- Before requesting new candidates
- When considering stopping criteria
Comparing Models¶
Use metrics plot to:
- Compare different kernels (Matern ν=1.5 vs RBF)
- Evaluate different backends (sklearn vs BoTorch)
- Test impact of transforms
- Assess hyperparameter choices
Diagnosing Issues¶
Metrics plot reveals:
- Degrading performance (possible overfitting)
- Stagnant metrics (plateau reached)
- Erratic behavior (data quality problems)
- Unexpected trends (check preprocessing)
Accessing the Metrics Plot¶
In Web Application¶
- Train a model in the GPR Panel
- Click "Show Model Visualizations"
- Select "Metrics Plot" from plot type buttons
- Choose which metrics to display (checkboxes)
In Desktop Application¶
- Train model in Model panel
- Open Visualizations dialog
- Metrics evolution available in plot options
- Customize display and export
Interpreting Common Patterns¶
Ideal Pattern: Steady Improvement¶
- Metrics improve (RMSE/MAE/MAPE decrease, R² increases) as data accumulates
- Rate of improvement slows as the model converges
- Eventually plateaus at acceptable performance
What this means: Active learning is working as expected
Action: Continue until plateau, then stop optimization
Example:
- Start: RMSE = 15, R² = 0.60 (10 samples)
- Mid: RMSE = 8, R² = 0.88 (25 samples)
- End: RMSE = 5, R² = 0.94 (40 samples) → Plateau
Warning Pattern: Degrading Performance¶
- Metrics worsen (RMSE/MAE increase, R² decreases) as data accumulates
- Performance peaks early then declines
- May indicate overfitting or data issues
What this means: Problem with data quality or model
Actions:
- Check for outliers in recent data
- Verify experimental measurements
- Inspect data preprocessing
- Try different kernel or regularization
- Check whether hyperparameter optimization is failing
Warning Pattern: No Improvement¶
- Metrics flat or erratic across all data sizes
- No clear trend of improvement
- High variance between evaluations
What this means: Model not learning from data
Actions:
- Verify variable space covers response range
- Check data is actually varying (not constant)
- Ensure preprocessing is appropriate
- Try different kernel family
- Inspect initial sampling distribution
Expected Pattern: Early Volatility¶
- Metrics fluctuate significantly with very few samples (< 15)
- Behavior stabilizes as data accumulates
- Trends become clear after 20-30 samples
What this means: Normal statistical noise with small samples
Action: Don't over-interpret early behavior, wait for more data
Deciding When to Stop¶
Convergence Criteria¶
Metrics-based:
- RMSE/MAE plateau (< 5% change over 10 samples)
- R² > target threshold (e.g., 0.90 or 0.95)
- Absolute performance acceptable for application
Practical:
- Budget exhausted (time, cost, materials)
- Acceptable optimum found (target performance reached)
- Diminishing returns (effort exceeds benefit)
Common Stopping Rules¶
Conservative:
- R² > 0.95 AND no improvement in last 15 samples
- RMSE < 2% of response range for 20 consecutive samples
- Validation metrics stable across 3 CV folds
Moderate (sketched in code after these lists):
- R² > 0.90 AND plateau for 10 samples
- RMSE < 5% of response range
- Acquisition function values < threshold
Aggressive:
- R² > 0.85 achieved
- RMSE better than baseline/literature
- Optimization objective reached
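
As an illustration, the moderate rule can be written as a small check over the logged metric history. This is a sketch with hypothetical names (`rmse_history` and `r2_history` hold one value per re-training); tune the thresholds to your application:

```python
def should_stop(rmse_history, r2_history, r2_target=0.90, window=10, tol=0.05):
    """Moderate stopping rule: R² above target AND RMSE plateaued
    (relative change below `tol` over the last `window` evaluations)."""
    if len(rmse_history) < window + 1 or len(r2_history) == 0:
        return False  # not enough history to judge a plateau yet
    old, new = rmse_history[-(window + 1)], rmse_history[-1]
    plateaued = abs(new - old) / abs(old) < tol
    return r2_history[-1] >= r2_target and plateaued
```

The conservative and aggressive rules follow the same pattern with different values of `r2_target`, `window`, and `tol`.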
Understanding Cross-Validation Metrics¶
K-Fold Cross-Validation¶
ALchemist uses k-fold cross-validation (default: 5 folds) for all datasets; a code sketch of this procedure follows the lists below.
Process:
- Split data into k groups (typically 5)
- For each group:
    - Train on the other k-1 groups
    - Predict on the held-out group
- Aggregate predictions across all folds
Advantages:
- Good balance of bias/variance
- Computationally efficient
- Standard practice in machine learning
- Reliable estimates across dataset sizes
Interpretation:
- Reflects generalization to new data
- More pessimistic than training metrics
- More realistic for decision-making
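
The procedure above can be sketched with the scikit-learn building blocks underlying the sklearn backend. This is illustrative only; ALchemist's actual pipeline (kernel settings, transforms, preprocessing) may differ:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict

# Placeholder data standing in for your experimental design and responses.
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 3))                       # 30 experiments, 3 variables
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=30)

model = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

y_oof = cross_val_predict(model, X, y, cv=cv)       # out-of-fold predictions

print("CV RMSE:", np.sqrt(mean_squared_error(y, y_oof)))
print("CV R²:  ", r2_score(y, y_oof))
```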
Metric Selection Guidelines¶
Use RMSE when:¶
- Response units meaningful (e.g., temperature °C, yield %)
- Large errors particularly problematic
- Standard metric expected in your field
- Comparing models on same response
Use MAE when:¶
- Outliers present (MAE is more robust)
- All errors should be weighted equally
- Easier interpretation needed (average error)
- RMSE vs MAE comparison informative (error distribution)
Use MAPE when:¶
- Comparing across different response scales
- Percentage errors more interpretable
- Response values far from zero (avoid division issues)
- Scale-independent comparison needed
Use R² when:¶
- Variance explanation important
- Comparing to baseline (R² = 0)
- Standard metric in your field (common in chemistry/materials)
- Want single dimensionless metric
Display multiple metrics when:¶
- Want comprehensive view
- Different stakeholders prefer different metrics
- Checking consistency across metrics
- Diagnosing specific issues
Integration with Optimization¶
Metrics evolution informs optimization strategy:
Acquisition Function Choice¶
Poor metrics (R² < 0.75):
- Favor exploration (UCB with high κ)
- Collect diverse data first
- Consider space-filling designs
Good metrics (R² > 0.90):
- Allow exploitation (EI, PI)
- Trust model predictions
- Focus on promising regions
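
As a rough rule of thumb, the guidance above can be encoded in a tiny helper. The thresholds and return strings are illustrative, not an ALchemist API:

```python
def suggest_acquisition(cv_r2: float) -> str:
    """Heuristic: map cross-validated R² to an exploration/exploitation leaning."""
    if cv_r2 < 0.75:
        return "explore: UCB with high kappa, or space-filling candidates"
    if cv_r2 < 0.90:
        return "balance: UCB with moderate kappa"
    return "exploit: EI or PI, focus on promising regions"
```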
Batch Size Decisions¶
Rapidly improving metrics:
- Smaller batches (adapt quickly)
- Re-train frequently
- Stay responsive to learning
Plateaued metrics:
- Larger batches acceptable
- Less frequent re-training
- Efficiency over responsiveness
Stopping Criteria¶
Metrics-driven stopping:
- Set R² or RMSE threshold
- Monitor plateau duration
- Balance performance vs cost
Practical Tips¶
Displaying Multiple Metrics¶
Recommended combinations:
- Standard: RMSE + R² (accuracy and variance explained)
- Comprehensive: RMSE + MAE + R² (multiple perspectives)
- Scale-independent: MAPE + R² (for broad comparisons)
Avoid:
- Too many metrics (cluttered plot)
- Redundant pairs (RMSE + MAE without reason)
Interpreting Trends¶
- Smooth trends: Good model stability
- Erratic jumps: Check data quality or hyperparameter optimization
- Sudden drops: Possible outlier added or CV issue
- Linear improvement: Still learning, more data beneficial
Comparing Sessions¶
When comparing:
- Use same metrics across sessions
- Account for different data sizes
- Consider response scale differences
- Check cross-validation method consistency
Troubleshooting¶
If metrics aren't improving, check the parity plot for systematic bias and try different kernels. If metrics worsen, inspect recent data for outliers. High variance is expected with small datasets (n < 25). For persistent issues, see Model Performance.
Further Reading¶
- Parity Plot - Visual accuracy assessment
- Model Performance - Comprehensive model evaluation
- Q-Q Plot - Uncertainty calibration diagnostic
- Calibration Curve - Coverage verification
Key Takeaway: The metrics evolution plot is your guide to tracking model improvement during active learning. Use it to decide when your model is good enough and when to stop collecting data.