Model Performance and Methodology

This Explainable Boosting Machine (EBM) model was developed through an extensive cross-validation process involving 1,200 experiments across various model types and hyperparameters. The final model, fitted to the entire dataset, demonstrates robust performance:

  • Matthews Correlation Coefficient (MCC): $0.88$
  • Area Under the ROC Curve (AUC): $0.99$

The MCC is defined as:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

where TP, TN, FP, and FN are True Positives, True Negatives, False Positives, and False Negatives, respectively.

The AUC represents the model’s ability to distinguish between classes and is equal to the probability that the model ranks a random positive instance higher than a random negative instance.
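
For reference, both metrics can be computed with scikit-learn; the labels and scores below are illustrative placeholders rather than our data.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

# Illustrative ground-truth labels and predicted probabilities (not our data).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.05, 0.10, 0.30, 0.85, 0.60, 0.20, 0.95, 0.40])
y_pred = (y_score >= 0.5).astype(int)  # hard labels at an illustrative 0.5 cut-off

print("MCC:", matthews_corrcoef(y_true, y_pred))  # uses the confusion-matrix counts from the formula above
print("AUC:", roc_auc_score(y_true, y_score))     # probability a random positive outranks a random negative
```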

Interpreting the EBM Explanations

The following interactive plot provides global explanations for the EBM model:

  1. Each graph represents a feature’s impact on the model’s prediction.
  2. The y-axis shows the feature’s contribution to the log-odds of the outcome.
  3. For continuous variables, the x-axis represents the feature value range.
  4. For categorical variables, each bar represents a category’s impact.

Note: Some categorical variables may appear as continuous due to missing data imputation techniques employed during preprocessing.
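
For reference, a global-explanation view like the one described above can be produced directly with the `interpret` package; the snippet assumes a fitted `ExplainableBoostingClassifier` named `ebm` (an illustrative name).

```python
from interpret import show

# Opens the interactive global-explanation dashboard for a fitted EBM:
# one panel per feature, with its contribution to the log-odds on the y-axis.
show(ebm.explain_global())
```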

Creating the Prediction Model

Development of an Interpretable Ensemble Model for Clinical Prediction

The process begins with the single, well-performing Explainable Boosting Machine (EBM) model described above and evolves into a carefully constructed ensemble of diverse models. This approach maintains interpretability while enhancing prediction reliability.

Methodological Framework

Base Model Development

We began by fitting a single EBM model to our complete dataset. This initial model served two crucial purposes. First, it identified the 15 most important features through EBM’s native interpretability mechanisms. Second, it provided a strong foundation for developing our ensemble through its optimized hyperparameters.
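
A minimal sketch of this step, assuming a feature matrix `X` and a binary outcome `y` (illustrative names) and default hyperparameters; the actual settings carried over from the 1,200-experiment search are not reproduced here.

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

# X (DataFrame of predictors) and y (0 = "Alive", 1 = "Dead") are assumed to exist.
base_ebm = ExplainableBoostingClassifier(random_state=0)
base_ebm.fit(X, y)

# Rank terms by the EBM's native importances and keep the 15 strongest.
# Note: term names may include pairwise interactions ("feat_a & feat_b").
importances = base_ebm.term_importances()
top15_idx = np.argsort(importances)[::-1][:15]
top15_terms = [base_ebm.term_names_[i] for i in top15_idx]
print(top15_terms)
```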

Ensemble Construction

Building upon our base model, we developed a diverse set of 100 EBM variants. Each variant was created by introducing controlled random noise to the base model’s hyperparameters while maintaining their fundamental characteristics. This approach ensured meaningful diversity while preserving model quality.
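
A sketch of how such variants might be generated; the hyperparameters perturbed and the noise ranges below are illustrative assumptions, not the exact settings used.

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

rng = np.random.default_rng(42)
base_params = {"learning_rate": 0.01, "max_bins": 256, "max_leaves": 3}  # illustrative base values

variants = []
for seed in range(100):
    # Jitter each hyperparameter around its base value while keeping it in a sensible range.
    params = {
        "learning_rate": float(base_params["learning_rate"] * rng.uniform(0.5, 2.0)),
        "max_bins": int(np.clip(base_params["max_bins"] + rng.integers(-64, 65), 32, 512)),
        "max_leaves": int(np.clip(base_params["max_leaves"] + rng.integers(-1, 2), 2, 8)),
        "random_state": seed,
    }
    variants.append(ExplainableBoostingClassifier(**params))
```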

We evaluated each of these 100 models using a consistent train-test split methodology, with 75% of data for training and 25% for testing. The evaluation process focused on three key metrics:

  • Matthews Correlation Coefficient (MCC)
  • Area Under the ROC Curve (AUC)
  • False Positive Count

MCC and AUC were combined into a single selection score used to rank the candidate models:

$$\text{Score} = (1 + \text{MCC}) \times 0.25 + 0.5 \times \text{AUC}$$
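
A sketch of the per-model evaluation under the same assumptions (`X`, `y`, and the `variants` list from the previous sketches); as in the text, the false positive count is recorded alongside the composite score.

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# One consistent 75% / 25% split reused for every candidate model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

results = []
for model in variants:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    pred = model.predict(X_test)

    mcc = matthews_corrcoef(y_test, pred)
    auc = roc_auc_score(y_test, proba)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

    score = (1 + mcc) * 0.25 + 0.5 * auc  # composite score from the formula above
    results.append({"model": model, "mcc": mcc, "auc": auc, "false_positives": fp, "score": score})

results.sort(key=lambda r: r["score"], reverse=True)
```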

Final Ensemble Architecture

The ensemble makes predictions using a maximum probability approach across all models, with a carefully calibrated decision threshold of 0.082. This threshold was determined through rigorous testing to optimize performance while maintaining clinical safety constraints.
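
A minimal sketch of that decision rule, assuming `models` is the list of fitted ensemble members (an illustrative name).

```python
import numpy as np

THRESHOLD = 0.082  # calibrated decision threshold

def ensemble_predict(models, X):
    """Maximum predicted probability of the positive class across all ensemble members."""
    per_model = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    max_proba = per_model.max(axis=1)
    return max_proba, (max_proba >= THRESHOLD).astype(int)  # 1 = "Dead", 0 = "Alive"
```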

Validation and Performance

Robustness Testing

We validated our ensemble’s performance through 50 different random data splits. This extensive testing compared the ensemble against each individual model’s performance, consistently demonstrating the ensemble’s superior reliability and accuracy.
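
A sketch of that robustness loop, reusing the illustrative `X`, `y`, `variants`, and `ensemble_predict` from the earlier sketches; aggregating the per-split metrics gives mean ± standard deviation figures like those reported below.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

mccs, aucs = [], []
for seed in range(50):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)
    fitted = [clone(m).fit(X_tr, y_tr) for m in variants]  # refit every ensemble member on this split
    proba, pred = ensemble_predict(fitted, X_te)
    mccs.append(matthews_corrcoef(y_te, pred))
    aucs.append(roc_auc_score(y_te, proba))

print(f"MCC: {np.mean(mccs):.2f} ± {np.std(mccs):.2f}")
print(f"AUC: {np.mean(aucs):.2f} ± {np.std(aucs):.3f}")
```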

Performance Metrics

Across these 50 splits, the ensemble achieved strong and consistent performance:

  • MCC: 0.64 ± 0.08
  • AUC: 0.96 ± 0.015

These results demonstrate significant improvements over individual model performance, particularly in consistency and reliability.

Interpretability Framework

Despite being an ensemble model, we maintained interpretability through two complementary approaches:

  1. Each component EBM model provides its native interpretability mechanisms
  2. We integrated SHAP (SHapley Additive exPlanations) values for ensemble-level explanations

This dual approach ensures that predictions remain transparent and interpretable for clinical use, providing both global feature importance and individual prediction explanations.

Conclusion

Our methodology successfully creates a robust, interpretable ensemble model that outperforms individual models while maintaining clinical applicability. The careful balance of diversity in model parameters, rigorous validation, and maintained interpretability makes this approach particularly suitable for clinical applications where both accuracy and explainability are crucial.

Using the Interactive Prediction Tool

This tool is intended for educational and research purposes only and should not be used for clinical decision-making without proper medical oversight. Below, you’ll find an interactive app for entering sample features and obtaining predictions:

  1. Select features from the dropdown menu and input their values.
  2. The tool will provide a prediction based on the entered data.

Important Caveats:

  • Predictions are based solely on the training data and may not generalize to all populations.
  • Entering only a few features is unlikely to result in a “Dead” prediction due to two factors:
    1. Our analysis indicates that the outcome depends on a complex interplay of numerous factors.
    2. Non-entered features are automatically filled with mean values from our dataset (see the sketch after this list). Given that our dataset is heavily imbalanced towards the negative class (“Alive”), these mean values tend to bias predictions towards the majority class.
  • The model’s performance relies on the comprehensive set of features used in training; predictions based on limited inputs should be interpreted with caution.
  • The model’s threshold was carefully tuned to approximately 0.082, meaning predictions above this value are classified as “Dead”. This threshold minimizes false positives (i.e., cases predicted as “Dead” but actually “Alive”), prioritizing specificity in clinical decision-making.
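
Below is a sketch of the mean-imputation behaviour described in the list above, assuming `X` is the numeric, already-encoded training DataFrame and `user_inputs` holds only the features the user entered; all names and feature labels are illustrative.

```python
import pandas as pd

def build_feature_row(X: pd.DataFrame, user_inputs: dict) -> pd.DataFrame:
    """Start from the training-set means and overwrite only the user-entered features."""
    row = X.mean().to_dict()      # defaults drawn from the (majority "Alive") population profile
    row.update(user_inputs)       # user-entered values take precedence
    return pd.DataFrame([row])[list(X.columns)]

# Example: enter just two features; every other feature stays at its dataset mean.
sample = build_feature_row(X, {"age": 72, "tumor_size_mm": 45})  # hypothetical feature names
```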

Interpreting SHAP Visualizations for Clinical Decision Support

Understanding SHAP Values

SHAP (SHapley Additive exPlanations) values provide personalized explanations for each prediction. In our ensemble model, we use the KernelExplainer with 25 background samples to generate these explanations.
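
A sketch of how these explanations could be generated with the `shap` package, assuming `models` is the list of fitted ensemble members and `X` is the training DataFrame (illustrative names); the wrapper mirrors the ensemble's maximum-probability prediction rule.

```python
import numpy as np
import shap

def ensemble_proba(data):
    """Maximum positive-class probability across ensemble members; the function SHAP explains."""
    data = np.asarray(data)
    return np.max(np.column_stack([m.predict_proba(data)[:, 1] for m in models]), axis=1)

background = shap.sample(X, 25, random_state=0)    # 25 background samples for stable estimates
explainer = shap.KernelExplainer(ensemble_proba, background)
shap_values = explainer.shap_values(X.iloc[[0]])   # per-feature contributions for one patient
```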

Reading the Waterfall Plot

Basic Structure

The waterfall plot displays how each feature contributes to moving from a base prediction (average model output) to the final prediction for a specific patient (a plotting sketch follows the list below):

  1. Base Value:

    • Starting point at left ($E[f(x)]$)
    • Represents average prediction across population
    • Typically around 0.082 (our threshold)
  2. Feature Contributions:

    • Each bar shows one feature’s impact
    • Blue bars: Push prediction toward high risk
    • Red bars: Push prediction toward low risk
    • Length of bar: Magnitude of impact
  3. Final Prediction:

    • Rightmost point
    • Sum of base value and all feature contributions
    • Compare to threshold (0.082) for decision
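
A sketch of rendering such a waterfall plot for one patient, reusing the `explainer` and `shap_values` from the previous sketch; note that `shap`'s default colours (red for contributions that raise the output, blue for those that lower it) may differ from the blue/red convention described here.

```python
import shap

# Wrap the KernelExplainer output for one patient into an Explanation object for plotting.
patient = X.iloc[0]
explanation = shap.Explanation(
    values=shap_values[0],                  # per-feature contributions for this patient
    base_values=explainer.expected_value,   # E[f(x)] over the 25 background samples
    data=patient.values,
    feature_names=list(X.columns),
)
shap.plots.waterfall(explanation, max_display=15)  # one bar per feature, largest impacts first
```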

Clinical Interpretation Guide

1. Key Aspects to Consider

Relative Importance

  • Larger bars = stronger influence
  • Order of features indicates impact magnitude
  • Compare feature impacts within same patient

Direction of Impact

  • Blue (positive) = increases risk
  • Red (negative) = decreases risk
  • Direction is patient-specific

Interactions

  • Values include feature interactions
  • Total impact considers relationships between features
  • More reliable than looking at features in isolation

2. Common Patterns

High-Risk Pattern:

  • Multiple strong positive (blue) contributions
  • Few or weak negative (red) contributions
  • Final value significantly above 0.082

Low-Risk Pattern:

  • Strong negative (red) contributions
  • Weak or few positive (blue) contributions
  • Final value below 0.082

Technical Notes

Value Calculation:

$$\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f_x(S \cup \{i\}) - f_x(S)\right]$$

where:

  • $\phi_i(x)$ is the SHAP value for feature $i$
  • $F$ is the set of all features
  • $S$ represents feature subsets

Ensemble Integration:

  • Values calculated using maximum probability across models
  • 25 background samples for stable estimates
  • Maintains feature interaction effects

Best Practices for Interpretation

  • Always consider:
    • Clinical context
    • Patient-specific factors
    • Reliability of input data
    • Confidence in measurements
  • Remember:
    • Values are patient-specific
    • Contributions are relative to population average
    • Interpretations support, not replace, clinical judgment

Interactive Predictions

If the app does not load, please refresh the page or click here.