Chapter 22: Tree-Based Models and Interpretable ML

Johnson’s First Principle: The Black Box is Unacceptable

A deep neural network that outputs “Cancer: 95%” but cannot explain which genes drove the decision is practically useless in biology and clinical diagnostics. An algorithm that hides the biology cannot generate hypotheses, cannot be validated by domain experts, and cannot pass regulatory scrutiny.

Chapter 21 established the framework for machine learning in the \(p \gg n\) regime: bias-variance tradeoff, regularization, data leakage, and validation. That chapter was about how to avoid fooling yourself with high-dimensional data. This chapter is about which model to use once you’ve done that — and why tree-based models dominate tabular omics data, even as neural networks dominate sequence and image data.

Core Concepts

Decision Trees

A decision tree recursively partitions the feature space into rectangular regions. At each node, the algorithm selects the feature and threshold that minimize impurity:

Gini impurity: \(G = 1 - \sum_{c=1}^{C} p_c^2\), where \(p_c\) is the proportion of samples in class \(c\) at the node. Gini is minimized when all samples belong to one class (pure node) and maximized when classes are evenly split.
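The split search can be sketched in a few lines. This is a minimal illustration, assuming a single numeric feature and the common CART convention of scoring a split by the size-weighted Gini impurity of its two children; the function names and the toy expression values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity G = 1 - sum_c p_c^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Scan midpoints between sorted unique values.

    Returns (threshold, weighted child impurity). If no split improves on
    the parent node, the threshold is None.
    """
    n = len(values)
    uniq = sorted(set(values))
    best = (None, gini(labels))  # start from the unsplit parent node
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: expression of one gene and a binary label.
expression = [0.2, 0.5, 0.9, 1.4, 2.1, 2.8]
status = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

print(gini(status))                        # impurity of the evenly split parent
print(best_threshold(expression, status))  # threshold that separates the classes
```

A real tree repeats this search over every feature at every node and recurses into the two children.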

Pruning: Decision trees are prone to overfitting — a tree that grows until every leaf is pure has memorized the training data. Pruning (cost-complexity pruning) penalizes tree size: \(R_\alpha(T) = R(T) + \alpha|T|\), where \(R(T)\) is the training error, \(|T|\) is the number of terminal nodes, and \(\alpha\) controls the penalty.
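To see how \(\alpha\) trades training error against size, the sketch below scores three candidate subtrees with \(R_\alpha(T)\) and picks the minimizer. The subtree names, errors, and leaf counts are made-up numbers for illustration only.

```python
def cost_complexity(train_error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|."""
    return train_error + alpha * n_leaves

# Hypothetical nested subtrees: (name, training error R(T), leaf count |T|).
subtrees = [("full", 0.00, 12), ("medium", 0.05, 5), ("stump", 0.20, 2)]

def select_subtree(candidates, alpha):
    """Return the name of the subtree with the lowest penalized error."""
    return min(candidates, key=lambda t: cost_complexity(t[1], t[2], alpha))[0]

for alpha in (0.0, 0.01, 0.1):
    print(alpha, select_subtree(subtrees, alpha))
```

With no penalty the fully grown tree wins; as \(\alpha\) grows, progressively smaller subtrees are preferred. In practice \(\alpha\) is chosen by cross-validation rather than by hand.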

Random Forests

Random forests (Breiman, 2001) reduce variance by averaging many de-correlated trees. Two sources of randomness:

Bagging (bootstrap aggregating): Each tree is trained on a bootstrap sample (random sample with replacement) of the training data.
Random feature subsets: At each split, only a random subset of features (typically \(\sqrt{p}\) for classification, \(p/3\) for regression) is considered.

The out-of-bag (OOB) error is computed by predicting each training sample using only the trees whose bootstrap sample did not include it. It provides a nearly unbiased, typically slightly conservative, performance estimate without a held-out test set.
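The OOB estimate works because each bootstrap sample leaves out a substantial share of the training data. The following sketch, with hypothetical sample and tree counts, draws bootstrap samples and measures the fraction of samples left out of each one; the expected fraction is \((1 - 1/n)^n\), which approaches \(1/e \approx 0.37\).

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)           # fixed seed so the sketch is reproducible
n_samples, n_trees = 200, 100    # hypothetical sizes

oob_fractions = []
for _ in range(n_trees):
    in_bag = set(bootstrap_indices(n_samples, rng))
    oob = set(range(n_samples)) - in_bag   # samples this tree never saw
    oob_fractions.append(len(oob) / n_samples)

mean_oob = sum(oob_fractions) / n_trees
print(round(mean_oob, 3))
```

Each sample is therefore out-of-bag for roughly a third of the trees, and those trees act as its private test ensemble.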

This connects directly to the bias-variance tradeoff from Chapter 21. Random forests reduce variance by averaging: individual trees are high-variance (slight changes in training data produce very different trees), and averaging \(B\) trees that each have variance \(\sigma^2\) shrinks the ensemble variance toward \(\sigma^2 / B\) when the trees are uncorrelated, without increasing bias. Trees trained on overlapping data are never fully uncorrelated, and the remaining correlation sets a floor that adding more trees cannot remove, which is why the random feature subsets matter. Gradient boosting (below) reduces bias instead: each new tree targets the residuals the ensemble cannot yet fit, at the cost of potentially increasing variance if too many trees are added.
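The effect of correlation between trees can be made concrete. For \(B\) identically distributed predictors with variance \(\sigma^2\) and pairwise correlation \(\rho\), the standard result is that their average has variance \(\rho\sigma^2 + (1-\rho)\sigma^2/B\). The sketch below simply evaluates that formula; the correlation values are illustrative assumptions, not measurements from any real forest.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

for rho in (0.0, 0.3, 0.8):
    row = [round(ensemble_variance(1.0, rho, b), 3) for b in (1, 10, 100, 1000)]
    print(rho, row)
```

Only the uncorrelated case falls as \(\sigma^2/B\); otherwise the variance levels off at \(\rho\sigma^2\), so de-correlating the trees matters as much as adding more of them.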

Gradient Boosting

Gradient boosting (XGBoost, LightGBM, CatBoost) builds trees sequentially, where each new tree corrects the errors of the previous ensemble. In practice, each new tree is fit to the negative gradient of the loss function with respect to the current prediction, which approximates the stagewise minimization:

\[\text{Step } t: \quad f_t(x) = \arg\min_f \sum_{i=1}^n L(y_i, \hat{y}_i^{(t-1)} + f(x_i))\]

where \(L\) is the loss function (e.g., log-loss for classification, squared error for regression) and \(\hat{y}_i^{(t-1)}\) is the prediction after \(t-1\) trees.
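The stagewise update can be sketched from scratch. This is a minimal illustration, assuming squared-error loss (for which the negative gradient is simply the residual), a single feature, and depth-1 trees (stumps); it is not how XGBoost or LightGBM are implemented, and all function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    uniq = sorted(set(xs))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def boost(xs, ys, n_trees, learning_rate):
    """Fit n_trees stumps sequentially, each to the current residuals."""
    base = sum(ys) / len(ys)              # initial prediction: the mean
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For L = 0.5 * (y - pred)^2 the negative gradient is y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical toy regression problem: y = x^2 on a grid.
xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]

for n_trees in (0, 10, 100):
    print(n_trees, round(mse(boost(xs, ys, n_trees, 0.1), xs, ys), 4))
```

Training error falls as trees are added, which is exactly why the number of trees and the learning rate need to be tuned on validation data rather than training data.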

Key regularization hyperparameters:
- learning_rate (shrinkage): how much each new tree contributes (0.01-0.3)
- max_depth: tree complexity (typically 3-6 for boosted trees)
- subsample: fraction of data used per tree (prevents overfitting)
- min_child_weight: minimum sum of instance weights in a child node
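As a starting point, these hyperparameters might be collected in a plain dictionary like the one below. The parameter names follow the XGBoost scikit-learn wrapper; the specific values are illustrative assumptions within the ranges quoted above, not recommendations for any particular dataset.

```python
# Hypothetical starting configuration; tune with cross-validation.
boosting_params = {
    "learning_rate": 0.05,    # shrinkage applied to each new tree
    "max_depth": 4,           # shallow trees limit interaction order
    "subsample": 0.8,         # fraction of rows sampled per tree
    "min_child_weight": 5,    # minimum sum of instance weights in a child
}

for name, value in boosting_params.items():
    print(f"{name} = {value}")
```

A lower learning rate generally needs more trees to reach the same training loss, so these settings are best tuned together.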

Native missingness handling. Modern gradient boosting libraries such as XGBoost and LightGBM handle missing values without imputation. At each split, the algorithm learns a default direction for missing values, sending them to the left or right child according to which choice minimizes the loss on the training data. This is a major practical advantage over neural networks and linear models in omics, where missing data is common (failed assays, below-detection measurements, patient dropout in clinical covariates). The learned direction can capture non-random missingness patterns that imputation would destroy: for example, sicker patients systematically missing follow-up measurements is a clinical signal, not noise to be filled in.
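The idea of a learned default direction can be illustrated for a single split. The sketch below is a simplification: it uses squared error as the loss and a fixed threshold, whereas real libraries score splits with gradient statistics and search thresholds at the same time. The biomarker values and outcomes are made up, with `None` marking a missing measurement.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def missing_direction(xs, ys, threshold):
    """Try routing missing values left, then right; keep the lower-loss choice."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right

# Hypothetical data: patients with a missing biomarker have poor outcomes.
biomarker = [1.0, 1.2, 0.8, 3.1, 2.9, 3.3, None, None, None]
outcome = [0.1, 0.2, 0.0, 0.9, 1.0, 0.8, 0.95, 1.0, 0.9]

direction, loss = missing_direction(biomarker, outcome, threshold=2.0)
print(direction, round(loss, 4))
```

Here the missing samples resemble the high-biomarker group, so routing them to the right child gives the lower loss; mean imputation would instead place them near the middle of the distribution and blur that signal.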

Gradient boosting often outperforms random forests on tabular data when well tuned, but it is more sensitive to hyperparameter choices.

Categorical features: the CatBoost approach. Standard gradient boosting requires numerical encoding of categorical variables (one-hot for low cardinality, label or target encoding for high cardinality). Naive target encoding, which replaces each category with the mean label of its training samples, introduces target leakage: each training sample's own label feeds into its encoded feature, so the model trains on a more informative version of the feature than it will see at test time. CatBoost addresses this with ordered target statistics: each sample's category is encoded by the mean target value of the samples that precede it in a random permutation of the training data, preventing leakage while preserving categorical structure. For clinical variables (hospital site, treatment arm, genetic ancestry) and experimental metadata (batch, sequencing run), this native handling is often both more accurate and simpler than manual encoding.
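Ordered target statistics can be sketched as a single pass over a random permutation. This is a simplified illustration of the idea, not CatBoost's implementation: it uses one permutation and a smoothing prior with a hypothetical `prior_weight` parameter, and the site labels and outcomes are invented.

```python
import random

def ordered_target_stats(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each sample from the targets of samples earlier in a permutation.

    A sample's own target is added to the running totals only after its
    encoding has been computed, so it never leaks into its own feature.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [None] * len(categories)
    for i in order:
        cat = categories[i]
        running_sum = sums.get(cat, 0.0)
        running_count = counts.get(cat, 0)
        encoded[i] = (running_sum + prior_weight * prior) / (running_count + prior_weight)
        sums[cat] = running_sum + targets[i]
        counts[cat] = running_count + 1
    return encoded

# Hypothetical data: hospital site and a binary outcome.
site = ["A", "A", "B", "A", "C"]
outcome = [1, 0, 1, 1, 0]
prior = sum(outcome) / len(outcome)   # global mean used as the prior

encoded = ordered_target_stats(site, outcome, prior)
print(encoded)
```

Sites that appear only once ("B" and "C") receive the prior, because no earlier sample from the same site exists. A naive target mean would instead encode them with their own label, which is exactly the leakage described above.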

SHAP Values: Explainable Feature Importance

SHAP (SHapley Additive exPlanations) provides additive feature attribution based on Shapley values from cooperative game theory. For each prediction, SHAP decomposes it into:

\[\hat{y}_i = \phi_0 + \sum_{j=1}^p \phi_{ij}\]

where \(\phi_0\) is the baseline prediction (mean over training data) and \(\phi_{ij}\) is the contribution of feature \(j\) to prediction \(i\). Shapley values are the average marginal contribution of a feature across all possible feature coalitions — the only additive attribution method satisfying properties of local accuracy, missingness, and consistency.
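The definition can be checked by brute force on a toy model. The sketch below enumerates every coalition, so it is only feasible for a handful of features. It assumes that "absent" features are replaced by a single reference point rather than averaged over background data, which is a simplification of what SHAP does. The model `f`, the instance, and the reference point are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average marginal contribution per feature."""
    phis = []
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        phi = 0.0
        for size in range(len(others) + 1):
            # Probability that a coalition of this size precedes feature j.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi += weight * (value_fn(s | {j}) - value_fn(s))
        phis.append(phi)
    return phis

def f(x):
    """Hypothetical model: one main effect plus one pairwise interaction."""
    return 2 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
reference = [0.0, 0.0, 0.0]   # stand-in for the background distribution

def value_fn(coalition):
    """Model output with features outside the coalition set to the reference."""
    mixed = [instance[k] if k in coalition else reference[k] for k in range(3)]
    return f(mixed)

phi_0 = f(reference)
phis = shapley_values(value_fn, 3)
print(phi_0, phis, phi_0 + sum(phis), f(instance))
```

The baseline plus the three contributions reproduces the prediction for this instance, illustrating local accuracy, and the interaction term is shared equally between the two features that take part in it.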

SHAP plot types:
- Summary plot: Beeswarm showing feature importance and effect direction across all samples.
- Dependence plot: Feature value vs. SHAP value; reveals non-linear relationships and interaction effects.
- Force plot: Individual prediction decomposition (used for clinical decision support).

Biological Interpretation

Variable importance tells you which features matter, but not how. A gene may have high importance because high expression → high risk, or because low expression → high risk. SHAP dependence plots reveal non-linear relationships and interaction effects: a gene that is neutral at low expression but strongly predictive at high expression is biologically different from one with a linear effect.

An interaction detected by SHAP (e.g., gene A’s effect depends on gene B’s expression) generates a testable biological hypothesis: are the two genes co-regulated? Do they share a pathway? Do their proteins physically interact? Each SHAP interaction should be traceable to a known biological relationship.

For regulatory compliance (FDA, EMA), model explainability is increasingly required for clinical decision support. SHAP-based explanations can help meet this requirement for tree-based models, making gradient boosting with SHAP the default choice for clinically deployable ML.

Current Landscape (Q2 2026)

CatBoost (categorical feature-native boosting) handles clinical variables (sex, ethnicity, treatment arm) without manual one-hot encoding, preserving categorical structure.
Explainable boosting machines (EBM) provide additive interpretable models that are often competitive with XGBoost on tabular data while being fully decomposable into individual feature functions, making the model itself interpretable rather than requiring post-hoc explanation.
SHAP interaction values are being validated against known biology: a detected interaction should be traceable to a known pathway or physical interaction before being reported.
Regulatory guidelines (FDA, EMA) increasingly require model explainability for clinical decision support, making SHAP/EBM the default choice over black-box neural networks for clinical applications.

Summary and Required Reading

Decision trees partition feature space to minimize impurity — prone to overfitting without pruning.
Random forests average de-correlated trees (bagging + random feature subsets) for variance reduction.
Gradient boosting builds trees sequentially to correct residuals — dominant for tabular data (XGBoost, LightGBM, CatBoost).
SHAP values decompose predictions into additive feature contributions with theoretical guarantees — essential for biological interpretation.

Required Reading

Breiman: “Random Forests” (Machine Learning, 2001).
Lundberg & Lee: “A Unified Approach to Interpreting Model Predictions” (NeurIPS, 2017).

Johnson’s Rule: For tabular data, start with a tree. Neural networks will rarely beat gradient boosting.