Chapter 21: Machine Learning Fundamentals for Biology

Johnson’s First Principle: The Illusion of Memorization

A model trained on 50 patients with 20,000 genes can reach 100% training accuracy, even with randomly shuffled labels. When \(p > n\) and the samples are in general position, a linear separator of the training data exists for any labeling. This is not learning; it is exploiting excess degrees of freedom. The only metric that matters is generalization to unseen data.

Chapter 6 established the statistical framework for inference: hypothesis testing, multiple testing correction, and confounder management. That framework answers “is there a difference?” Machine learning answers a different question: “can I predict this outcome from these measurements?” The two are complementary — statistical inference identifies associations, machine learning tests whether those associations are predictive in independent data. A gene with p-value \(10^{-10}\) that does not generalize to a held-out cohort has statistical significance without practical utility.


Core Concepts

The Curse of Dimensionality (\(p \gg n\))

In bioinformatics, you typically have \(n = 50\) patients and \(p = 20{,}000\) genes. When \(p > n\), linear algebra guarantees trouble: \(n\) points in \(\mathbb{R}^p\) span an affine subspace of at most \(n-1\) dimensions, so the \(p + 1\) parameters of a hyperplane far outnumber the \(n\) constraints imposed by the labels. For points in general position and any binary labeling, even random coin flips, a linear separator of the training data exists. This is not learning; it is dimensionality exploitation.
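
The separability claim can be checked numerically. The sketch below is an illustration under stated assumptions, not part of any real analysis: the "expression matrix" is pure Gaussian noise, the labels are coin flips, and the hyperplane passes through the origin (no intercept) for simplicity. It solves \(Xw = y\) by least squares, which has an exact solution when \(p > n\) and the rows are linearly independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20_000                        # patients and genes, as in the text
X = rng.standard_normal((n, p))          # pure-noise stand-in for expression data
y = rng.choice([-1.0, 1.0], size=n)      # coin-flip labels, unrelated to X

# With p > n and linearly independent rows, X w = y has exact solutions;
# lstsq returns the minimum-norm one.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_accuracy = np.mean(np.sign(X @ w) == y)

# Apply the same weights to fresh noise with fresh coin-flip labels.
X_new = rng.standard_normal((200, p))
y_new = rng.choice([-1.0, 1.0], size=200)
new_accuracy = np.mean(np.sign(X_new @ w) == y_new)

print(f"training accuracy:       {train_accuracy:.2f}")
print(f"accuracy on new samples: {new_accuracy:.2f}")
```

The training labels are reproduced exactly, while accuracy on new samples sits near chance, because there was never anything to learn.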

The concentration of distances: In high dimensions, pairwise distances between samples become nearly equal. For many data distributions (independent features are the textbook case), the ratio of nearest to farthest neighbor distance approaches 1 as \(p \to \infty\). K-nearest neighbors degrades because the “nearest” neighbor is barely closer than the farthest one.
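
A small simulation makes the concentration effect visible. This is a sketch assuming independent standard-normal features, which is the idealized case; real expression data are correlated and concentrate more slowly. The function name `nearest_to_farthest_ratio` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of samples, as in the text


def nearest_to_farthest_ratio(p):
    """Mean over samples of (nearest neighbor distance / farthest neighbor distance)."""
    X = rng.standard_normal((n, p))
    # Pairwise squared Euclidean distances from the Gram matrix.
    sq_norms = np.sum(X**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    d = np.sqrt(np.clip(d2, 0.0, None))   # clip tiny negatives from round-off
    np.fill_diagonal(d, np.nan)           # exclude each point's distance to itself
    return float(np.nanmean(np.nanmin(d, axis=1) / np.nanmax(d, axis=1)))


ratios = {p: nearest_to_farthest_ratio(p) for p in (2, 20, 200, 20_000)}
for p, r in ratios.items():
    print(f"p = {p:>6}: nearest/farthest = {r:.2f}")
```

The ratio climbs toward 1 as \(p\) grows, which is why distance-based methods lose their footing in raw gene space.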

Minimum sample size heuristic: A common rule of thumb for a stable unregularized linear classifier is \(n \geq 10 \cdot p\), roughly ten samples per feature. With \(p = 20{,}000\), that means \(n \geq 200{,}000\) samples. Since this is rarely achievable in omics, aggressive dimensionality reduction to about \(p \leq n/10\) (5 features for 50 patients) is essential before any unregularized supervised learning.

Regularization is the primary solution to \(p > n\). L1 regularization (Lasso) drives irrelevant feature weights to exactly zero, performing embedded feature selection. L2 regularization (Ridge) shrinks all weights toward zero without eliminating features. Elastic Net combines both. In the \(p > n\) regime, an unregularized linear model can always fit the training labels perfectly, noise included, so it overfits. Regularization encodes a prior belief about the weights: L1 that most features are irrelevant, L2 that no single feature should carry a very large weight.
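
The difference between the two penalties shows up in how many weights survive. The sketch below uses scikit-learn's `Lasso` and `Ridge` on simulated data with a continuous outcome (regression rather than classification, to keep the example short). The data, the five planted features, and the `alpha` values are all assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, n_informative = 50, 2000, 5

X = rng.standard_normal((n, p))
true_w = np.zeros(p)
true_w[:n_informative] = 3.0                  # only the first 5 features matter
y = X @ true_w + rng.standard_normal(n)       # continuous outcome plus noise

lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty

n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))

print(f"non-zero weights, Lasso: {n_lasso} of {p}")
print(f"non-zero weights, Ridge: {n_ridge} of {p}")
```

Lasso keeps a small subset of features and sets the rest to exactly zero; Ridge keeps every feature with a shrunken weight. Whether the surviving Lasso features are the planted ones depends on signal strength and sample size, which is one reason selected gene lists are often unstable at \(n = 50\).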

The Bias-Variance Tradeoff

For squared-error loss, the expected prediction error decomposes into three components:

\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

  • Bias: Error from incorrect assumptions. A linear model fitted to non-linear data has high bias — it systematically misses the true relationship.
  • Variance: Error from sensitivity to training data. A deep decision tree has high variance: changing a small fraction of the training samples (say 5%) can produce a very different tree.
  • Irreducible error: Measurement noise that cannot be modeled away.

The tradeoff: as model complexity increases (more parameters, less regularization), bias decreases but variance increases. Total error is minimized at the complexity where the marginal decrease in squared bias equals the marginal increase in variance. In omics, where \(n\) is small, variance dominates unless controlled through regularization, dimensionality reduction, or simple models.

Regularization exploits this tradeoff directly. L1 regularization (Lasso) drives weights to exactly zero; this increases bias (the model cannot use those features at all) but collapses variance by eliminating the dimensions where noise dominates. L2 regularization (Ridge) shrinks all weights toward zero, increasing bias slightly (the constrained fit is less flexible) while reducing variance, most strongly among groups of correlated features whose individual weights are otherwise poorly determined. Elastic Net combines both. In the \(p > n\) regime where variance is extreme, even a small bias investment yields a large variance reduction, and the net effect is lower total error. Cross-validation selects the regularization strength \(\lambda\) that balances this tradeoff.
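
Written as optimization problems, the three penalties differ only in the term added to the loss. The notation below is an assumption introduced for this summary: \(\mathcal{L}(\beta)\) is the unpenalized training loss, \(\beta\) the weight vector, \(\lambda \geq 0\) the regularization strength, and \(\alpha \in [0, 1]\) the Elastic Net mixing parameter. Libraries differ in constant factors (for example, a factor of \(\tfrac{1}{2}\) on the squared term), so treat this as the general shape rather than any one package's exact definition.

\[\hat{\beta}_{\text{Lasso}} = \arg\min_{\beta} \; \mathcal{L}(\beta) + \lambda \lVert \beta \rVert_1, \qquad \hat{\beta}_{\text{Ridge}} = \arg\min_{\beta} \; \mathcal{L}(\beta) + \lambda \lVert \beta \rVert_2^2\]

\[\hat{\beta}_{\text{Elastic Net}} = \arg\min_{\beta} \; \mathcal{L}(\beta) + \lambda \left( \alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \right)\]

Setting \(\lambda = 0\) recovers the unregularized fit (minimum bias, maximum variance); increasing \(\lambda\) moves along the tradeoff toward higher bias and lower variance.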

Data Leakage: The Most Common Fatal Error

Data leakage occurs when information from the test set influences the training process. Common patterns:

  1. PCA before split: Computing PCA on all 50 patients, then splitting. PCA rotation was influenced by test data.
  2. Feature selection before split: Selecting differentially expressed genes using all samples, then splitting.
  3. Batch correction before split: Applying ComBat to full dataset. Test data influenced batch parameter estimates.
  4. Normalization before split: Estimating normalization parameters (size factors, reference distributions, means and variances) from all data including test.
  5. Feature scaling before split: Standardizing (z-scores) or min-max scaling on all data before splitting leaks test information into the scaling parameters.

Correct workflow: All preprocessing (PCA, feature selection, normalization, batch correction, feature scaling) must be fit on training data only and then applied, with those fitted parameters, to the test data. Regularization and PCA are both sensitive to feature scale: a gene measured in TPM (~0-10,000) will dominate a clinical variable like age (~0-100) if features are not standardized. Standard practice: transform each feature to z-scores using the training-set mean and standard deviation, then apply the same transformation to the test set. The scikit-learn Pipeline class enforces this ordering automatically inside cross-validation; written out by hand, the wrong and right orderings look like this:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: data leakage
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on ALL data, test rows included
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# RIGHT: no leakage
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on TRAIN only
X_test_scaled = scaler.transform(X_test)          # reuse TRAIN mean and std on TEST
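
The same principle applies to feature selection, where the damage is far larger than with scaling. The following sketch compares a leaky workflow (genes selected on all samples, then cross-validated) with a scikit-learn `Pipeline` that refits every step inside each training fold. The data are pure noise and the settings (2,000 features, top 20 selected, 5 folds) are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2000))      # noise features: nothing to learn
y = np.repeat([0, 1], 25)                # labels unrelated to X

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LEAKY: the 20 "best" genes are chosen using all 50 samples,
# so every held-out fold has already influenced the selection.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv
).mean()

# HONEST: scaling and selection are refit on the training part of each fold.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_acc = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"leaky  CV accuracy: {leaky_acc:.2f}")
print(f"honest CV accuracy: {honest_acc:.2f}")
```

The leaky estimate looks like a strong biomarker; the honest estimate hovers around chance, which is the correct answer for random data.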

Model Validation

Train/test split: The test set is used exactly once, at the end of all development. If you tune hyperparameters based on test performance, your reported performance is biased upward.

K-fold cross-validation: Partition data into \(K\) folds (typically 5 or 10), train on \(K-1\), evaluate on the held-out fold, repeat. Report mean \(\pm\) standard deviation.

Nested cross-validation: Required when tuning hyperparameters. The outer loop (e.g., 5-fold) holds out 20% of data as a test fold. The inner loop runs cross-validation on the remaining 80% to select optimal hyperparameters. A model is trained with those optimal parameters on the 80% and evaluated on the held-out 20%. This repeats for each outer fold; the average outer-loop performance is an approximately unbiased estimate of how the whole procedure (tuning included) performs on new data. Without nesting, hyperparameter optimization leaks information: tuning on the full dataset and reporting CV performance on the same data gives an optimistically biased estimate.
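
In scikit-learn, nesting falls out of passing a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The sketch below assumes simulated data from `make_classification` and an illustrative grid over the logistic-regression parameter `C`; neither is a recommendation for real omics data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for an omics matrix: 100 samples, 200 features.
X, y = make_classification(
    n_samples=100, n_features=200, n_informative=10, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose C using only the outer-training portion.
search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop: each held-out fold is scored by a model tuned without seeing it.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"nested CV ROC-AUC: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```

Different outer folds may select different values of `C`; the reported number evaluates the tuning procedure, not one fixed hyperparameter setting.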

Permutation test: Shuffle the labels, rerun the entire pipeline (preprocessing, feature selection, tuning, cross-validation), and record the score. Run 100+ permutations to establish the null distribution, which should center near AUC 0.5. If shuffled-label runs score consistently above 0.5, the pipeline is leaking information or detecting spurious structure, not biology. The empirical p-value is \((k + 1)/(m + 1)\), where \(k\) of the \(m\) permutations score at least as well as the real labels. Caveat: a model that passes the permutation test (performance exceeds chance on this dataset) may still fail external validation, because it may have learned dataset-specific artifacts (batch, hospital, sequencing protocol) rather than generalizable biology. The permutation test confirms the model learned structured patterns, not that those patterns are biological.
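
scikit-learn provides `permutation_test_score`, which shuffles the labels, reruns cross-validation, and returns the real score, the permuted scores, and an empirical p-value. The sketch below uses simulated data with a planted signal in the first 10 features; the effect size and sample size are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 500))
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.5                    # planted signal in 10 of 500 features

# Passing the whole pipeline means preprocessing is refit for every shuffle.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

score, perm_scores, p_value = permutation_test_score(
    pipe, X, y, cv=cv, scoring="roc_auc", n_permutations=100, random_state=0
)

print(f"real-label AUC:     {score:.2f}")
print(f"shuffled-label AUC: {perm_scores.mean():.2f} (mean of {len(perm_scores)})")
print(f"empirical p-value:  {p_value:.3f}")
```

With 100 permutations the smallest achievable p-value is \(1/101\), so more permutations are needed if a stricter threshold is required.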

Metrics: Why Accuracy Lies

In a dataset of 99 healthy and 1 diseased patient, a model predicting “healthy” for everyone achieves 99% accuracy. It is completely useless.

| Metric               | Definition                             | Use Case                  |
|----------------------|----------------------------------------|---------------------------|
| Precision (PPV)      | TP / (TP + FP)                         | Minimize false alarms     |
| Recall (Sensitivity) | TP / (TP + FN)                         | Minimize missed diagnoses |
| Specificity          | TN / (TN + FP)                         | True negative rate        |
| F1 Score             | \(2 \cdot \frac{P \cdot R}{P + R}\)    | Imbalanced classification |
| ROC-AUC              | TPR vs. FPR across thresholds          | Balanced classes          |
| PR-AUC               | Precision vs. Recall across thresholds | Rare positive class       |
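
The 99-to-1 example can be reproduced in a few lines with `sklearn.metrics`. This is a minimal sketch: the "model" is just an array of all-healthy predictions, and `zero_division=0` is passed so that precision is reported as 0 when there are no predicted positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([0] * 99 + [1])        # 99 healthy, 1 diseased
y_pred = np.zeros(100, dtype=int)        # predict "healthy" for everyone

accuracy = accuracy_score(y_true, y_pred)                        # 0.99
recall = recall_score(y_true, y_pred)                            # 0.0: the one patient is missed
precision = precision_score(y_true, y_pred, zero_division=0)     # 0.0: no positive calls at all
f1 = f1_score(y_true, y_pred, zero_division=0)                   # 0.0
specificity = recall_score(y_true, y_pred, pos_label=0)          # 1.0: every healthy patient correct

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f} "
      f"f1={f1:.2f} specificity={specificity:.2f}")
```

Accuracy and specificity look excellent; recall, precision, and F1 expose that the model never finds the diseased patient.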

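ROC-AUC caveat: AUC depends only on how the model ranks samples, so it can be high even with poorly calibrated probabilities. The Brier score is sensitive to calibration: \(\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{p}_i)^2\). Lower is better, and 0 means every prediction was both confident and correct. A model whose outputs all fall between 0.90 and 0.99 but still rank positives above negatives has high AUC and poor calibration.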
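
The gap between ranking and calibration can be shown by squeezing a set of good probabilities into a narrow, overconfident range. In this sketch the true event probabilities are simulated, and the transformation \(0.9 + 0.1\,p\) is an arbitrary assumption chosen because it preserves the ranking exactly.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
n = 2000
p_true = rng.uniform(0.05, 0.95, size=n)     # simulated true event probabilities
y = rng.binomial(1, p_true)                  # observed outcomes

calibrated = p_true                          # outputs equal the true probabilities
overconfident = 0.9 + 0.1 * p_true           # same ranking, squeezed into [0.9, 1.0]

auc_cal = roc_auc_score(y, calibrated)
auc_over = roc_auc_score(y, overconfident)
brier_cal = brier_score_loss(y, calibrated)
brier_over = brier_score_loss(y, overconfident)

print(f"calibrated:    AUC={auc_cal:.3f}  Brier={brier_cal:.3f}")
print(f"overconfident: AUC={auc_over:.3f}  Brier={brier_over:.3f}")
```

Both sets of outputs have the same AUC because the ordering of samples is unchanged, but the overconfident version has a much worse Brier score.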

Johnson’s Rule: If your classes are imbalanced (which they almost always are), read Precision-Recall, not ROC-AUC. The reason: ROC-AUC plots TPR against FPR, and the denominator of FPR is the number of negatives. When negatives vastly outnumber positives, a low FPR can still mean that false positives outnumber true positives, so a model with high ROC-AUC may produce mostly false alarms. Precision puts the predicted positives in its denominator, so every false positive counts directly against it. PR-AUC detects poor positive-class performance that ROC-AUC misses.
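
A worked example shows how the same point on the ROC curve translates into very different precision depending on prevalence. The operating point (TPR 0.80, FPR 0.05) and the cohort sizes are assumptions for illustration; `precision_at` is a helper defined here, not a library function.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a (TPR, FPR) operating point and the class counts."""
    tp = tpr * n_pos          # expected true positives
    fp = fpr * n_neg          # expected false positives
    return tp / (tp + fp)


# Same ROC operating point, two different class balances.
balanced = precision_at(0.80, 0.05, n_pos=1000, n_neg=1000)   # 800 / (800 + 50)
rare = precision_at(0.80, 0.05, n_pos=10, n_neg=1000)         # 8 / (8 + 50)

print(f"balanced cohort:   precision = {balanced:.3f}")   # 0.941
print(f"1% positive class: precision = {rare:.3f}")       # 0.138
```

The ROC curve is identical in both cohorts, yet in the rare-disease cohort roughly six out of seven positive calls are false alarms.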


Biological Interpretation

A model trained on 10 samples from one hospital is not a biomarker — it is a hypothesis. Data leakage is the most common undetected error in published ML biomarkers. The most frequent form: normalizing or selecting features using all data before splitting, which leaks information from the test set into training. A model that works in one cohort but fails in another has not learned biology — it has learned hospital-specific or cohort-specific artifacts.

External validation on an independent cohort is the only non-negotiable requirement for a clinically useful biomarker. Cross-validation on a single cohort is not sufficient — it cannot detect systematic bias (hospital-specific processing, population-specific allele frequencies, platform-specific error profiles).

The noise-as-signal problem is well-documented: with enough features, models achieve apparently perfect cross-validation on random data whenever feature selection or tuning is done outside the cross-validation loop. A permutation test of the full pipeline is the minimum sanity check. Even with proper validation, a model that achieves AUC 0.95 on a held-out test set from the same hospital may drop to AUC 0.55 when tested at a different hospital, because the features it learned were hospital-specific artifacts, not biology.


Current Landscape (Q2 2026)

  • The noise-as-signal problem in high-dimensional omics ML is well-documented: with enough features and no regularization, models can fit random data perfectly, and cross-validation looks just as good whenever feature selection or tuning leaks across folds. This has contributed to a replication crisis in ML-based biomarker studies.
  • Federated learning (training across hospitals without sharing raw data) enables multi-site training while patient-level data stay local, easing the fundamental \(n\) limitation by pooling information across institutions.
  • Conformal prediction provides prediction sets with a coverage guarantee that holds when calibration samples and new samples are exchangeable. This makes ML predictions more clinically actionable by quantifying uncertainty: a prediction is not just a class label but a set of possible outcomes with a stated confidence level.
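
To make the conformal idea concrete, here is a minimal sketch of split conformal prediction for a binary classifier. It is one common variant, not the only one: the data come from `make_classification`, the nonconformity score is one minus the predicted probability of the true class, and the target coverage of 90% (`alpha = 0.1`) is an assumption. The `method="higher"` argument to `np.quantile` requires NumPy 1.22 or later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=3000, n_features=20, n_informative=5, random_state=0
)
# Three disjoint parts: fit the model, calibrate the threshold, evaluate coverage.
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

alpha = 0.1  # target miscoverage: sets should contain the truth about 90% of the time

# Nonconformity score on calibration data: 1 - P(true class).
cal_proba = model.predict_proba(X_cal)
cal_scores = 1.0 - cal_proba[np.arange(len(y_cal)), y_cal]

# Finite-sample corrected quantile of the calibration scores.
n_cal = len(cal_scores)
level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
q_hat = np.quantile(cal_scores, level, method="higher")

# A class enters the prediction set if its score does not exceed the threshold.
test_proba = model.predict_proba(X_test)
prediction_sets = (1.0 - test_proba) <= q_hat        # shape (n_test, n_classes)
coverage = prediction_sets[np.arange(len(y_test)), y_test].mean()
mean_set_size = prediction_sets.sum(axis=1).mean()

print(f"empirical coverage: {coverage:.2f}, mean set size: {mean_set_size:.2f}")
```

The guarantee is marginal (averaged over patients) and relies on exchangeability, so it does not automatically transfer to a hospital whose data differ systematically from the calibration cohort.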

Summary and Required Reading

  1. Curse of dimensionality (\(p \gg n\)): with more features than samples, a linear separator of the training data exists for any labeling, so aggressive dimensionality reduction to \(p \leq n/10\) or regularization (L1/L2) is essential.
  2. The bias-variance tradeoff means total error = bias² + variance + irreducible error. In omics, variance dominates — regularization controls it by trading a small bias increase for a large variance reduction.
  3. Data leakage is the most common fatal error — all preprocessing (scaling, PCA, normalization, feature selection, batch correction) must be fit on training data only and transformed onto test data.
  4. Accuracy is meaningless with class imbalance; Precision, Recall, F1, and PR-AUC are the appropriate metrics. PR-AUC detects poor positive-class performance that ROC-AUC misses, because a low FPR can hide many false positives when negatives dominate.
  5. Permutation tests validate that performance exceeds chance — 100+ random label shuffles establish the null distribution. Passing the permutation test does not guarantee external validity.

Required Reading

  • Hastie, Tibshirani & Friedman: Elements of Statistical Learning (Ch 7 — Model Assessment and Selection).
  • Lever, Krzywinski & Altman: “Points of Significance: Model selection and overfitting” (Nature Methods, 2016).

Johnson’s Rule: A model trained on 10 samples from one hospital is not a biomarker. It is a hypothesis.