Chapter 5: Linear Algebra and Dimensionality Reduction
Johnson’s First Principle: Omics Data is a Mathematical Matrix
An RNA-seq experiment with 20,000 genes across 500 samples is not a spreadsheet — it is a mathematical matrix \(X \in \mathbb{R}^{20{,}000 \times 500}\). Every sample is a vector in 20,000-dimensional space. The human brain cannot visualize anything beyond three dimensions, so we must rely on linear algebra to rotate and compress this matrix into a lower-dimensional space where biological signal separates from technical noise.
The fundamental assumption that makes this possible is that biological variation is low-rank: most of the 20,000 genes are not independently varying. A handful of biological processes — cell type, treatment response, cell cycle, stress — drive the majority of transcriptional variation, and each process affects hundreds of genes in coordinated patterns. The observed data \(X\) is approximately the sum of a low-rank biological matrix \(L\) (rank \(k \ll 20{,}000\)) and noise \(N\):
\[X \approx L + N\]
Dimensionality reduction is the search for \(L\) — the low-dimensional structure hidden inside the high-dimensional noise.
Core Concepts
From Covariance to Eigenvectors
The first question dimensionality reduction answers is: “in which directions does the data vary most?” The answer comes from the covariance matrix \(\Sigma\), where \(\Sigma_{ij}\) is the covariance between genes \(i\) and \(j\) across all samples. \(\Sigma\) captures the coordinated variation of all genes simultaneously.
An eigenvector \(\vec{v}\) of the covariance matrix satisfies:
\[\Sigma \vec{v} = \lambda \vec{v}\]
Interpretation: \(\vec{v}\) is a direction in gene expression space, and \(\lambda\) (the eigenvalue) is the amount of variance captured by that direction. The eigenvectors with the largest eigenvalues are the axes of maximum biological variation. The eigenvector with the largest eigenvalue is the single direction that captures the most variance; the second-largest is the next, orthogonal to the first; and so on.
This is the mechanical basis of dimensionality reduction: instead of examining 20,000 individual gene dimensions, we project the data onto the top \(k\) eigenvectors, which capture the most variance per axis.
Singular Value Decomposition (SVD)
SVD is a theorem proving that any matrix \(X \in \mathbb{R}^{n \times p}\) can be factored into three simpler matrices:
\[X = U D V^T\]
where: - \(U \in \mathbb{R}^{n \times n}\) contains left singular vectors — gene patterns (how genes co-vary) - \(V^T \in \mathbb{R}^{p \times p}\) contains right singular vectors — sample patterns (how samples relate) - \(D\) is a diagonal matrix of singular values \(d_1 \geq d_2 \geq \dots \geq d_k \geq 0\), each representing the magnitude of variation along the corresponding axis
(The SVD is often written \(U \Sigma V^T\) in other texts, but this chapter reserves \(\Sigma\) for the covariance matrix — the two are mathematically distinct, and using different symbols avoids confusion.)
The singular values are the bridge between SVD and the covariance matrix. The eigenvalues of \(X^T X\) (the sample covariance matrix, up to scaling) are the squares of the singular values: \(\lambda_i = d_i^2\). This means SVD simultaneously computes the directions of maximum variance (via \(V\)) and the projection of the data onto those directions (via \(UD\)).
Full vs. thin SVD. The description above gives the full SVD where \(U\) is \(n \times n\) and \(V\) is \(p \times p\). In practice, for \(n \gg p\) (more genes than samples), only the first \(p\) singular values are non-zero, and computing the full \(n \times n\) matrix \(U\) is wasteful. The thin SVD computes only the first \(k = \min(n, p)\) columns of \(U\) and rows of \(V^T\), producing \(U \in \mathbb{R}^{n \times k}\), \(D \in \mathbb{R}^{k \times k}\), and \(V^T \in \mathbb{R}^{k \times p}\). This is what svd(X) returns by default in R and scipy.linalg.svd with full_matrices=False in Python. For a 20,000-gene, 500-sample matrix, thin SVD computes a \(20{,}000 \times 500\) matrix \(U\) instead of \(20{,}000 \times 20{,}000\) — a 40x memory saving.
The Eckart-Young theorem is the reason SVD is the foundation of dimensionality reduction: the truncated SVD that keeps only the top \(d\) singular values gives the lowest possible reconstruction error of any rank-\(d\) approximation. In other words, SVD provides the optimal linear compression — no other linear method can preserve more information in \(d\) dimensions:
svd_result <- svd(X)
variance_explained <- svd_result$d^2 / sum(svd_result$d^2)
# Rank-2 approximation retains as much variance as mathematically possible
d <- 2
X_2 <- svd_result$u[, 1:d] %*% diag(svd_result$d[1:d]) %*% t(svd_result$v[, 1:d])
Principal Component Analysis (PCA)
PCA is SVD applied to a centered data matrix (the mean of each gene subtracted), which ensures that variance is measured around the biological baseline rather than around the origin:
- Center: \(X_{c, ij} = X_{ij} - \bar{X}_{i\cdot}\)
- Run SVD: \(X_c = U D V^T\)
- Principal component scores (sample positions in PC space): \(Z = U D\)
- Loadings (gene weights defining each PC): \(V\)
pca_result <- prcomp(X, center = TRUE, scale. = FALSE)
# pca_result$x: PC scores (samples × PCs)
# pca_result$rotation: Loadings (genes × PCs)
The PCA is optimal in the sense defined by Eckart-Young: the first PC captures the maximum possible variance in one dimension, the second captures the maximum remaining variance orthogonal to the first, and so on. This guarantee is what makes PCA the default starting point for any omics analysis.
The scree plot. The variance explained by each PC, plotted in descending order, is called a scree plot. The plot answers the question: “how many dimensions does the biological signal occupy?” If the plot shows a clear elbow — a sharp drop followed by a long, flat tail — the components before the elbow capture biological signal and those after capture noise. However, many real datasets show no clear elbow. This does not mean “no signal”; it means the signal is distributed across many components, which is expected for polygenic traits, branching developmental processes, or data with multiple overlapping sources of variation.
How many PCs to keep. There is no universal answer, but three heuristics guide the choice: - Cumulative variance threshold: keep enough PCs to explain 50-80% of total variance. In most omics datasets, this requires 10-50 PCs. The exact threshold depends on the downstream task — clustering benefits from more PCs (capturing subtle structure), while visualization requires only 2-3. - Parallel analysis: compare the observed eigenvalues to those from a null matrix (random data of the same dimensions). PCs with eigenvalues exceeding the null distribution are retained — this is more principled than the elbow heuristic but computationally heavier. - Purpose-driven: for downstream use (batch correction, UMAP initialization), keep 30-50 PCs regardless of the elbow. The additional PCs contribute small but cumulatively meaningful variance, and modern methods are robust to including a few noise-dominated dimensions.
Beyond PCA: MDS and LDA
Multi-Dimensional Scaling (MDS) generalizes PCA to arbitrary distance metrics. While PCA uses Euclidean distance implicitly (via covariance), MDS can preserve Bray-Curtis distances (standard for microbiome data), identity-by-state distances (population genetics), or any user-defined dissimilarity. Classical MDS (PCoA) with Euclidean distance is equivalent to PCA. The tradeoff: MDS with non-Euclidean distances can reveal structure invisible to PCA, but the axes lack the direct variance interpretation that makes PCA loadings interpretable.
Linear Discriminant Analysis (LDA) is supervised: it finds axes that maximize the ratio of between-group to within-group variance, separating known groups optimally. LDA is not a dimensionality reduction method for exploration — it is a classification method that happens to produce a low-dimensional projection. With \(p \gg n\) (more genes than samples), LDA overfits catastrophically and should be applied to PCA scores, not raw genes.
Both MDS and LDA should be used after PCA, not instead of it. PCA compresses the data to a tractable number of dimensions (typically 50); MDS or LDA on those scores adds specific analytical value (non-Euclidean distances or supervised separation) that PCA cannot provide.
Biological Interpretation
Color your PCA by batch before you color it by biology. If batch effects dominate variance, PC1 = batch, not biology. This is the single most important diagnostic step in any omics analysis. If PC1 separates sequencing batches but not experimental conditions, the data has a confound that must be addressed through batch correction (ComBat, limma) before any biological interpretation. This diagnostic is so important that it has its own chapter (Chapter 6).
PCA is linear. Biological relationships that are non-linear in expression space — developmental trajectories, cyclic processes (cell cycle), branching differentiation — will not be captured by the first few principal components. A scree plot with no clear elbow does not mean “no biological signal” — it may mean the signal is non-linear. For such data, non-linear methods (t-SNE, UMAP) are appropriate, but always start with PCA to compress to 50 dimensions first.
Loading vectors are biologically interpretable. The genes with the largest absolute loadings on PC1 are the primary drivers of the largest transcriptional variation in the dataset. In breast cancer data, PC1 loadings are dominated by ER signaling genes (ESR1, GATA3, FOXA1). In a study with a strong batch effect, PC1 loadings may be dominated by ribosomal or mitochondrial genes, indicating technical artifact. Examining the top 20 loadings of PC1 and PC2 should be a routine step in any PCA analysis — it connects the mathematical projection back to the biology.
Current Landscape (Q2 2026)
- Contrastive PCA (cPCA) identifies axes of variation specific to a target dataset compared to a control dataset, outperforming standard PCA for finding disease-specific signal while removing background variation from the control.
- Sparse PCA and NSFH (non-negative matrix factorization alternatives) improve interpretability by enforcing sparsity on loadings, making each component driven by a smaller, more interpretable gene set.
- Randomized SVD algorithms (sklearn’s
randomized_svd) enable PCA on million-sample single-cell datasets without loading the full matrix into memory, using random projections to approximate the top singular values. - Multi-omic factorization (MOFA+) extends PCA-like decomposition to multiple data modalities simultaneously, extracting factors that explain variation across RNA, ATAC, protein, and methylation layers.
Summary and Required Reading
- Omics data is a matrix — biological variation is low-rank. SVD decomposes \(X = U D V^T\), providing the optimal rank-\(k\) approximation (Eckart-Young theorem).
- PCA = SVD on centered data — PC scores = \(U D\) (sample positions), loadings = \(V\) (gene weights).
- Singular values connect to covariance — \(\lambda_i = d_i^2\) are the eigenvalues of \(X^T X\), and the eigenvectors identify axes of maximum variance.
- The scree plot estimates intrinsic dimensionality — look for the elbow. No elbow means the signal is distributed or non-linear.
- Color by batch before coloring by biology — if PC1 separates batches, the data has a confound. Examine loading vectors to interpret what each PC represents.
- PCA is linear — use it for compression to 50 dimensions before non-linear methods (t-SNE, UMAP).
Required Reading
- Strang: Linear Algebra and Learning from Data (Ch 1-3) — SVD and PCA.
- Alter, Brown & Botstein: “Singular value decomposition for genome-wide expression data processing and modeling” (PNAS, 2000).
Johnson’s Rule: Always color your PCA by batch before you color it by biology. If you skip this step, you have no idea whether PC1 is biology or a confound.