Chapter 15: Single-Cell RNA-seq and Dimensionality Escape

Published

June 5, 2026

Modified

June 19, 2026

Johnson’s First Principle: Bulk RNA-seq is a Fruit Smoothie

Bulk RNA-seq crushes 100,000 cells into a tube and sequences the average. A T-cell and a neuron co-ingested in a bulk sample produce an average expression profile that corresponds to no biological reality. Cell types are discrete physical identities, and averaging destroys the signal of rare but biologically critical populations.

Single-cell RNA-seq escapes this averaging problem but creates three new ones in its place: sparsity (90% of entries are zero), dimensionality (30,000 genes measured per cell), and ambiguity of measurement units (UMI counts vs. read counts vs. molecules). The statistical challenge of single-cell analysis is not the sequencing; it is recovering biological structure from a data matrix where most entries are zero, most genes carry no signal, and the unit of replication is a sample, not a cell.

Core Concepts

Microfluidics and Barcoding: The 10x Chromium

Three fluid streams converge in a microfluidic channel: cells in suspension, barcoded hydrogel beads, and oil. At the junction, droplets form spontaneously; each GEM (Gel Bead-In-Emulsion) contains approximately 0 or 1 cell plus 1 bead carrying a unique barcode.

Poisson loading. The number of cells per droplet follows a Poisson distribution. To minimize doublets (two cells in one droplet), the cell concentration is kept low so that most GEMs contain 0 cells. The fraction of GEMs containing at least one cell is \(P(\geq 1) = 1 - e^{-\lambda}\). At \(\lambda = 0.1\), only ~9.5% of GEMs have a cell, but among those, the doublet rate \(P(\geq 2|\geq 1) \approx \lambda/2 \approx 5\%\), meaning ~95% of occupied GEMs are singlets. This is a deliberate tradeoff: lowering the cell concentration reduces doublets but also reduces the number of cells captured per run. Standard 10x runs target ~10,000 cells with an expected doublet rate of ~4-8% at \(\lambda \approx 0.05\)-\(0.1\).

UMI collapse: from reads to molecules. Each mRNA molecule receives a unique molecular identifier (UMI) (a random 10-12 bp sequence) during reverse transcription. After sequencing, reads sharing the same cell barcode, gene, and UMI are collapsed into a single count. This is conceptually transformative: 10 sequencing reads with the same UMI represent 1 original mRNA molecule, not 10. Without UMIs, PCR duplicates inflate apparent expression, and the count matrix reflects amplification history as much as biological abundance.

The consequence is that scRNA-seq count distributions differ from bulk RNA-seq in a subtle but important way. Chapter 4 established that bulk RNA-seq counts follow a Negative Binomial distribution: \(\text{Var}(Y) = \mu + \alpha \mu^2\), where the quadratic term \(\alpha \mu^2\) captures both technical (amplification) and biological (cell-to-cell) variation. In scRNA-seq with UMIs, amplification noise is largely removed by UMI collapse (a single molecule either is captured and receives a UMI or is not. This makes the capture process approximately Poisson: \(\text{Capture} \sim \text{Poisson}(\lambda_{\text{capture}} \cdot \text{mRNA}_{\text{cell}})\). But the biological parameter \(\lambda\) itself varies across cells) some cells express a gene at 10 copies, others at 0. If \(\lambda\) follows a Gamma distribution across the cell population (the natural model for cell-to-cell expression variability), then the mixture of \(\text{Poisson}(\lambda)\) with Gamma-distributed \(\lambda\) yields exactly the Negative Binomial from Chapter 4: \(\text{Var}(Y) = \mu + \alpha\mu^2\). The critical difference from bulk RNA-seq: the dispersion \(\alpha\) in scRNA-seq reflects biological variability alone (amplification noise is removed by UMI collapse), while in bulk RNA-seq \(\alpha\) captures both technical amplification noise and biological variability. The UMI removes the amplification variance term but not the biological variance.

The Sparsity Crisis

Single-cell data is ~90% zero entries. A gene expressed at 10 mRNA copies per cell in bulk RNA-seq may appear in only 1-10% of individual cells. Understanding why requires distinguishing two sources of zeros:

Technical dropout: The mRNA molecule existed in the cell but was not captured (failed to enter the droplet, was not reverse-transcribed, or was not sequenced). At typical capture efficiencies of 10-30%, a cell expressing a gene at 10 copies will on average capture 1-3 molecules and miss 7-9.
Biological zeros: The gene is genuinely not expressed in this cell; the transcription rate is zero in that cell type or state.

The distinction matters because they require different handling. Technical dropout is a sampling problem (increase capture efficiency or sequencing depth), while biological zeros are the signal of cell-type specificity. Most scRNA-seq analysis does not attempt to separate them explicitly; instead, clustering methods rely on the observation that highly expressed cell-type marker genes have sufficient capture to define clusters, while dropout-affected genes simply contribute noise that the dimensionality reduction step filters out.

Capture efficiency and its consequences. Standard 10x Chromium captures ~10-30% of mRNA molecules per cell. This means for a gene with 10 mRNA copies per cell, the expected observed count is 1-3 UMI counts in expressing cells, and zero in 70-90% of cells that do express it. The expected fraction of cells with at least one UMI is \(1 - e^{-\lambda}\), where \(\lambda\) is the true mean expression multiplied by capture efficiency. At \(\lambda = 1\), ~63% of cells show at least one count; at \(\lambda = 0.1\), only ~9.5% do. The observed sparsity is thus a direct consequence of the physical capture efficiency; it is not a modeling choice.

The zero-inflated Negative Binomial (ZINB) models both sources of zeros as a mixture: a point mass at zero (dropout) plus an NB distribution for the expressed component. However, debate continues over whether zero-inflation is needed at all, with UMIs and moderate expression levels, the observed zeros may be adequately explained by Poisson sampling alone, without an explicit zero-inflation component. The practical implication: ZINB-based normalization methods (SCTransform, ZINB-WaVE) are standard but should be validated against simpler NB models for the specific dataset.

From Counts to Structure: The Dimensionality Escape

A typical scRNA-seq experiment produces a count matrix of ~30,000 genes \(\times\) ~10,000 cells (300 million entries, 90% of which are zero. Working directly in 30,000-dimensional space is computationally intractable and statistically meaningless (the curse of dimensionality: distances between points converge in high dimensions). The solution is the dimensionality escape) successive compression steps that reduce the data to a biologically interpretable low-dimensional representation.

Step 1: Feature selection. Most of the 30,000 genes are not informative. The first step selects genes with high variance relative to their mean (dispersion), typically 2,000-5,000 highly variable genes (HVGs). The rationale: a gene expressed uniformly at 1 count per cell across all cells carries no information about cell identity; a gene with mean 1 count and variance 10 is biologically informative. Feature selection reduces the matrix from 30,000 \(\times\) 10,000 to 3,000 \(\times\) 10,000 while retaining the biologically relevant dimensions.

Step 2: Normalization. Raw UMI counts must be normalized to account for differences in sequencing depth across cells. A cell with 50,000 total UMIs and a cell with 5,000 total UMIs will have systematically different counts for every gene, ignoring this confounds biological variation with technical depth.

Library-size normalization (scran): Compute size factors per cell by pooling cells with similar expression profiles, then normalize counts by dividing by the size factor. This is the scRNA-seq analogue of DESeq2’s median-of-ratios normalization from Chapter 4.
SCTransform: Fits a regularized Negative Binomial (ZINB) model per gene, regressing out sequencing depth. The Pearson residuals from the model are the normalized expression values. SCTransform handles the mean-variance relationship explicitly, removing the need for separate HVG selection; it replaces Steps 1 and 2 with a single regularized regression.
Log-normalization: \(\log(1 + \text{CPM})\); simple and fast, but distorts the mean-variance relationship and is not recommended as the sole normalization step.

Recommendation: SCTransform is the best default for most single-cell analyses; its ZINB-based residuals handle dropout and overdispersion jointly, remove the need for separate HVG selection, and are well-behaved for PCA. Use scran’s pooling-based size factors when you need to preserve absolute expression levels for cross-condition differential expression, where residual-based values can be harder to interpret.

The normalization choice determines what signal the downstream steps see. Under-normalization leaves depth confounders in the data, and the first principal components will capture sequencing depth rather than biology. Over-normalization (aggressive regression of all technical factors) can remove biological signal correlated with library complexity, such as cell-cycle effects or activation states.

Step 3: PCA compression. The normalized matrix (3,000 genes \(\times\) 10,000 cells) is reduced to ~50 principal components. Why does PCA work on a sparse, zero-inflated matrix? Because genes are correlated (the covariance structure exists in the sparse matrix because co-expressed genes vary together across cells. A gene expressed in only 5% of cells contributes to the covariance when it is expressed; the zeros simply add noise that the top PCs average out. The first PC typically captures a dominant source of variation (cell type, cell cycle, or technical artifact) inspect it before trusting it). The number of PCs to retain is chosen by the elbow heuristic: plot the variance explained per PC and keep the PCs before the curve flattens. For 10x data, this is typically 15-50 PCs. Too few PCs discard rare cell types; too many introduce noise that degrades clustering. PCA compresses 3,000 dimensions to ~50 while preserving the structure needed for clustering.

Step 4: Graph construction and clustering. The 50-dimensional PCA embedding is used to construct a K-nearest neighbor (KNN) graph: each cell is a node connected to its \(k\) nearest neighbors (typically \(k = 10\)-\(30\)) by Euclidean distance in PCA space. The graph is the key data structure because it captures neighborhood relationships without requiring cells to be close in the original 30,000-dimensional space; two cells of the same type that are far apart in gene space can be connected through intermediate cells.

The Leiden algorithm optimizes modularity on this graph by moving cells between communities until the partition is stable. The resolution parameter controls cluster granularity: higher resolution produces more clusters, but each cluster must be validated biologically. The algorithm will find clusters in random noise; cluster existence is a property of the algorithm, not necessarily of the data.

UMAP: Visualization, Not Inference

UMAP (Uniform Manifold Approximation and Projection) builds a fuzzy simplicial complex from the KNN graph in PCA space and optimizes a low-dimensional (2D) embedding. Compared to t-SNE, UMAP preserves more global structure, is computationally faster, and produces more deterministic embeddings. The key parameters: n_neighbors (higher = more global structure preserved) and min_dist (lower = tighter cluster packing).

Critical caveats: - UMAP is a visualization, not a clustering. Different random seeds produce different visual arrangements even with the same data. - Cluster sizes and distances in UMAP are not interpretable; the algorithm can produce visually separated clusters from continuous gradients. - The same UMAP parameters applied to different random seeds or subsamples produce visually different plots, yet the biological interpretation should be the same.

t-SNE (t-distributed Stochastic Neighbor Embedding) is an older alternative that converts high-dimensional distances to probabilities and minimizes KL divergence. t-SNE preserves local structure well but has additional limitations: cluster positions and sizes are not interpretable, perplexity choice is non-obvious, and different random seeds produce substantially different outputs.

Doublet Detection

Doublets (two cells in one droplet) produce expression profiles that are combinations of two distinct cell types. They appear as intermediate cells in UMAP and form spurious clusters. Computational detection (Scrublet, DoubletFinder): - Simulates doublets from random pairs of real cells - Trains a classifier to distinguish simulated vs. observed cells based on the adjacency of their expression profiles to other cells - The expected doublet rate is typically 3-8% per 10x channel

Doublets cannot be completely eliminated by loading concentration. Even at optimal loading, physical doublets occur. Joint analysis of combined channels amplifies the problem: a doublet between a T-cell and a macrophage in one channel appears as a valid intermediate state unless detected and removed.

Biological Interpretation

Never trust a cluster you cannot name. If you cannot find known marker genes for a cluster, it may be a doublet, a batch effect, or a technical artifact. UMAP is a visualization, not a clustering; different random seeds produce different visual arrangements, and the resolution parameter is a tuning decision that must be justified biologically, not computationally.

Pseudo-bulk aggregation (summing counts across all cells of a type per sample) restores the unit of replication; the biological sample, not the cell. Treating each cell as an independent replicate inflates statistical power by orders of magnitude and produces irreproducible results. Differential expression from single-cell data must use pseudo-bulk approaches or mixed models that account for the sample-level correlation structure. A dataset with 3 cases and 3 controls, each with 5,000 cells, has 6 replicates, not 30,000.

Cluster annotation is the most subjective step in scRNA-seq analysis. Expression of a single canonical marker (e.g., CD3D for T-cells) is suggestive but not definitive; CD3D fragments may align to a different gene in a different species, or a macrophage may have picked up T-cell mRNA from phagocytosis. Multi-marker validation, cross-reference with published atlases, and consistency across samples are required for rigorous annotation.

Summary and Required Reading

Microfluidic barcoding + UMIs enable molecular counting of 10,000+ cells per experiment, eliminating PCR bias. The capture process is approximately Poisson; remaining biological overdispersion motivates NB models.
The dimensionality escape proceeds in four steps: feature selection (HVGs) → normalization (scran/SCTransform) → PCA (50 dimensions) → KNN graph + Leiden clustering. Each step compresses the representation toward biological structure.
Sparsity (90% zeros) combines technical dropout (low capture efficiency) and biological zeros (cell-type specificity). ZINB models attempt to separate them, but debate continues on whether explicit zero-inflation is necessary with UMIs.
PCA on sparse data works because genes are correlated; the covariance structure exists, and the top PCs extract the dominant biological signals.
Pseudo-bulk analysis is required for valid differential expression; cells within a sample are not independent replicates.
Cluster annotation requires multi-marker validation: a single marker gene is not sufficient for cell-type identification.

Required Reading

Zheng et al.: “Massively parallel digital transcriptional profiling of single cells” (Nature Communications, 2017).
McInnes et al.: “UMAP: Uniform Manifold Approximation and Projection” (Journal of Open Source Software, 2018).
Luecken & Theis: “Current best practices in single-cell RNA-seq analysis” (Molecular Systems Biology, 2019).

Johnson’s Rule: Never trust a cluster you cannot name.