Chapter 4: Probability Distributions in Sequencing
Johnson’s First Principle: Read Counts are Random Variables
If you take a tube of RNA, prepare two libraries, and sequence them identically, you will not get the same number of reads for any gene. A gene that yields 450 reads in one library might yield 420 in another. This is not measurement error in the usual sense — the sequencer did not malfunction. The difference arises from random sampling: from a pool of billions of RNA fragments, the sequencer captures a tiny, random subset.
A novice treats 450 as a fixed measurement. A professional recognizes it as a single observation of a random variable drawn from an underlying probability distribution. Before you can declare a gene “upregulated,” you must prove that the observed difference exceeds the expected noise of that distribution. And before you can do that, you must know which distribution governs read counts.
Core Concepts
Discrete Data Requires Discrete Distributions
Sequencing produces counts. A gene is either observed 0, 1, 2, … times. It cannot be observed 450.7 times. This discreteness rules out the most familiar distribution in statistics — the Gaussian — for two reasons:
- The Gaussian assigns probability density to impossible values: negative counts and fractional counts.
- The Gaussian assumes variance is independent of the mean, but count data exhibits a systematic mean-variance relationship: genes with higher expression have larger absolute variance.
Every distribution used in RNA-seq analysis must respect the discrete, non-negative, integer-valued nature of the data. This constraint narrows the candidate distributions dramatically.
The Poisson Process: Modeling Sampling Noise
A sequencing experiment is a physical sampling process. From a homogenized pool containing billions of RNA fragments, the sequencer withdraws a subsample and reads each captured fragment. This is, by construction, a Poisson process: independent events occurring at a constant average rate.
If a gene constitutes a fraction \(p\) of the total RNA pool and the sequencer draws \(N\) total reads, the expected count is \(\lambda = pN\). The probability of observing exactly \(k\) reads from that gene follows the Poisson distribution:
\[P(k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
The defining property of the Poisson is that variance equals the mean:
\[\text{Var}(X) = E[X] = \lambda\]
This means: if you resequence the same library (a technical replicate), the counts for a given gene should scatter around the mean with variance approximately equal to the mean. For technical replicates, the Poisson fits remarkably well — the only source of variation is the random sampling inherent in the sequencing process.
Johnson’s Rule: The Poisson is the correct model for technical replicates. It is almost always the wrong model for biological replicates.
Overdispersion: When Variance Exceeds the Mean
Now consider three biological replicates — three different mice from the same treatment group — with expression counts for a single gene:
\[\text{Replicate counts: } 100, 300, 500\] \[\text{Mean } \mu = 300, \quad \text{Variance } \sigma^2 = 40{,}000\]
The Poisson predicts variance \(= 300\). The actual variance is 133 times larger. This is overdispersion: \(\sigma^2 \gg \mu\).
Where does this extra variance come from? The three mice do not have identical biological states. Their gene expression levels differ due to genetic background, environmental history, stochastic cellular noise — all the sources of biological variation that make organisms distinct. The sequencing step adds Poisson sampling noise on top of this biological variation.
Total variance decomposes into two components:
\[\text{Var}(\text{observed count}) = \text{Var}(\text{biological signal}) + \text{Var}(\text{sampling noise})\]
The biological signal itself varies across replicates. The sampling noise — which is Poisson — operates on each replicate’s individual biological state. A distribution that accounts for both components must allow the variance to exceed the mean by an adjustable amount.
The Negative Binomial: Poisson with a Random Rate
The Negative Binomial (NB) distribution adds exactly one parameter — the dispersion \(\alpha\) — to capture overdispersion. Its critical variance formula is:
\[\text{Var}(Y) = \mu + \alpha\mu^2\]
When \(\alpha = 0\), the NB collapses to the Poisson. As \(\alpha\) increases, the quadratic term \(\alpha\mu^2\) rapidly dominates, allowing the variance to far exceed the mean.
The NB emerges naturally from a Poisson-Gamma mixture. Imagine the biological process as follows:
- Each biological replicate \(i\) has its own underlying expression rate \(\lambda_i\) for a given gene. These rates vary across replicates because the organisms are not identical.
- The \(\lambda_i\) values follow a Gamma distribution with mean \(\mu\) and variance \(\alpha\mu^2\). The Gamma is the natural choice because it is continuous, always positive, and can take a wide range of shapes depending on \(\alpha\).
- Each replicate’s observed count \(Y_i\) is drawn from \(\text{Poisson}(\lambda_i)\) — that is, the sequencer samples from that replicate’s RNA pool, and the count is Poisson-distributed around that replicate’s true rate.
Aggregating across replicates, the observed counts follow the NB. The variance decomposition uses the Law of Total Variance — the identity that splits the total variance of a hierarchical process into expected within-group variance plus between-group variance:
\[\text{Var}(Y) = \underbrace{E[\text{Var}(Y|\lambda)]}_{\text{within-replicate (Poisson noise)}} + \underbrace{\text{Var}(E[Y|\lambda])}_{\text{between-replicate (biological variation)}}\]
Applying this to the Poisson-Gamma hierarchy makes the \(\alpha\mu^2\) term transparent:
\[\text{Var}(Y) = E[\lambda] + \text{Var}(\lambda)\] \[\text{Var}(Y) = \mu + \alpha\mu^2\]
The quadratic term \(\alpha\mu^2\) arises because biological variation scales with expression level: a highly expressed gene has more “room” for its rate to vary across replicates than a lowly expressed one.
The full NB probability mass function is:
\[P(Y = y) = \frac{\Gamma(y + 1/\alpha)}{y! \, \Gamma(1/\alpha)} \left(\frac{1/\alpha}{1/\alpha + \mu}\right)^{1/\alpha} \left(\frac{\mu}{1/\alpha + \mu}\right)^y\]
While the equation appears formidable, the practical contribution of \(\alpha\) reduces to a single intuition: it controls how much biological noise inflates the variance beyond what sampling alone would produce.
The table below quantifies this for a gene with mean expression \(\mu = 100\):
| \(\alpha\) | Meaning | Variance (\(\mu=100\)) | \(\sigma\) | Typical CV |
|---|---|---|---|---|
| 0 | Technical replicates only (Poisson) | 100 | 10 | 10% |
| 0.01 | Very well-controlled experiment | 200 | 14 | 14% |
| 0.05 | Typical cell-line experiment | 600 | 24 | 24% |
| 0.1 | Moderate biological noise | 1,100 | 33 | 33% |
| 0.5 | Highly variable system (immune genes) | 5,100 | 71 | 71% |
How to read this table in practice: In a well-controlled mouse knockout study with 3 biological replicates per group, most protein-coding genes have \(\alpha\) between 0.01 and 0.1. Cytokine genes, immune receptors, and cell-cycle regulators often exceed 0.5 — their expression is genuinely more variable across individuals. The NB is not flagging them as “noisy”; it is correctly modeling their true biological variability. The \(\alpha\) parameter separates the question “is this gene variable?” from “is this gene differentially expressed?” — two questions that a Poisson-based analysis would conflate.
From Distribution to Inference: The Estimation Problem
The NB gives us the correct distributional form. But differential expression requires estimating \(\mu\) (expression level) and \(\alpha\) (dispersion) for each gene from the data. Here a practical problem emerges.
With \(n = 3\) biological replicates per condition, we have 3 data points from which to estimate both \(\mu\) and \(\alpha\) for each gene. Estimating two parameters from three observations is statistically perilous: per-gene dispersion estimates are wildly unreliable at low replication. A gene whose three counts happen to be 101, 102, 103 would produce \(\alpha \approx 0\) under per-gene estimation, while a gene with counts 50, 100, 150 produces a large \(\alpha\). Neither estimate reflects the true dispersion reliably.
The key insight that makes RNA-seq analysis work: dispersion varies smoothly with mean expression. Highly expressed genes tend to have lower dispersion; lowly expressed genes tend to have higher dispersion, partly because the counts are noisier and partly because lowly expressed genes include more biological noise. This mean-dispersion relationship means we can borrow information across genes.
DESeq2: Empirical Bayes Dispersion Estimation
DESeq2 operationalizes this insight through empirical Bayes shrinkage. The workflow proceeds in three steps:
- Per-gene dispersion estimation: For each gene \(i\), estimate \(\alpha_i\) from its count data alone. These raw estimates are noisy but unbiased.
- Trend fitting: Fit a smooth curve through the cloud of per-gene dispersion estimates plotted against mean expression. The curve captures the systematic relationship: \(\alpha\) decreases with \(\mu\).
- Shrinkage: Shrink each per-gene estimate \(\hat{\alpha}_i\) toward the trend line. Genes whose per-gene estimate is unreliable (typically lowly expressed genes with high sampling noise) are pulled more aggressively. Highly expressed genes, whose estimates are more reliable, are pulled less.
Concrete example. A gene with mean \(\mu = 10\) and 3 replicates has raw dispersion \(\hat{\alpha} = 0.8\), but the trend curve at \(\mu = 10\) predicts \(\alpha_{\text{trend}} = 0.2\). The empirical Bayes procedure estimates the reliability of \(\hat{\alpha}\) and shrinks it to perhaps \(\alpha_{\text{shrunk}} = 0.25\) — close to the trend because 3 data points provide little information for a reliable per-gene estimate. A highly expressed gene with \(\mu = 10{,}000\) and raw \(\hat{\alpha} = 0.05\) when the trend predicts \(0.03\) shrinks only to \(\alpha_{\text{shrunk}} = 0.035\) — the per-gene estimate is trusted more because the high count data carry more information. The shrinkage is not arbitrary: it is a weighted average of the per-gene estimate and the trend, weighted by the precision (inverse variance) of each.
This is the core workflow:
library(DESeq2)
# Step 1: Build the data object
dds <- DESeqDataSetFromMatrix(
countData = counts, # raw integer counts — never normalized values
colData = metadata, # sample information (condition, batch, etc.)
design = ~ condition # the model: what you are testing
)
# Step 2: Estimate dispersions (NB + empirical Bayes shrinkage)
dds <- estimateSizeFactors(dds) # normalize for sequencing depth
dds <- estimateDispersions(dds) # shrunken NB dispersion estimates
# Step 3: Fit NB GLMs and test for differential expression
dds <- nbinomWaldTest(dds)
results <- results(dds, contrast = c("condition", "treated", "control"))
The estimateDispersions call is where the NB variance formula meets the data. DESeq2 fits per-gene NB models, extracts raw \(\alpha\) estimates, computes the trend, and applies shrinkage — all in a single function.
The Wald test (nbinomWaldTest) then fits a Negative Binomial Generalized Linear Model for each gene with a log link function:
\[\log(\mu_{ij}) = \beta_0 + \beta_1 \cdot \text{condition}_j\]
where \(\mu_{ij}\) is the expected count for gene \(i\) in sample \(j\), and the log link ensures the predicted count is always positive. The coefficient \(\beta_1\) represents the log2 fold change between conditions. The shrunken dispersion estimates from estimateDispersions are used as the variance parameter in the NB likelihood, and the Wald test computes a p-value for whether \(\beta_1 \neq 0\) — i.e., whether the condition effect is detectable above the biological noise captured by \(\alpha\). The GLM framework generalizes naturally to more complex designs with multiple factors, batch effects, and continuous covariates: any design matrix can replace the simple condition vector.
A critical constraint: DESeq2 requires raw integer counts. Normalized values (TPM, FPKM) break the NB variance model because the discrete mean-variance relationship — \(\mu + \alpha\mu^2\) — is defined for raw counts, not for normalized ratios. If you feed normalized counts into DESeq2, the dispersion estimates are meaningless.
Biological Interpretation
The NB dispersion parameter \(\alpha\) is not merely a mathematical convenience — it is a biologically interpretable quantity.
What high \(\alpha\) means. A gene with \(\alpha = 0.5\) is not “noisy” or “poorly measured.” Its expression genuinely varies 2-3 fold across individuals in the same condition. This could reflect true biological heterogeneity: immune genes, stress response genes, and cell cycle regulators are often highly dispersed because they respond to environmental and physiological states that differ between individuals. The interpretative question — as with any statistical signal — is whether the dispersion reflects true biology or technical artifacts. Batch effects, library quality differences, and sample degradation all inflate dispersion estimates artificially.
The dispersion-mean trend as a diagnostic tool. After running DESeq2, the dispersion-mean plot is the first thing to examine. In a clean experiment, most genes cluster around a smoothly decreasing trend line — \(\alpha\) falling as \(\mu\) rises, consistent with the NB variance model. Genes falling far above the trend warrant investigation: they may represent technical artifacts (a batch of samples with systematically different library quality), or they may be genuinely hypervariable genes under strong but heterogeneous regulation. Either way, they are candidates for closer inspection, not automatic exclusion.
Power and the quadratic term. The \(\alpha\mu^2\) term has a direct practical consequence: for highly expressed genes, biological noise dominates statistical power. A gene with \(\mu = 10{,}000\) and \(\alpha = 0.01\) has biological variance \(\alpha\mu^2 = 1{,}000{,}000\) — vastly exceeding the Poisson variance of \(10{,}000\). Adding more sequencing depth does not help; the bottleneck is biological replication. The NB variance formula makes explicit what experienced experimentalists know intuitively: for well-expressed genes, statistical power comes from more replicates, not deeper sequencing. Doubling the sequencing depth reduces sampling noise but does nothing to reduce biological variation.
The reviewer mindset. The Poisson model applied to biological replicates is the single most common statistical error in published omics studies. A Poisson-based differential expression test on biological replicates produces a severe excess of false positives — genes reported as “significant” when the signal is simply normal biological variation. When evaluating a paper, check whether the differential expression method accounts for overdispersion. If the authors used a Poisson model, or — worse — a t-test on raw counts, the results are unreliable.
Johnson’s Rule: If you see a paper using a t-test on raw RNA-seq counts, the reviewer failed. Count data requires count distributions.
The Poisson in Other Omics Contexts
The Poisson distribution appears in several other bioinformatics applications, though with the same caveat: it models technical noise, not biological variation.
ChIP-seq peak calling (MACS2) models the expected number of reads in a genomic window under the null hypothesis of no binding as a Poisson distribution. An observed count exceeding the Poisson expectation past a significance threshold is called a peak:
# Simplified MACS2 logic: Poisson p-value for observed enrichment
p_peak <- ppois(observed_reads - 1, lambda = lambda_background, lower.tail = FALSE)
Phred quality scores are Binomial/Poisson-derived error probabilities: \(Q = -10 \log_{10}(P_{\text{error}})\). Q30 means \(P_{\text{error}} = 0.001\), or 1 expected error per 1,000 bases.
In both cases, the Poisson serves as a null model for technical noise — and in both cases, biological replicates require the same extension to NB or related overdispersed distributions. The principle generalizes: whenever the null assumes only sampling noise, the alternative must account for biological variation.
Current Landscape (Q2 2026)
- Zero-inflated Negative Binomial (ZINB) models remain standard for scRNA-seq but are being challenged by NB + empirical Bayes approaches (sctransform) that argue zero-inflation is primarily technical dropout rather than biological excess zeros — a debate that mirrors the original Poisson-vs-NB argument at higher resolution.
- Compositional models for relative abundance data (ANCOM-BC, ALDEx2) are increasingly adopted for microbiome data where sequencing depth normalization is insufficient and the NB assumption of absolute counts does not hold, because the data are inherently compositional rather than count-based.
- Bayesian hierarchical models for ultra-sparse spatial transcriptomics data are an active development area, extending NB models with spatial priors to borrow information across neighboring tissue positions.
- Count-based deep learning (scVI, scTransform via neural networks) uses NB and ZINB as output distributions in variational autoencoders, preserving the correct noise model while learning non-linear latent representations.
Summary and Required Reading
- Count data requires count distributions — the Gaussian is invalid for individual raw counts.
- Poisson (\(\sigma^2 = \mu\)) models technical noise only — it fails on biological replicates because it cannot account for biological variation.
- Overdispersion (\(\sigma^2 \gg \mu\)) is universal in biological data — the NB captures it with a single parameter \(\alpha\), yielding the variance formula \(\sigma^2 = \mu + \alpha\mu^2\).
- The NB is a Poisson-Gamma mixture — each replicate has its own expression rate \(\lambda\) drawn from a Gamma distribution, and the observed count is Poisson(\(\lambda\)).
- DESeq2 uses empirical Bayes to solve the estimation problem — with few replicates per gene, dispersion estimates are unreliable and must be shrunk toward a mean-dispersion trend.
- The \(\alpha\mu^2\) term explains why replication matters more than depth — for well-expressed genes, biological variation dominates, and deeper sequencing cannot reduce it.
Required Reading
- Love, Huber & Anders: “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2” (Genome Biology, 2014).
- Casella & Berger: Statistical Inference (Ch 3-4) — Poisson and Negative Binomial derivations, Poisson-Gamma mixture.
- Robinson & Smyth: “Small-sample estimation of negative binomial dispersion” (Biostatistics, 2008).
Johnson’s Rule: The Poisson is the correct model for technical replicates. It is almost always the wrong model for biological replicates.