Chapter 6: Statistical Inference and Confounder Management

Johnson’s First Principle: Every P-value is Conditional on What You Model

If you run a differential expression test comparing 10 cancer patients against 10 healthy controls and produce a list of 5,000 “significant” genes, you have not discovered the genomic signature of cancer. If the cancer patients were sequenced in January and the controls in March, you have discovered the signature of a sequencing batch. A p-value is always conditional on what you chose to include in your model — and every confounder you omit inflates the variance, producing false positives that look like biological signal.

The problem is structural: confounders (batch effects, library quality differences, population structure) increase the variance of the measurement. If you do not model them, the extra variance is absorbed into the error term, and the statistical test becomes systematically anticonservative — it calls genes significant when the observed difference is due to technical variation rather than biology.


Core Concepts

How Confounders Become False Positives

Consider a gene whose expression does not differ between cancer and control. Under the null hypothesis, the p-value should be uniformly distributed: 5% of genes should be called significant at \(\alpha = 0.05\). Now introduce a confounder: the cancer samples were processed in a batch with slightly lower library quality, producing systematically lower counts. The test now sees a systematic difference between groups that has nothing to do with cancer. The variance across samples increases because the batch effect adds another source of variation. The p-values shift left, and genes that are not biologically differential are called significant.

Every technique in this chapter — choosing the correct distributional test, controlling the false discovery rate, building proper design matrices, applying empirical Bayes moderation, and detecting batch effects — is a defense against this single failure mode: unmodeled confounders inflating the false positive rate.

Hypothesis Testing: The Foundation

Every statistical test begins with a null (\(H_0\): no difference) and alternative (\(H_1\): difference exists). The p-value is \(P(\text{data} \mid H_0)\) — the probability of observing data as extreme as what was observed, assuming the null is true. It is not \(P(H_0 \mid \text{data})\) — the probability that the null is true given the data. This distinction is the most commonly misunderstood concept in science. A p-value of 0.01 does not mean “99% chance the effect is real.”

The choice of test depends on the distributional assumptions of the data:

Data Type Distribution Appropriate Test Omics Context
Raw counts Negative Binomial DESeq2 / edgeR Wald test Bulk RNA-seq differential expression
Log-normalized Approximately normal Student’s t / limma Processed microarray or RNA-seq
scRNA-seq counts Zero-inflated, non-normal Wilcoxon rank-sum Marker gene identification
Contingency tables Hypergeometric Fisher’s exact GO/KEGG enrichment

Raw RNA-seq counts are not normal — they are discrete, overdispersed, and heteroskedastic. Chapter 4 established that the Negative Binomial is the correct distribution, and DESeq2’s empirical Bayes shrinkage (detailed in that chapter) is the correct estimation framework for bulk data.

Multiple Testing: The Arithmetic of Confounded Signal

Testing 20,000 genes at \(\alpha = 0.05\) guarantees \(20{,}000 \times 0.05 = 1{,}000\) false positives by definition, even with no confounders. Adding unmodeled confounders makes this worse.

Bonferroni correction controls the family-wise error rate (FWER) — the probability of making even one false discovery — by dividing \(\alpha\) by the number of tests:

\[\alpha_{\text{Bonferroni}} = \frac{0.05}{20000} = 2.5 \times 10^{-6}\]

Bonferroni is conservative: it guarantees no false positives but destroys power, especially for small-effect genes.

The Benjamini-Hochberg False Discovery Rate (FDR) is the standard in omics. It controls the expected proportion of false positives among the discoveries:

\[q_{(i)} = \min_{j \geq i} \left( \frac{m}{j} p_{(j)} \right)\]

Concrete example.

Testing 10 genes (\(m=10\)), the raw p-values sorted ascending are \(p_{(1)} = 0.001\), \(p_{(2)} = 0.01\), \(p_{(3)} = 0.02\), \(p_{(4)} = 0.04\), \(p_{(5)} = 0.20\), …

For each gene, multiply \(m/j\) by its p-value: \(q_{(1)} = 10/1 \times 0.001 = 0.01\), \(q_{(2)} = 10/2 \times 0.01 = 0.05\), \(q_{(3)} = 10/3 \times 0.02 = 0.067\)

But \(q_{(2)} = 0.05\) is less than \(q_{(3)} = 0.067\), violating monotonicity. The \(\min_{j \geq i}\) step fixes this: \(q_{(2)} = \min(0.05, 0.067) = 0.05\), \(q_{(3)} = \min(0.067, ...) = 0.067\).

At FDR < 0.05, genes 1 and 2 are significant — the procedure accepts \(m \times \text{FDR} = 10 \times 0.05 = 0.5\) expected false positives, compared to Bonferroni which would require \(p < 0.05/10 = 0.005\) and call only gene 1.

FDR of 0.05 means: “of the genes I call significant, I accept that 5% are false positives.” This is a weaker guarantee than Bonferroni (which ensures zero false positives), but it preserves power for the majority of true signals — the tradeoff that makes FDR the standard in discovery-oriented omics. In modern omics, raw p-values are ignored in favor of FDR-adjusted q-values.

Why this matters for confounder management Unmodeled confounders produce a left-shift in the p-value distribution — more small p-values than expected under the null. This manifests as an unusually large number of significant genes at a given FDR threshold. If a differential expression analysis produces 5,000 significant genes at FDR < 0.05 in a experiment with modest biological effect, suspect unmodeled confounders before celebrating biological discovery.

The p-value histogram as a diagnostic. The simplest confounder diagnostic is a histogram of raw p-values before multiple testing correction. Under the global null (no true positives), the p-values are uniformly distributed between 0 and 1 — each bin has roughly equal counts. If some genes are truly differential, the histogram shows a spike near zero (true positives) on top of a uniform background (null genes). If the histogram shows a broad left skew — too many p-values in the 0.01-0.10 range without a sharp spike at zero — this signals unmodeled confounders inflating the variance of null genes. A right skew suggests a misspecified model or overly conservative test. The p-value histogram should be the first diagnostic plot before any FDR adjustment.

Linear Models: Modeling Variance Explicitly

The general linear model \(y = X\beta + \epsilon\) converts experimental metadata into a design matrix \(X\) that partitions variance into modeled components (biological signal and known confounders) and residual error.

A two-group comparison with no confounders:

group <- factor(c("control", "control", "treatment", "treatment"))
X <- model.matrix(~ group)
# Intercept = control mean; grouptreatment = difference from control

With confounders included, their variance is absorbed by the model rather than inflating the error term:

# Cell line variance is absorbed so it does not inflate error for treatment
design <- model.matrix(~ cell_line + treatment)

The key principle: any variable you omit from the design matrix goes into the error term, increasing residual variance and reducing power. This is why including known confounders (batch, sex, age) is always correct — even if the confounder is not significant, modeling it cannot hurt and can only help.

Mixed models extend this to random effects — sources of variance that are not of primary interest but have multiple levels:

lmer(expression ~ drug + (1 | patient), data = df)

The (1 | patient) term estimates the variance of patient effects rather than estimating each patient as a separate fixed effect — more parsimonious when patients are a random sample from a population.

limma: Empirical Bayes Variance Moderation

limma is the gold standard for differential expression of normalized data. Its innovation directly parallels DESeq2 (Chapter 4): with few replicates (3 per group), per-gene variance estimates are noisy. limma borrows information across all 20,000 genes to stabilize the variance, using empirical Bayes moderation:

library(limma)
v <- voom(dge, design)       # Mean-variance modeling (log-normalized)
fit <- lmFit(v, design)       # Linear model per gene
fit <- eBayes(fit)            # Empirical Bayes variance moderation
results <- topTable(fit, coef = "treatment", number = Inf, adjust = "BH")

The parallel with DESeq2 is intentional: both methods face the same fundamental problem (few replicates, many genes) and solve it with the same mathematical framework (empirical Bayes shrinkage). DESeq2 shrinks the NB dispersion \(\alpha\) toward a mean-dispersion trend; limma shrinks the gene-wise variance toward a common prior. The biological interpretation is identical: genes with unreliable per-gene variance estimates are stabilized by borrowing strength from the ensemble.

Batch Effects: Detection and Correction

Batch effects are the most common and dangerous confounders in omics data. They explain a substantial fraction of total variance — often 30-50% in typical genomics experiments — while biological signal of interest frequently explains less than 5%. The first diagnostic is PCA or MDS colored by technical variables:

plotMDS(v, col = ifelse(batch == "batch1", "red", "blue"),
        main = "MDS colored by sequencing batch")

If batch-colored plots show separation but condition-colored plots do not, the technical artifact dominates the biological signal.

ComBat (Johnson, Li & Rabinovic, 2007) corrects batch effects using empirical Bayes. Its model for gene \(g\), sample \(j\), batch \(i\):

\[Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\epsilon_{ijg}\]

ComBat estimates additive (\(\gamma\)) and multiplicative (\(\delta\)) batch effects, shrinks them toward a global prior via empirical Bayes, and subtracts them while preserving the biological signal \(X\beta_g\):

library(sva)
corrected <- ComBat(dat = expression_matrix, batch = batch_vector,
                    mod = model.matrix(~ condition))

Critical caveat: ComBat assumes biological groups are balanced across batches. If all cases are in batch 1 and all controls in batch 2, batch is perfectly confounded with condition — the model is unidentifiable. ComBat will run without errors and produce adjusted data, but the correction will remove biological signal alongside technical noise, and the researcher will never know. This is not a software bug; it is a mathematical impossibility.

When the confounder is unknown: Surrogate Variable Analysis (SVA). ComBat requires the batch variable to be known (sequencing run, processing date, technician). But many confounders are unknown — subtle environmental variation, reagent lot changes, or instrument drift that affects expression but was not recorded. SVA infers these latent factors directly from the expression data. It decomposes the expression matrix into known biological variation (modeled via the design matrix), residual noise, and estimated surrogate variables — hidden factors that explain systematic variation not captured by the known covariates. These surrogate variables are then included as covariates in the differential expression model:

sv <- sva(expression_matrix, model_matrix, null_model)
design_sva <- cbind(design, sv)

SVA is not a replacement for ComBat — it addresses a different failure mode. Use ComBat for known batch variables; use SVA when you suspect hidden confounders. Both rely on the same empirical Bayes machinery, and both fail silently when biological signal is confounded with technical variation.


Biological Interpretation

Batch effects are a design problem, not a computational problem. The single most important decision in any omics experiment is randomizing samples across batches before sequencing. If cases and controls are run in separate batches, no amount of computational correction can recover the biological signal — the confound is structurally inseparable from the biology.

A lab that tells you “we will correct batch effects with ComBat after sequencing” without randomizing the sample layout is making a statistical error that cannot be fixed. Batch correction algorithms assume balanced designs. Confounded designs violate that assumption silently. The algorithm produces output, but the output is not interpretable.

The FDR is the minimum standard. If a paper reports differential expression results without mentioning multiple testing correction, the reported gene list is unreliable. At FDR < 0.05 with 20,000 genes tested, approximately 1,000 false positives are expected if raw p-values were used — and that is before accounting for unmodeled confounders.

The diagnostic checklist for a differential expression result:

  1. Was the experimental design balanced across batches? If not, the result is unreliable regardless of the statistical method.
  2. Was multiple testing correction applied (FDR or Bonferroni)? If only raw p-values are reported, expect ~1,000 false positives.
  3. Does the number of significant genes match the expected effect size? An implausibly large number suggests unmodeled confounders.
  4. Was the correct distributional model used? Raw counts require Negative Binomial (DESeq2 or edgeR), not a t-test or Poisson.

Current Landscape (Q2 2026)

  • Harmony (iterative clustering-based correction) and scVI (VAE-based) are replacing ComBat for single-cell batch correction, offering non-linear correction that preserves population structure across batches.
  • RemoveBatchEffect (limma) is being replaced by latent factor regression methods that jointly model biological and technical factors in a unified framework, avoiding the two-step “correct then analyze” workflow.
  • Causal inference frameworks (causalMM, DML) are entering omics for estimating treatment effects from observational data with confounder adjustment, moving beyond associational p-values to causal effect estimates.
  • The “batch correction as preprocessing” paradigm is being questioned: correcting before splitting train/test introduces data leakage in ML pipelines. Batch correction should be fit on training data and applied to test data, not applied globally.

Summary and Required Reading

  1. A p-value is conditional on the model — every omitted confounder inflates the error term and produces false positives. The entire chapter is a defense against this single failure mode.
  2. Use the correct distributional test — raw counts → Negative Binomial (DESeq2/edgeR). Log-normalized → limma. scRNA-seq → Wilcoxon.
  3. Correct for multiple testing — FDR (Benjamini-Hochberg) is the minimum standard. Raw p-values guarantee 1,000 false positives at 20,000 tests.
  4. Model confounders in the design matrix — every omitted confounder inflates residual variance. Mixed models for hierarchical structure.
  5. Empirical Bayes moderation (limma) parallels DESeq2 — both solve the few-replicates problem by borrowing strength across genes.
  6. Batch effects dominate omics variance (30-50% of total) — detect with PCA colored by batch, correct with ComBat. Never correct a confounded design.
  7. Batch randomization is the only real solution — if cases and controls are in separate batches, the design is unidentifiable.

Required Reading

  • Benjamini & Hochberg: “Controlling the False Discovery Rate” (JRSS B, 1995).
  • Johnson, Li & Rabinovic: “Adjusting batch effects in microarray expression data using empirical Bayes methods” (Biostatistics, 2007).
  • Smyth: “Linear models and empirical Bayes methods for assessing differential expression” (SAGMB, 2004).

Johnson’s Rule: Batch effects are the single greatest threat to omics data. Plan for them before you sequence. If you cannot detect a batch effect, you did not look hard enough.