Chapter 11: Transcriptomics and Expectation-Maximization

Published

June 5, 2026

Modified

June 19, 2026

Johnson’s First Principle: Quantification is a Mixture Model Estimation Problem

A read that maps to three different transcripts is a single physical molecule whose biological origin is ambiguous. A naive algorithm that counts it three times creates data out of thin air. The Expectation-Maximization (EM) algorithm resolves this ambiguity by distributing each read as a fractional probability across its possible origins.

Core Concepts

Splice-Aware Alignment: Why DNA Aligners Fail on RNA-seq

DNA aligners (BWA, Bowtie2) assume contiguous alignment. A read spanning an exon-exon junction requires a deletion of the entire intron; thousands of bases. The aligner assigns MAPQ 0 and discards the read, losing up to 50% of RNA-seq data.

STAR (Dobin et al., 2013) solves this with a suffix array (see Chapter 8) that finds maximal mappable seeds (MMPs) from each read. When seed extension fails (the read crosses an intron), STAR starts a new seed. Seeds separated by a genomic gap represent splice junctions, validated against canonical splice site motifs (GT-AG = >99% true junctions). Two-pass mode: discover junctions from all samples, then re-align with the combined junction index.

Expectation-Maximization for Transcript Quantification

Consider a gene with three isoforms sharing exons. A read from a shared exon could originate from any isoform, but we cannot determine which one from the read sequence alone. The EM algorithm resolves this ambiguity iteratively.

The model. For each read \(r\) and transcript \(t\), \(P(r \mid t) = 1\) if the read maps to the transcript (accounting for fragment length and position), \(0\) otherwise. This defines the compatibility between reads and transcripts; the only data available to resolve ambiguity.

E-step (Expectation): Given current abundance estimates \(\theta\), compute the probability that each read \(r\) originated from each transcript \(t\):

\[P(r \in t \mid \theta) = \frac{\theta_t \cdot P(r \mid t)}{\sum_{t'} \theta_{t'} \cdot P(r \mid t')}\]

A uniquely-mapping read gets \(P = 1\) for its transcript, \(0\) for all others. A multi-mapping read is split: \(\theta_t\) acts as a prior weight; more abundant transcripts receive a larger fraction of the ambiguous read.

M-step (Maximization): Update transcript abundances from the fractional assignments:

\[\theta_t = \frac{\sum_r P(r \in t \mid \theta)}{\sum_{t'} \sum_r P(r \in t' \mid \theta)}\]

The numerator is the effective read count for transcript \(t\); the denominator normalizes across all transcripts.

Concrete example. Two isoforms sharing one exon, 10 total reads: 6 unique to isoform 1, 2 unique to isoform 2, 2 ambiguous (map to both). Initialize \(\theta_1 = \theta_2 = 0.5\).

Iteration 1; E-step: the 6 unique reads assign fully to isoform 1, the 2 unique to isoform 2. Each ambiguous read splits: \(P(\text{iso1}) = 0.5\), \(P(\text{iso2}) = 0.5\). M-step: \(\theta_1 = (6 + 0.5 + 0.5)/10 = 0.70\), \(\theta_2 = (2 + 0.5 + 0.5)/10 = 0.30\).

Iteration 2 (E-step: ambiguous reads now split \(0.70:0.30\). M-step: \(\theta_1 = (6 + 0.7 + 0.7)/10 = 0.74\), \(\theta_2 = (2 + 0.3 + 0.3)/10 = 0.26\). Convergence to \(\sim 0.75:0.25\) within 5 iterations) the ambiguous reads are distributed in proportion to the uniquely-mapping evidence.

The EM algorithm maximizes the likelihood \(P(\text{all reads} \mid \theta)\); the probability of observing the entire read set given the transcript abundances, and guarantees monotonic increase toward a local maximum.

The transcriptome reference. EM depends on the completeness of the annotation. If a novel isoform exists but is absent from the reference, its reads are treated as ambiguous among the closest annotated isoforms, and EM distributes them across incomplete representatives, biasing all abundance estimates. This is why novel transcript discovery (StringTie, IsoQuant) should precede quantification in discovery-oriented studies.

Salmon and Kallisto: pseudoalignment. Rather than aligning each read to the genome, Salmon and Kallisto hash each read’s k-mers against a transcriptome index. Reads sharing the same set of compatible transcripts form an equivalence class (e.g., “all reads compatible with transcripts {1, 3, 5}”). EM runs on class-level aggregate counts rather than individual reads, reducing the optimization from millions of objects to thousands; a 100x speedup with equivalent gene-level accuracy.

Normalization: Why RPKM/FPKM Are Wrong

RPKM/FPKM normalize by gene length and sequencing depth but are not comparable across samples:

\[\text{RPKM} = \frac{\text{counts}}{\text{gene length (kb)} \times \text{total mapped reads (millions)}}\]

The sum of RPKM values across all genes differs systematically between samples because the denominator (total mapped reads) is dominated by highly expressed genes, when one gene doubles, all others mechanically appear to decrease. TPM (transcripts per million) fixes this:

\[\text{TPM}_i = \frac{\text{counts}_i / \text{length}_i}{\sum_j \text{counts}_j / \text{length}_j} \times 10^6\]

The denominator \(\sum_j (\text{counts}_j / \text{length}_j)\) is the sum of per-gene rates across all genes. Normalizing by this value rather than total mapped reads removes the artifact where a single highly expressed gene deflates all others: TPM is a proportion, so its sum is always 1 million regardless of sequencing depth or expression distribution.

Differential Expression: The Standard Workflow

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
dds <- DESeq(dds)                 # NB dispersion + Wald test
results <- results(dds, contrast = c("condition", "treated", "control"))

DESeq2 fits a Negative Binomial GLM per gene, shrinks dispersion estimates via empirical Bayes, and computes Wald test p-values with Benjamini-Hochberg FDR correction. The input must be raw integer counts; normalized values (TPM, FPKM) break the NB variance model.

Biological Interpretation

TPM is the correct relative abundance metric for within- and between-sample comparison. The EM algorithm depends on the completeness of the transcriptome reference: if a novel isoform is present but missing from the annotation, its reads are assigned to the nearest annotated isoform, biasing all estimates. This is why novel transcript discovery (StringTie, IsoQuant) should precede quantification in discovery-oriented studies.

RNA-seq quantification artifacts can mimic biological signals: GC bias (systematic underrepresentation of AT-rich or GC-rich fragments), 3’ bias in degraded RNA (apparent downregulation of 5’ exons), and fragment length bias in FFPE samples. These must be diagnosed before biological interpretation.

The choice of quantification method matters: alignment-based (STAR + featureCounts) is more accurate at gene level; pseudoalignment (Salmon, Kallisto) is faster and more accurate at transcript level. For standard gene-level differential expression, either is acceptable.

Summary and Required Reading

Exon-exon junctions require splice-aware alignment: DNA aligners (BWA) lose up to 50% of RNA-seq reads.
EM resolves multi-mapping ambiguity: Salmon/Kallisto use k-mer equivalence classes for 100x speedup.
TPM is the correct normalization metric: RPKM/FPKM are not comparable across samples.
DESeq2 uses Negative Binomial GLMs with empirical Bayes dispersion shrinkage: input must be raw counts.

Required Reading

Dobin et al.: “STAR: ultrafast universal RNA-seq aligner” (Bioinformatics, 2013).
Patro et al.: “Salmon provides fast and bias-aware quantification of transcript expression” (Nature Methods, 2017).

Johnson’s Rule: A read that maps to three transcripts is not a problem; it is data. Ignoring it is the problem.