Chapter 28: Pathogen Genomics and Genomic Epidemiology
Johnson’s First Principle: The Mutation Rate is a Molecular Clock
Every replication introduces errors. In a virus with a genome of 10,000 bases and a mutation rate of \(10^{-4}\) per base per replication, each infected individual carries a slightly different viral population. These mutations accumulate over time, providing a record of transmission history, geographic spread, and evolutionary adaptation.
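As a worked check on these numbers, assume mutations arise independently at each site, so that the number of new mutations per genome copy is approximately Poisson. This is a simplifying assumption added here for illustration. With genome length \(L\) and per-base rate \(\mu\):

\[
\lambda = L\mu = 10^{4} \times 10^{-4} = 1, \qquad P(\text{no new mutation}) = e^{-\lambda} = e^{-1} \approx 0.37
\]

Under this simplified model, roughly two in three genome copies differ from their template by at least one base.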
Core Concepts
The Molecular Clock
The molecular clock refers to the approximately constant rate at which substitutions accumulate in a genome over time. For RNA viruses (SARS-CoV-2, influenza, HIV), the substitution rate is approximately \(10^{-4}\) to \(10^{-3}\) substitutions per site per year, which for SARS-CoV-2 works out to roughly 1-2 substitutions per genome per month.
This rate is fast enough to track transmission chains in near-real-time, but slow enough that sequences from the same outbreak cluster together phylogenetically.
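The conversion from a per-site, per-year rate to a per-genome, per-month rate is simple arithmetic, sketched below. The genome length of about 29,900 bases and the two example rates are illustrative assumptions chosen from within the range quoted above, not fitted estimates.

```python
def substitutions_per_genome_per_month(rate_per_site_per_year, genome_length):
    """Convert a per-site, per-year rate to expected substitutions per genome per month."""
    return rate_per_site_per_year * genome_length / 12.0


# Assumed illustrative values (not taken from the text above).
SARS_COV_2_LENGTH = 29_900  # approximate genome length in bases
low = substitutions_per_genome_per_month(4e-4, SARS_COV_2_LENGTH)
high = substitutions_per_genome_per_month(8e-4, SARS_COV_2_LENGTH)
print(f"{low:.1f} to {high:.1f} substitutions per genome per month")
```

With these assumed inputs the result falls in the 1-2 per month range quoted above.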
Phylogenetic Trees and Phylodynamics
Phylogenetic reconstruction (IQ-TREE, FastTree, RAxML-NG) builds a tree relating pathogen sequences. Branch lengths reflect genetic distance; tree topology reflects transmission relationships.
Substitution models specify the relative rates of different nucleotide changes during evolution. The general time-reversible (GTR) model, the standard for most pathogen phylogenies, estimates six exchangeability rate parameters (transitions like A↔G are typically 2-5× faster than transversions like A↔C) and is commonly paired with a gamma distribution to account for variable rates across sites (GTR+Γ). The choice of model affects branch length estimation: an overly simple model undercounts multiple substitutions at saturated sites and so underestimates divergence, most severely on long branches.
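The underestimation at saturated sites can be illustrated with the simplest closed-form case. The sketch below compares the raw proportion of differing sites (which applies no correction for multiple substitutions) with the Jukes-Cantor (JC69) corrected distance. JC69 is far simpler than GTR+Γ and is used here only because it has a one-line formula; the function names are illustrative.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ, skipping gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jc69_distance(p):
    """JC69 distance (substitutions per site) for an observed proportion p of differing sites."""
    if p >= 0.75:
        return math.inf  # fully saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.10, 0.30, 0.60):
    print(f"observed p = {p:.2f}  JC69 distance = {jc69_distance(p):.3f}")
```

For small observed differences the two measures nearly coincide, while at an observed proportion of 0.60 the corrected distance is roughly double the raw value. Closely related outbreak sequences sit at the low end, so model choice matters most for deeper divergences.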
Phylodynamics combines phylogenetics with epidemiology: the shape of the tree encodes information about population dynamics. A “star-shaped” tree (many short branches radiating from a single point) indicates rapid exponential growth. A “ladder-shaped” tree indicates sustained transmission with ongoing lineage turnover.
Molecular clock dating (BEAST, TreeTime) calibrates the tree in calendar time: tip dates (the sampling date of each sequence) combined with a clock model yield an estimate of the time to most recent common ancestor (tMRCA) for any clade. Three clock models differ in their assumptions: a strict clock assumes all lineages evolve at the same rate (rarely realistic for real pathogen populations); a relaxed lognormal clock allows each branch to have a different rate drawn from a lognormal distribution (the default for most phylodynamic analyses); and a relaxed exponential clock draws branch rates from an exponential distribution, which has a single parameter. The choice of clock model affects tMRCA estimates: forcing a strict clock onto lineages whose rates differ can bias divergence dates, with the direction of the bias depending on which lineages evolve faster or slower than average.
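How tip dates calibrate a strict clock can be seen in a root-to-tip regression, a common exploratory check on temporal signal. The sketch below fits root-to-tip genetic distance against sampling date by ordinary least squares: the slope estimates the clock rate and the x-intercept estimates the date of the root. It is a simplified stand-in, not the inference procedure of BEAST or TreeTime, and the data are hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, t_root): the slope in substitutions per site per year and
    the date at which the fitted line reaches zero distance.
    """
    n = len(dates)
    if n != len(distances) or n < 2:
        raise ValueError("need at least two (date, distance) pairs")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("sampling dates must not all be identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in these data")
    t_root = mean_t - mean_d / rate  # x-intercept of the fitted line
    return rate, t_root


# Hypothetical sampling dates (decimal years) and root-to-tip distances
# (substitutions per site), constructed to lie exactly on a line.
dates = [2020.10, 2020.25, 2020.40, 2020.55, 2020.70]
distances = [0.00008, 0.00020, 0.00032, 0.00044, 0.00056]
rate, t_root = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.2e} per site per year, root date = {t_root:.2f}")
```

Real data scatter around the line. Strong scatter suggests that rates vary among lineages and that a relaxed clock may be more appropriate.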
Antimicrobial Resistance (AMR) Detection
Resistance arises through point mutations (target modification), horizontal gene transfer (acquisition of resistance genes on plasmids), or efflux pump upregulation.
Computational detection: align sequencing reads to the CARD (Comprehensive Antibiotic Resistance Database) or ResFinder databases. Horizontal gene transfer (HGT) detection requires identifying mobile genetic elements (transposases, integrases) flanking resistance genes in metagenomic or isolate assemblies.
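The flanking check described above can be sketched as a search over annotated features in an assembly. The `Feature` record, the `kind` labels, and the 5,000 bp window are all hypothetical choices for illustration; they do not correspond to the output format of CARD, ResFinder, or any particular annotation tool.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    contig: str
    start: int
    end: int
    name: str
    kind: str  # hypothetical labels: "amr", "mobile", or "other"


def amr_genes_near_mobile_elements(features, window=5000):
    """Return (amr_name, mobile_name, gap_bp) for each AMR gene that lies
    within `window` bp of a mobile element on the same contig."""
    amr = [f for f in features if f.kind == "amr"]
    mobile = [f for f in features if f.kind == "mobile"]
    hits = []
    for gene in amr:
        for element in mobile:
            if element.contig != gene.contig:
                continue
            # Distance between the two intervals; 0 if they overlap.
            gap = max(0, max(gene.start, element.start) - min(gene.end, element.end))
            if gap <= window:
                hits.append((gene.name, element.name, gap))
    return hits


# Hypothetical annotations for two contigs.
features = [
    Feature("contig_1", 1000, 2000, "transposase", "mobile"),
    Feature("contig_1", 2500, 3400, "blaCTX-M-15", "amr"),
    Feature("contig_1", 40000, 41000, "integrase", "mobile"),
    Feature("contig_2", 500, 1700, "tetA", "amr"),
]
print(amr_genes_near_mobile_elements(features))
```

Proximity alone is only suggestive of horizontal transfer. In fragmented assemblies a resistance gene and its mobile element often land on different contigs, so absence of a hit is weak evidence.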
Genomic Surveillance Infrastructure
Nextstrain provides real-time phylogenetic tracking of pathogen evolution. Built on Augur (bioinformatics pipeline: alignment → tree → dating → annotation) and Auspice (visualization), it had already been applied to pathogens such as influenza, Ebola, and Zika, and it supported real-time genomic epidemiology at global scale during the COVID-19 pandemic.
Essential concepts include lineage nomenclature (Pango lineages for SARS-CoV-2), the defining mutations of variants of concern, and the monitoring of immune escape mutations in the spike protein.
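The idea of defining mutations can be sketched as set matching: a sample is flagged as resembling a variant when it carries most of that variant's defining mutations. The variant labels, the short mutation lists, and the 0.8 threshold below are illustrative assumptions. The lists are small subsets, not complete or authoritative definitions.

```python
def match_variants(sample_mutations, definitions, min_fraction=0.8):
    """Return {variant: fraction of its defining mutations found in the sample}
    for every variant whose fraction is at least `min_fraction`."""
    observed = set(sample_mutations)
    matches = {}
    for variant, defining in definitions.items():
        if not defining:
            continue  # skip empty definitions to avoid dividing by zero
        fraction = len(observed & defining) / len(defining)
        if fraction >= min_fraction:
            matches[variant] = fraction
    return matches


# Illustrative subsets of spike mutations; labels are hypothetical.
definitions = {
    "Alpha-like": {"S:N501Y", "S:A570D", "S:P681H"},
    "Delta-like": {"S:L452R", "S:T478K", "S:P681R"},
}
sample = ["S:D614G", "S:L452R", "S:T478K", "S:P681R"]
print(match_variants(sample, definitions))
```

Production lineage assignment generally relies on richer evidence, such as placement of the genome on a reference phylogeny, because individual mutations recur independently in unrelated lineages.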
Biological Interpretation
A phylogenetic tree of pathogen sequences does not establish transmission direction; it is evidence of relatedness. A cluster of identical sequences from patients in the same hospital is strong evidence of nosocomial transmission, but identical sequences can also arise from a common unsampled source.
Recombination breaks the tree assumption. Many pathogens (HIV, SARS-CoV-2, influenza) undergo recombination or reassortment, producing genomes whose segments have distinct evolutionary histories. A bifurcating tree cannot represent a genome that is half lineage A and half lineage B — recombination creates phylogenetic networks, not trees. Recombination detection tools (RDP5, GARD) identify breakpoints and partition the alignment into regions with distinct phylogenies before tree inference. Ignoring recombination produces misleading branch lengths and false-positive signals of adaptive evolution.
Immune escape mutations must be interpreted in context: a mutation that reduces antibody neutralization in vitro may or may not affect vaccine effectiveness in vivo. What matters is not simply the number of mutations but their effect on the specific epitopes targeted by vaccine-induced immunity.
The molecular clock has a resolution limit. If substitutions accumulate at about 1 per genome per month, sequences sampled days apart are often indistinguishable, making within-outbreak transmission inference ambiguous.
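The resolution limit can be quantified with a short calculation. Assuming substitutions arrive as a Poisson process (a simplifying assumption), the probability that two genomes separated by a given amount of evolutionary time are identical is the probability of zero events. The function name and the 30-day month are illustrative choices.

```python
import math


def prob_identical(days_separating, rate_per_genome_per_month=1.0, days_per_month=30.0):
    """Probability of zero substitutions over the given evolutionary time,
    under a Poisson model of substitution."""
    expected = rate_per_genome_per_month * days_separating / days_per_month
    return math.exp(-expected)


for days in (3, 7, 14, 30):
    print(f"{days:>2} days apart: P(identical) = {prob_identical(days):.2f}")
```

At one substitution per genome per month, genomes separated by a week of evolution are expected to be identical about 79% of the time, so most links in a fast-moving outbreak carry no genetic signal at all.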
Current Landscape
- Wastewater-based epidemiology has matured into a quantitative surveillance tool that combines RT-qPCR for total viral load with sequencing for variant proportions.
- Nanopore adaptive sampling enables targeted sequencing of pathogen genomes directly from clinical samples without culture.
- Bayesian phylodynamic models now integrate multiple data sources (cases, hospitalizations, seroprevalence, mobility data) for real-time epidemic tracking.
- Machine learning methods (EVEmodel, Spike2) aim to predict future immune escape variants by simulating antibody-virus co-evolution.
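For the wastewater item in the list above, combining the two measurements is a matter of apportioning the total load by variant share. The sketch below is a minimal version that assumes read counts are an unbiased proxy for variant proportions; the function name and the numbers are hypothetical.

```python
def variant_loads(total_copies_per_litre, read_counts):
    """Split a total viral load among variants in proportion to read counts."""
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("no reads assigned to any variant")
    return {
        variant: total_copies_per_litre * count / total_reads
        for variant, count in read_counts.items()
    }


# Hypothetical sample: RT-qPCR total load plus variant-assigned read counts.
loads = variant_loads(200_000, {"variant_A": 750, "variant_B": 250})
print(loads)
```

Real samples are mixtures in which many reads cannot be assigned to a single variant, so practical pipelines estimate proportions jointly across many sites rather than from simple read tallies.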
Summary and Required Reading
- Hadfield et al.: “Nextstrain: real-time tracking of pathogen evolution” (Bioinformatics, 2018)
- Rambaut et al.: “A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology” (Nature Microbiology, 2020)
- Alcock et al.: “CARD 2023: expanded curation, support for machine learning, and resistome prediction at the Comprehensive Antibiotic Resistance Database” (Nucleic Acids Research, 2023)
Johnson’s Rule: A phylogenetic tree is a hypothesis of relatedness, not a proven transmission chain. The molecular clock only ticks so fast.