Chapter 13: Metagenomics and the Human Microbiome
Johnson’s First Principle
The majority of microbial life on Earth, and in the human body, has never been cultured. A standard microbiological plate selects for the culturable minority. Metagenomics is the only lens that reveals the unculturable majority. Before you can interpret a microbiome, you must first understand that the data you are analyzing is a compositional shadow: a relative, not absolute, measure of community structure.
Core Concepts
The Uncultured Majority and Shotgun Metagenomics
Microbial culture plates capture only species capable of growing under laboratory oxygen concentrations, nutrient formulations, and temperature conditions. Estimates suggest fewer than 1% of environmental microbes are culturable. Shotgun metagenomics bypasses culture entirely: total DNA is extracted from a sample, fragmented, and sequenced, and the reads are then assigned to taxa computationally. The resulting data represent the aggregate genomic content of all organisms in the sample: bacteria, archaea, viruses, fungi, and the human host.
Two primary experimental strategies exist:
- 16S rRNA Amplicon Sequencing: PCR amplification of one or more hypervariable regions (commonly V3-V4) of the 16S ribosomal RNA gene using universal bacterial primers. Fast and cheap, but limited to taxonomic classification; function can only be inferred indirectly, not measured. Susceptible to primer amplification bias: certain taxa are systematically under-amplified.
- Shotgun Metagenomics: Whole-genome sequencing of extracted community DNA. Enables both taxonomic profiling and functional gene annotation. More expensive; requires greater sequencing depth to detect low-abundance taxa.
Taxonomic Classification: Two Competing Paradigms
Read-based classifiers assign taxonomy to individual sequenced reads.
k-mer Lowest Common Ancestor (LCA) Classifiers (Kraken2, Bracken): Each k-mer in a read is looked up in a pre-built database that maps every k-mer to the LCA of all reference genomes containing it. Kraken2 then assigns the read to the taxon at the end of the best-supported root-to-leaf path through the taxonomy, weighting each taxon by its k-mer hits. Classification is extremely fast (on the order of millions of reads per minute). Bracken applies Bayesian re-estimation to Kraken2’s read counts to produce abundance estimates at a chosen taxonomic level. The critical limitation: unclassified reads represent the “dark matter”, which includes sequences from novel species without reference genomes.
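The lookup-and-assign logic can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the six-node taxonomy, the three short "genomes", the value of k, and the function names are all hypothetical, and the code is not Kraken2's implementation (which uses minimizers and a compact hash table). The read-assignment step follows the weighted root-to-leaf path idea used by Kraken-style classifiers, with a simplified tie rule.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent ("root" has no parent).
PARENT = {
    "root": None,
    "Bacteroides": "root",
    "B_fragilis": "Bacteroides",
    "B_ovatus": "Bacteroides",
    "Escherichia": "root",
    "E_coli": "Escherichia",
}

def lineage(taxon):
    """Return the path from a taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    return "root"

def build_index(genomes, k):
    """Map each k-mer to the LCA of every genome that contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index

def classify(read, index, k):
    """Assign a read to the end of the best-supported root-to-leaf path."""
    kmers = (read[i:i + k] for i in range(len(read) - k + 1))
    hits = Counter(index[km] for km in kmers if km in index)
    if not hits:
        return "unclassified"
    # Score each hit taxon by its own hits plus the hits on its ancestors.
    score = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    # Simplification: ties go to whichever tied taxon was hit first.
    return max(score, key=score.get)

# Illustrative sequences only; real genomes are millions of bases long.
GENOMES = {
    "B_fragilis": "ACGTACGGA",
    "B_ovatus": "ACGTACCTT",
    "E_coli": "TTGACCAGT",
}
INDEX = build_index(GENOMES, k=4)

print(classify("GTACGGA", INDEX, k=4))  # B_fragilis
print(classify("ACGTAC", INDEX, k=4))   # Bacteroides
print(classify("GGGGGGG", INDEX, k=4))  # unclassified
```

The second read contains only k-mers shared by both Bacteroides genomes, so it can be placed no lower than the genus. This is the kind of genus-level read that Bracken redistributes when estimating species-level abundance.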
Marker-gene Profiling (MetaPhlAn4): Rather than classifying all reads, MetaPhlAn4 maps reads only to a database of clade-specific marker genes. Reads from the human host or from dietary contaminants find no marker to map to and therefore do not distort the profile. The resulting relative abundance profiles are compositional by construction: they sum to 100% across all detected species.
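The normalization that makes such a profile sum to 100% can be sketched in a few lines. This is a deliberately simplified illustration, not MetaPhlAn4's algorithm: it assumes a plain mean of per-marker coverage (the real tool uses a more robust estimator), and the function name, input layout, and species labels are hypothetical.

```python
def relative_abundance(marker_hits):
    """Convert per-marker read counts into percent relative abundance.

    marker_hits maps a species name to a list of
    (reads_mapped, marker_length_bp) tuples, one tuple per marker gene.
    """
    coverage = {}
    for species, markers in marker_hits.items():
        # Length-normalize each marker, then average across markers.
        per_marker = [reads / length for reads, length in markers]
        coverage[species] = sum(per_marker) / len(per_marker)
    total = sum(coverage.values())
    # Closure: divide by the total so the profile sums to 100%.
    return {s: 100.0 * c / total for s, c in coverage.items()}

# Hypothetical counts for three species.
profile = relative_abundance({
    "species_A": [(100, 1000), (300, 1000)],
    "species_B": [(50, 500)],
    "species_C": [(100, 1000)],
})
for species, pct in profile.items():
    print(species, round(pct, 1))
```

Because of the final division, the output carries no information about total microbial load: doubling every read count gives exactly the same profile.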
Metagenomic Assembly and MAG Reconstruction
When reference-free analysis is required, reads are assembled de novo into contigs. Two strategies:
- Single-sample assembly: Assembles reads from one sample independently (MEGAHIT, metaSPAdes). Produces clean species-level contigs but misses low-abundance taxa that lack sufficient read depth.
- Co-assembly: Pools reads from multiple samples before assembly. Recovers rare organisms across samples, but risks generating chimeric contigs if closely related strains from different samples are incorrectly merged.
Assembled contigs are binned into Metagenome-Assembled Genomes (MAGs) using differential coverage (contigs from the same genome show the same pattern of depth across samples) and tetranucleotide frequency (contigs from the same genome share a compositional signature). MAG quality is validated using CheckM2, which predicts completeness and contamination with machine learning models trained on genome annotations; its predecessor, CheckM, relied on the presence and copy number of lineage-specific single-copy marker genes.
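The compositional signal used in binning can be sketched as a 256-element tetranucleotide frequency vector per contig. This is a minimal illustration under stated assumptions: the function names are hypothetical, reverse-complement tetramers are not merged (production binners usually merge them), and real binners combine this signal with coverage across samples rather than using it alone.

```python
import math
from collections import Counter
from itertools import product

# All 256 possible tetramers over A, C, G, T, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetranucleotide_profile(contig):
    """Return tetramer frequencies for a contig as a 256-element list."""
    contig = contig.upper()
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    # Windows containing N or other ambiguity codes are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]

def euclidean(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy contigs; real contigs used for binning are typically kilobases long.
contig_a = "ACGTACGTACGTACGTACGT"
contig_b = "ACGTACGTACGAACGTACGT"
contig_c = "GGGCCCGGGCCCGGGCCCGG"

pa, pb, pc = (tetranucleotide_profile(c) for c in (contig_a, contig_b, contig_c))
print(euclidean(pa, pb) < euclidean(pa, pc))  # True
```

Contigs whose profiles lie close together, and whose coverage co-varies across samples, are candidates for the same bin.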
Functional Annotation
HUMAnN3 maps quality-controlled reads first against a nucleotide pangenome database (ChocoPhlAn), then aligns the reads that remain unmapped by translated search against a protein database (UniRef90). The outputs are gene family abundances (in reads per kilobase, RPK) and MetaCyc pathway abundances and coverages. This enables comparison of microbial community functional capacity across samples.
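The RPK unit is simple arithmetic: reads assigned to a gene family divided by the gene length in kilobases. The sketch below shows that calculation and a sum-to-one renormalization so that samples of different depth can be compared. The function names and the gene family labels are hypothetical, and this is not HUMAnN3's own code.

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count normalized by gene length."""
    return read_count * 1000.0 / gene_length_bp

def renormalize(rpk_table):
    """Scale RPK values within one sample so they sum to 1."""
    total = sum(rpk_table.values())
    return {gene: value / total for gene, value in rpk_table.items()}

# Hypothetical gene families: (reads assigned, gene length in bp).
sample = {"family_1": (150, 1500), "family_2": (30, 600)}
rpk_table = {g: rpk(reads, length) for g, (reads, length) in sample.items()}
print(rpk_table)  # {'family_1': 100.0, 'family_2': 50.0}

relab = renormalize(rpk_table)
print({g: round(v, 3) for g, v in relab.items()})
```

Without the length normalization, a long gene would appear more abundant than a short one simply because it offers more positions for reads to land on.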
Metatranscriptomics: Total RNA shotgun sequencing measures active transcription in a community. Ribosomal RNA depletion is required to enrich for mRNA. Metatranscriptomics integrated with metagenomics reveals which genes are expressed, not just which genes are present.
Compositional Statistics
Microbiome data is compositional: sequencing produces relative, not absolute, abundances. The sum of all taxon proportions is always 1 (or 100%), so the components are not free to vary independently. Applying standard statistical tests (t-test, Wilcoxon) directly to relative abundances ignores this constraint, which defines the geometry of the Aitchison simplex, and can produce spurious results.
Log-ratio transformations project compositional data into Euclidean space where standard linear statistics are valid:
- Centered Log-Ratio (CLR): \(\text{CLR}(x_i) = \ln\left(\frac{x_i}{g(\mathbf{x})}\right)\), where \(g(\mathbf{x})\) is the geometric mean of all components.
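A minimal sketch of the CLR formula follows. The pseudocount is an assumption added here because the logarithm of zero is undefined and microbiome count tables are full of zeros; the value 0.5 is illustrative only, and zero handling is a modeling choice in its own right.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    Subtracting the mean of the logs is the same as dividing each
    component by the geometric mean before taking the log.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

sample = [120, 30, 0, 850]
transformed = clr(sample)
print([round(v, 3) for v in transformed])

# CLR values sum to zero, and without a pseudocount they are
# unchanged when the whole sample is rescaled (sequenced deeper).
shallow = clr([1, 2, 4], pseudocount=0)
deep = clr([10, 20, 40], pseudocount=0)
print(all(abs(a - b) < 1e-12 for a, b in zip(shallow, deep)))  # True
```

The scale invariance in the last lines is the point of the transform: the result depends on the ratios between taxa, not on how many reads the sample happened to receive.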
Compositionally robust differential abundance tools:
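- ALDEx2: Draws Monte Carlo samples from a Dirichlet distribution fitted to each sample’s counts, CLR-transforms every draw, and carries the resulting uncertainty through to differential abundance testing.
- ANCOM-BC: Models the observed log abundance of each taxon as its log absolute abundance plus a sample-specific bias (the sampling fraction), estimates and removes that bias, and tests each taxon with a W statistic (the bias-corrected coefficient divided by its standard error). The original ANCOM, by contrast, defines W as the number of pairwise log-ratio comparisons in which a taxon is significant.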
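The Monte Carlo idea behind ALDEx2 can be sketched with the standard library alone. This is an illustration of the technique, not ALDEx2's code: the function name is hypothetical, the prior of 0.5 and the 128 instances are illustrative assumptions, and the Dirichlet draw is built from independent gamma variates normalized to sum to 1.

```python
import math
import random

def dirichlet_clr_instances(counts, n_instances=128, prior=0.5, seed=0):
    """Draw Dirichlet Monte Carlo instances of one sample and CLR each.

    Each instance is one plausible set of proportions given the observed
    counts; taxa with low counts vary more between instances.
    """
    rng = random.Random(seed)
    instances = []
    for _ in range(n_instances):
        # Dirichlet(counts + prior) via normalized gamma variates.
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        logs = [math.log(g / total) for g in gammas]
        mean_log = sum(logs) / len(logs)
        instances.append([value - mean_log for value in logs])
    return instances

counts = [0, 10, 1000]
instances = dirichlet_clr_instances(counts)

def spread(values):
    """Sample standard deviation across instances."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

# Compare the raw log-proportion uncertainty of a rare and an abundant taxon
# through their CLR difference from the middle taxon.
rare = [inst[0] - inst[1] for inst in instances]
abundant = [inst[2] - inst[1] for inst in instances]
print(spread(rare) > spread(abundant))
```

A differential abundance test would then be run on every instance and the results averaged, so that a taxon observed only a handful of times cannot produce a confident call.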
Biological Interpretation
The human microbiome is not a single entity; it varies by body site (gut, skin, oral, vaginal), host genetics, diet, medication history, and disease state. A microbiome study reporting “healthy vs. disease” differences without controlling for antibiotic use, dietary fiber intake, geographic origin, and sequencing batch is not interpretable. The effect of dietary fiber on the gut Bacteroidetes:Firmicutes ratio can rival or exceed the difference between healthy and disease states reported in inflammatory bowel disease studies.
The Reviewer Mindset: Never apply standard statistical tests to raw relative abundance columns. If Bacteroides increases as a proportion, every other taxon’s relative abundance decreases automatically, even if their absolute cell counts are unchanged. Alpha diversity (Shannon entropy, species richness) and beta diversity (Bray-Curtis dissimilarity, UniFrac distance) capture distinct signals: alpha diversity describes within-sample diversity (richness and evenness); beta diversity describes between-sample compositional distance. These must not be conflated. A single MAG does not constitute evidence of an organism’s functional role in the community; its ecological role requires metatranscriptomic confirmation and ideally a germ-free mouse colonization experiment.
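A short worked example makes both points concrete. The cell counts are invented for illustration, and the function names are hypothetical; the Shannon and Bray-Curtis formulas are the standard ones.

```python
import math

def closure(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def shannon(props):
    """Shannon entropy (natural log): a within-sample, alpha metric."""
    return -sum(p * math.log(p) for p in props if p > 0)

def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: a between-sample, beta metric."""
    return sum(abs(a - b) for a, b in zip(p, q)) / (sum(p) + sum(q))

# Invented absolute cell counts for [Bacteroides, taxon_2, taxon_3].
before_abs = [100, 100, 100]
after_abs = [400, 100, 100]   # only Bacteroides changes

before = closure(before_abs)
after = closure(after_abs)

print([round(p, 3) for p in before])  # [0.333, 0.333, 0.333]
print([round(p, 3) for p in after])   # [0.667, 0.167, 0.167]

print(round(shannon(before), 3), round(shannon(after), 3))
print(round(bray_curtis(before, after), 3))
```

Taxa 2 and 3 appear to halve in relative abundance although their absolute counts never moved. Shannon entropy is reported once per sample, while Bray-Curtis is reported once per pair of samples, which is why the two cannot stand in for each other.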
Summary and Required Reading
The uncultured majority of the microbial world is accessible only through shotgun metagenomics. The primary analytical challenges are compositional statistics, assembly chimeras, and functional dark matter (unknown protein families). Every microbiome study must demonstrate control for confounders and use compositionally aware statistical methods before its results are interpretable.
Required Reading:
Quince et al.: “Shotgun metagenomics, from sampling to analysis” (Nature Biotechnology, 2017).
Gloor et al.: “Microbiome datasets are compositional: and this is not optional” (Frontiers in Microbiology, 2017).
Chklovski et al.: “CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning” (Nature Methods, 2023).