Chapter 9: De Novo Assembly and Graph Algorithms
Johnson’s First Principle: The Shortest Superstring Problem
Reconstructing a genome from fragments is the Shortest Common Superstring Problem — finding the shortest string that contains every read as a substring. This problem is NP-hard. When reads are shorter than repetitive elements, there is no unique mathematical solution regardless of algorithmic sophistication. Assembly is always an approximate reconstruction, and its quality is bounded by the relationship between read length and repeat length.
Core Concepts
The Repeat Trap
The human genome is over 50% repetitive: Alu elements (~300 bp, ~1M copies), LINE1 (~6 kb, ~500K copies), and segmental duplications (10-500 kb). If a read is shorter than a repeat, the assembler cannot determine whether it came from copy A or copy B. Consider the genome A—REPEAT—B and C—REPEAT—D. Reads from the repeat are identical regardless of origin; the assembly graph branches into two diverging paths with no information to resolve which is correct:
\[\text{Repeat length} > \text{read length} \implies \text{assembly graph breaks at repeat boundaries}\]
This is an information-theoretic limit, not an algorithmic weakness. No algorithmic cleverness can recover the correct path across a repeat when the reads are shorter than the repeat unit. The only solution is longer reads or mate-pair information that bridges the repeat.
From Hamiltonian to Eulerian: Why k-mers Change the Problem
Two graph formulations dominate assembly, and the difference between them explains the revolution in algorithmic efficiency.
Overlap-Layout-Consensus (OLC) models each read as a node, with edges connecting reads that overlap by a minimum number of bases. The genome is the Hamiltonian path — visiting each node exactly once — which is NP-complete (\(O(N!)\) in the number of reads). For 10 billion short reads, this is computationally impossible. OLC was viable only for Sanger-era projects with thousands of reads.
De Bruijn graphs transform the problem by fragmenting all reads into k-mers (substrings of length \(k\)) and using k-mers as nodes. This changes the complexity fundamentally: the node vocabulary has a fixed upper bound (\(4^k\) possible k-mers), regardless of how many reads are sequenced. The genome is now an Eulerian path — traversing each edge exactly once — solvable in \(O(E)\) time.
Concrete example. Genome = GATTACA with k=3. The k-mers are GAT, ATT, TTA, TAC, ACA. The De Bruijn graph represents each (k-1)-mer as a node and each k-mer as a directed edge:
Nodes (2-mers): GA AT TT TA AC CA
Edges (3-mers): GA→AT AT→TT TT→TA TA→AC AC→CA
The Eulerian path is GA→AT→TT→TA→AC→CA. Reconstructing the sequence: start node GA (the first k-1 bases), then append the last character of each edge in path order: G + A + T + T + A + C + A = GATTACA. The genome is recovered.
Eulerian path conditions. For a graph to contain an Eulerian path, node degrees must satisfy: one node has out-degree = in-degree + 1 (start), one node has in-degree = out-degree + 1 (end), and all other nodes are balanced. In the GATTACA graph, GA has out=1, in=0 (start) and CA has in=1, out=0 (end). These conditions are checked in \(O(E)\) time.
The NP-hard Hamiltonian problem becomes tractable here because k-mers decouple the problem from read count. OLC has as many nodes as reads (10 billion), and checking all orderings is \(O(N!)\). De Bruijn has at most \(4^k\) possible nodes (theoretically \(4^{31} \approx 4.6 \times 10^{18}\), but typically millions in practice), and Eulerian path is \(O(E)\). The tradeoff is memory: a human genome de Bruijn graph with k=31 stores billions of k-mer occurrences and requires 100-300 GB of RAM.
# De Bruijn graph from k-mers
def build_debruijn(reads, k=31):
graph = {}
for read in reads:
for i in range(len(read) - k + 1):
kmer = read[i:i+k]
left = kmer[:-1]
right = kmer[1:]
graph.setdefault(left, set()).add(right)
return graph
The k-mer tradeoff. The choice of \(k\) determines how the graph balances connectivity against repeat resolution. Small \(k\) (21) means each k-mer is shorter and more likely to be shared between reads, creating a more connected graph. But shorter k-mers are also more likely to fall within repeats, collapsing repeat copies into the same node. Large \(k\) (63) distinguishes repeat copies more effectively because the k-mer extends beyond the repeat unit, but the stricter match requirement produces a more fragmented graph in low-coverage regions. Many assemblers address this by using multiple \(k\) values (SPAdes uses a k-mer sweep from 21 to 127) or by building a multi-sized de Bruijn graph.
Graph Cleaning and Resolution
A raw de Bruijn graph contains artifacts from sequencing errors and biological variation:
- Tips: Dead-end paths created by sequencing errors at read ends. A single erroneous base at the end of a read creates a k-mer that does not connect to the rest of the graph. Tips are removed if they are shorter than \(2k\) and have no supporting coverage.
- Bubbles: Alternative paths caused by heterozygous variants, alternative splicing, or sequencing errors. Two paths diverge and reconverge within a short distance. Bubbles are resolved by keeping the higher-coverage path and removing the lower.
- Bulges: Compressed bubbles — short alternative paths caused by a single nucleotide variant. Like bubbles but shorter; removed by the same heuristic.
After cleaning, the graph is resolved into contigs by finding unambiguous paths — nodes with exactly one incoming and one outgoing edge. Branch points (nodes with multiple in/out edges) indicate repeats or unresolved variation and terminate contigs.
The density of artifacts in the raw graph is a direct diagnostic of sequencing error rate. A high-quality dataset (HiFi, PCR-free Illumina) produces a clean graph with few tips; a noisy dataset produces a dense artifact graph that requires aggressive cleaning, which in turn reduces contiguity by removing true low-coverage regions along with errors. The ratio of tips removed to contigs produced is a useful quality metric for the input data.
Long-Read Assembly
Long reads (PacBio HiFi 10-25 kb, ONT 10-100 kb) solve the repeat problem that short reads cannot. A read spanning an entire Alu element (~300 bp) or LINE1 element (~6 kb) determines unambiguously which flanking sequence it connects to. Modern long-read assemblers have largely replaced short-read assembly for eukaryotic genomes.
- Hifiasm (HiFi-only): Produces haplotype-resolved assemblies by using Hi-C or trio information to phase variants. The standard for human genome assembly; produces fully phased assemblies in hours.
- Flye (ONT + HiFi): Uses a repeat graph construction that handles the high error rates of raw ONT reads. Preferred for metagenomic assembly and bacterial genomes where coverage depth varies.
- Canu (legacy PacBio CLR): OLC-based with error correction before assembly. Largely superseded by Hifiasm and Flye for most applications.
Validation Metrics
| Metric | What It Measures | Caveat |
|---|---|---|
| N50 | Contiguity: 50% of assembly in contigs of this size or larger | Does not measure accuracy |
| L50 | Number of contigs to reach N50 | High L50 = fragmented |
| BUSCO | Presence of conserved single-copy orthologs | Measures gene content, not genome-wide correctness |
| QV / Assembly quality | Phred-scaled consensus error rate (via merqury) | Q40 = 99.99% accurate at consensus level |
N50 is the most widely reported — and most widely misused — assembly metric. High N50 does not mean high quality. A contiguous but systematically wrong assembly (mis-joined repeats, collapsed segmental duplications) can achieve high N50 while having low biological accuracy. BUSCO scores measure conserved gene content, not structural correctness. QV estimation via k-mer comparison (merqury) provides the most honest accuracy assessment by comparing k-mer frequencies between the assembly and raw reads.
Genome size estimation. Before assembly, the k-mer frequency spectrum (histogram of k-mer occurrence counts) provides a rapid genome size estimate: \(G = \frac{\text{total distinct k-mers}}{\text{peak k-mer coverage}}\), where the peak in the distribution corresponds to the haploid sequencing depth. If the estimated size deviates substantially from the expected genome size, the data may be contaminated, have unexpected ploidy, or suffer from systematic sequencing bias — catching this before assembly saves hours of failed computation.
Metagenomic Binning
For metagenomic data, the assembly produces a mixture of contigs from multiple organisms. Binning groups contigs into metagenome-assembled genomes (MAGs) using two orthogonal signals:
- Composition: Tetranucleotide frequency (TNF) profiles — each genome has a characteristic oligonucleotide signature that distinguishes it from other genomes in the sample.
- Coverage: Differential abundance across samples — contigs from the same genome have correlated coverage profiles across multiple samples or conditions.
Tools: CONCOCT, MetaBAT2, MaxBin2. Validation: CheckM assesses MAG quality via lineage-specific single-copy marker genes. An “essentially complete” MAG requires >90% completeness and <5% contamination. A MAG is a population consensus, not a single genome — it represents the average genome of all cells assigned to that taxonomic bin, and within-bin strain diversity is masked.
Biological Interpretation
A genome assembly is a statistical inference, not a ground truth. The N50 metric is widely misused — a high N50 can mask systematic misassembly (collapsed duplications, mis-joined repeats). BUSCO scores measure conserved gene content, not genome-wide structural correctness. A 95% BUSCO score means 5% of conserved genes are missing or fragmented, but it says nothing about whether the remaining 95% are correctly assembled.
Short-read-only assemblies of eukaryotic genomes are fundamentally incomplete. The telomere-to-telomere (T2T) human genome required PacBio HiFi (10-25 kb CCS reads) + ultra-long ONT (>100 kb reads) + manual curation with optical maps. Any eukaryotic assembly derived from short reads alone should be treated as a draft.
For metagenomic binning, completeness and contamination are competing objectives: aggressive binning produces fewer, cleaner MAGs but misses low-abundance organisms; lenient binning captures more diversity at the cost of contaminated bins. A MAG is always a consensus sequence across a population of cells that share the same taxonomic bin, and strain-level variation within the bin is invisible to the consensus.
Current Landscape (Q2 2026)
- T2T consortium completed the first truly complete human genome, revealing ~200 Mb of previously missing sequence including centromeres, telomeres, and segmental duplications that short-read-only assemblies could not resolve.
- Human Pangenome Reference Consortium released the first draft pangenome, replacing the single linear reference with a graph genome representing population diversity across 47 phased assemblies.
- Verkko (hybrid long-read + trio binning) produces haplotype-resolved assemblies for eukaryotic genomes, resolving both haplotypes without parental data when Hi-C data is available.
- Telomere-to-telomere assembly is becoming routine for bacteria and model organisms as HiFi and ultra-long ONT costs decrease.
Summary and Required Reading
- Assembly is NP-hard with short reads — repeats longer than the read length create information-theoretic ambiguities that no algorithm can resolve.
- De Bruijn graphs (k-mers) transform the Hamiltonian (NP-complete) path problem into an Eulerian (\(O(E)\)) path problem, enabling genome-scale assembly.
- k-mer choice is the critical parameter — small \(k\) maximizes connectivity, large \(k\) resolves repeats. Multiple k-mer assemblies address both needs.
- Long reads bridge repeats — Hifiasm (HiFi) and Flye (ONT) produce near-complete eukaryotic assemblies. Short-read-only assemblies are drafts.
- N50 is a contiguity metric, not an accuracy metric — BUSCO for gene content, merqury QV for consensus accuracy.
- Metagenomic binning groups contigs by composition and coverage into MAGs, which are population consensuses, not individual genomes.
Required Reading
- Compeau, Pevzner & Tesler: “How to apply de Bruijn graphs to genome assembly” (Nature Biotechnology, 2011).
- Bankevich et al.: “SPAdes: a new genome assembly algorithm” (Journal of Computational Biology, 2012).
Johnson’s Rule: Assembly is not a solved problem. It is always an approximate reconstruction.