Chapter 9: De Novo Assembly and Graph Algorithms

Published

June 5, 2026

Modified

June 19, 2026

Johnson’s First Principle: Assembly is Not a Solved Problem; It is an Approximate Reconstruction

Reconstructing a genome from fragments is the Shortest Common Superstring Problem, finding the shortest string that contains every read as a substring. This problem is NP-hard. When reads are shorter than repetitive elements, there is no unique mathematical solution regardless of algorithmic sophistication. Assembly is always an approximate reconstruction, and its quality is bounded by the relationship between read length and repeat length.

Core Concepts

The Repeat Trap

The human genome is over 50% repetitive: Alu elements (~300 bp, ~1M copies), LINE1 (~6 kb, ~500K copies), and segmental duplications (10-500 kb). If a read is shorter than a repeat, the assembler cannot determine whether it came from copy A or copy B. Consider the genome A—REPEAT—B and C—REPEAT—D. Reads from the repeat are identical regardless of origin; the assembly graph branches into two diverging paths with no information to resolve which is correct:

\[\text{Repeat length} > \text{read length} \implies \text{assembly graph breaks at repeat boundaries}\]

This is an information-theoretic limit, not an algorithmic weakness. No algorithmic cleverness can recover the correct path across a repeat when the reads are shorter than the repeat unit. The only solution is longer reads or mate-pair information that bridges the repeat.

From Hamiltonian to Eulerian: Why k-mers Change the Problem

Two graph formulations dominate assembly, and the difference between them explains the revolution in algorithmic efficiency.

Overlap-Layout-Consensus (OLC) models each read as a node, with edges connecting reads that overlap by a minimum number of bases. The genome is the Hamiltonian path: visiting each node exactly once, which is NP-complete (\(O(N!)\) in the number of reads). For 10 billion short reads, this is computationally impossible. OLC was viable only for Sanger-era projects with thousands of reads.

De Bruijn graphs transform the problem by fragmenting all reads into k-mers (substrings of length \(k\)) and using k-mers as nodes. This changes the complexity fundamentally: the node vocabulary has a fixed upper bound (\(4^k\) possible k-mers), regardless of how many reads are sequenced. The genome is now an Eulerian path: traversing each edge exactly once; solvable in \(O(E)\) time.

Concrete example. Genome = GATTACA with k=3. The k-mers are GAT, ATT, TTA, TAC, ACA. The De Bruijn graph represents each (k-1)-mer as a node and each k-mer as a directed edge:

Nodes (2-mers): GA  AT  TT  TA  AC  CA
Edges (3-mers): GA→AT  AT→TT  TT→TA  TA→AC  AC→CA

The Eulerian path is GA→AT→TT→TA→AC→CA. Reconstructing the sequence: start node GA (the first k-1 bases), then append the last character of each edge in path order: G + A + T + T + A + C + A = GATTACA. The genome is recovered.

Eulerian path conditions. For a graph to contain an Eulerian path, node degrees must satisfy: one node has out-degree = in-degree + 1 (start), one node has in-degree = out-degree + 1 (end), and all other nodes are balanced. In the GATTACA graph, GA has out=1, in=0 (start) and CA has in=1, out=0 (end). These conditions are checked in \(O(E)\) time.

The NP-hard Hamiltonian problem becomes tractable here because k-mers decouple the problem from read count. OLC has as many nodes as reads (10 billion), and checking all orderings is \(O(N!)\). De Bruijn has at most \(4^k\) possible nodes (theoretically \(4^{31} \approx 4.6 \times 10^{18}\), but typically millions in practice), and Eulerian path is \(O(E)\). The tradeoff is memory: a human genome de Bruijn graph with k=31 stores billions of k-mer occurrences and requires 100-300 GB of RAM.

# De Bruijn graph from k-mers
def build_debruijn(reads, k=31):
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i+k]
            left = kmer[:-1]
            right = kmer[1:]
            graph.setdefault(left, set()).add(right)
    return graph

The k-mer tradeoff. The choice of \(k\) determines how the graph balances connectivity against repeat resolution. Small \(k\) (21) means each k-mer is shorter and more likely to be shared between reads, creating a more connected graph. But shorter k-mers are also more likely to fall within repeats, collapsing repeat copies into the same node. Large \(k\) (63) distinguishes repeat copies more effectively because the k-mer extends beyond the repeat unit, but the stricter match requirement produces a more fragmented graph in low-coverage regions. Many assemblers address this by using multiple \(k\) values (SPAdes uses a k-mer sweep from 21 to 127) or by building a multi-sized de Bruijn graph.

Graph Cleaning and Resolution

A raw de Bruijn graph contains artifacts from sequencing errors and biological variation:

Tips: Dead-end paths created by sequencing errors at read ends. A single erroneous base at the end of a read creates a k-mer that does not connect to the rest of the graph. Tips are removed if they are shorter than \(2k\) and have no supporting coverage.
Bubbles: Alternative paths caused by heterozygous variants, alternative splicing, or sequencing errors. Two paths diverge and reconverge within a short distance. Bubbles are resolved by keeping the higher-coverage path and removing the lower.
Bulges: Compressed bubbles; short alternative paths caused by a single nucleotide variant. Like bubbles but shorter; removed by the same heuristic.

After cleaning, the graph is resolved into contigs by finding unambiguous paths; nodes with exactly one incoming and one outgoing edge. Branch points (nodes with multiple in/out edges) indicate repeats or unresolved variation and terminate contigs.

The density of artifacts in the raw graph is a direct diagnostic of sequencing error rate. A high-quality dataset (HiFi, PCR-free Illumina) produces a clean graph with few tips; a noisy dataset produces a dense artifact graph that requires aggressive cleaning, which in turn reduces contiguity by removing true low-coverage regions along with errors. The ratio of tips removed to contigs produced is a useful quality metric for the input data.

Long-Read Assembly

Long reads (PacBio HiFi 10-25 kb, ONT 10-100 kb) solve the repeat problem that short reads cannot. A read spanning an entire Alu element (~300 bp) or LINE1 element (~6 kb) determines unambiguously which flanking sequence it connects to. Modern long-read assemblers have largely replaced short-read assembly for eukaryotic genomes.

Hifiasm (HiFi-only): Produces haplotype-resolved assemblies by using Hi-C or trio information to phase variants. The standard for human genome assembly; produces fully phased assemblies in hours.
Flye (ONT + HiFi): Uses a repeat graph construction that handles the high error rates of raw ONT reads. Preferred for metagenomic assembly and bacterial genomes where coverage depth varies.
Canu (legacy PacBio CLR): OLC-based with error correction before assembly. Largely superseded by Hifiasm and Flye for most applications.

Validation Metrics

Metric	What It Measures	Caveat
N50	Contiguity: 50% of assembly in contigs of this size or larger	Does not measure accuracy
L50	Number of contigs to reach N50	High L50 = fragmented
BUSCO	Presence of conserved single-copy orthologs	Measures gene content, not genome-wide correctness
QV / Assembly quality	Phred-scaled consensus error rate (via merqury)	Q40 = 99.99% accurate at consensus level

N50 is the most widely reported, and most widely misused; assembly metric. High N50 does not mean high quality. A contiguous but systematically wrong assembly (mis-joined repeats, collapsed segmental duplications) can achieve high N50 while having low biological accuracy. BUSCO scores measure conserved gene content, not structural correctness. QV estimation via k-mer comparison (merqury) provides the most honest accuracy assessment by comparing k-mer frequencies between the assembly and raw reads.

Genome size estimation. Before assembly, the k-mer frequency spectrum (histogram of k-mer occurrence counts) provides a rapid genome size estimate: \(G = \frac{\text{total distinct k-mers}}{\text{peak k-mer coverage}}\), where the peak in the distribution corresponds to the haploid sequencing depth. If the estimated size deviates substantially from the expected genome size, the data may be contaminated, have unexpected ploidy, or suffer from systematic sequencing bias, catching this before assembly saves hours of failed computation.

Metagenomic Binning

For metagenomic data, the assembly produces a mixture of contigs from multiple organisms. Binning groups contigs into metagenome-assembled genomes (MAGs) using two orthogonal signals:

Composition: Tetranucleotide frequency (TNF) profiles; each genome has a characteristic oligonucleotide signature that distinguishes it from other genomes in the sample.
Coverage: Differential abundance across samples; contigs from the same genome have correlated coverage profiles across multiple samples or conditions.

Tools: CONCOCT, MetaBAT2, MaxBin2. Validation: CheckM assesses MAG quality via lineage-specific single-copy marker genes. An “essentially complete” MAG requires >90% completeness and <5% contamination. A MAG is a population consensus, not a single genome; it represents the average genome of all cells assigned to that taxonomic bin, and within-bin strain diversity is masked.

Biological Interpretation

A genome assembly is a statistical inference, not a ground truth. The N50 metric is widely misused; a high N50 can mask systematic misassembly (collapsed duplications, mis-joined repeats). BUSCO scores measure conserved gene content, not genome-wide structural correctness. A 95% BUSCO score means 5% of conserved genes are missing or fragmented, but it says nothing about whether the remaining 95% are correctly assembled.

Short-read-only assemblies of eukaryotic genomes are fundamentally incomplete. The telomere-to-telomere (T2T) human genome required PacBio HiFi (10-25 kb CCS reads) + ultra-long ONT (>100 kb reads) + manual curation with optical maps. Any eukaryotic assembly derived from short reads alone should be treated as a draft.

For metagenomic binning, completeness and contamination are competing objectives: aggressive binning produces fewer, cleaner MAGs but misses low-abundance organisms; lenient binning captures more diversity at the cost of contaminated bins. A MAG is always a consensus sequence across a population of cells that share the same taxonomic bin, and strain-level variation within the bin is invisible to the consensus.

Summary and Required Reading

Assembly is NP-hard with short reads: repeats longer than the read length create information-theoretic ambiguities that no algorithm can resolve.
De Bruijn graphs (k-mers) transform the Hamiltonian (NP-complete) path problem into an Eulerian (\(O(E)\)) path problem, enabling genome-scale assembly.
k-mer choice is the critical parameter: small \(k\) maximizes connectivity, large \(k\) resolves repeats. Multiple k-mer assemblies address both needs.
Long reads bridge repeats: Hifiasm (HiFi) and Flye (ONT) produce near-complete eukaryotic assemblies. Short-read-only assemblies are drafts.
N50 is a contiguity metric, not an accuracy metric: BUSCO for gene content, merqury QV for consensus accuracy.
Metagenomic binning groups contigs by composition and coverage into MAGs, which are population consensuses, not individual genomes.

Required Reading

Compeau, Pevzner & Tesler: “How to apply de Bruijn graphs to genome assembly” (Nature Biotechnology, 2011).
Bankevich et al.: “SPAdes: a new genome assembly algorithm” (Journal of Computational Biology, 2012).

Johnson’s Rule: Assembly is not a solved problem. It is always an approximate reconstruction.