Chapter 25: DNA Foundation Models and Genomic AI

Johnson’s First Principle: The Genome is a Language Waiting to Be Read

DNA is a 4-letter code with a grammar shaped by 4 billion years of evolution. The same transformer and state-space architectures that parse human language can learn this grammar from raw sequence — and once learned, the model can predict the functional consequences of any sequence change, anywhere in the genome, without task-specific training.

Chapter 24 introduced foundation models for proteins, where 3D structure provides an interpretable ground truth for evaluating learned representations. DNA foundation models face a harder problem: the functional output of most genomic sequence is not a folded structure but regulation of gene expression — a phenomenon that is tissue-specific, developmental-stage-specific, and mediated through chromatin state and 3D genome architecture (Chapters 12, 19). Where protein models predict structure, DNA models must predict regulatory function across diverse genomic and cellular contexts.

Core Concepts

Why DNA Foundation Models Differ from Protein Models

DNA operates at a vastly larger scale: the human genome spans about 3 billion base pairs, whereas a typical protein is only ~400 amino acids long. DNA models must therefore handle long-range dependencies (enhancers can act on promoters up to ~1 Mb away), non-coding regions (about 98% of the genome), and multiple functional layers (sequence → chromatin → expression → splicing → phenotype).

Protein models (ESM, AlphaFold) predict structure. DNA models predict regulatory function: chromatin accessibility, transcription factor binding, splicing, expression.

Architecture Innovations for Long DNA Sequences

Self-attention in a pure transformer costs O(L²) in time and memory for a sequence of length L, which makes megabase-scale context, let alone whole-genome context, prohibitive. Key architectural innovations:

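Hyena / StripedHyena (used by Evo2): replaces most attention layers with implicit long convolutions and gating. The convolutions can be evaluated with FFTs in O(L log L); StripedHyena interleaves these operators with a small number of attention layers. Enables context windows of up to 1 megabase.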

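Enformer / Borzoi: convolutional + transformer hybrids. Convolution and pooling layers first downsample the sequence, so attention runs over coarse bins rather than single bases. Enformer processes ~200 kb windows to predict epigenomic and CAGE expression tracks; Borzoi extends the input to ~524 kb and predicts RNA-seq coverage tracks.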

Caduceus / MambaDNA: state-space models (SSMs) that process DNA bidirectionally with linear scaling.
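The O(L log L) figure for long-convolution layers comes from evaluating the convolution in the Fourier domain. The following is a minimal single-channel sketch of that idea, not the Hyena or StripedHyena implementation: the filter here is an explicit decaying random vector (real Hyena filters are parameterized implicitly by a small network), the gate is a sigmoid of the input standing in for a learned projection, and all function names are hypothetical.

```python
import numpy as np

def causal_conv_direct(x, h):
    """Causal convolution y[t] = sum_{s<=t} h[s] * x[t-s]; O(L^2) when len(h) == len(x)."""
    return np.convolve(x, h)[: len(x)]

def causal_conv_fft(x, h):
    """Same result via FFT in O(L log L). Zero-padding to 2L avoids circular wrap-around."""
    L = len(x)
    n = 2 * L
    y = np.fft.irfft(np.fft.rfft(x, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]

def gated_long_conv(x, h, gate):
    """Toy Hyena-style step: element-wise gate applied to the long-convolution output."""
    return gate * causal_conv_fft(x, h)

def cost_ratio_attention_vs_fft(L):
    """Rough operation-count ratio L^2 / (L log2 L), ignoring constants."""
    return (L ** 2) / (L * np.log2(L))

rng = np.random.default_rng(0)
L = 4096
x = rng.standard_normal(L)
# Assumption: an explicit filter as long as the input, with exponential decay.
h = rng.standard_normal(L) * np.exp(-np.arange(L) / 512.0)
gate = 1.0 / (1.0 + np.exp(-x))  # stand-in for a learned gating projection

y_direct = causal_conv_direct(x, h)
y_fft = causal_conv_fft(x, h)
max_abs_err = float(np.max(np.abs(y_direct - y_fft)))
y_gated = gated_long_conv(x, h, gate)
ratio_at_1mb = cost_ratio_attention_vs_fft(1_000_000)
```

The two convolution routines agree to floating-point precision, but only the FFT version stays tractable as L approaches a megabase. The ratio helper ignores constant factors, so it indicates the order of the gap rather than a measured speedup.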

Training Paradigm

DNA foundation models are trained with masked language modeling (mask 15% of nucleotides, predict the masked bases) or next-token prediction across large, diverse genomic corpora. The MLM loss follows the same formulation from Chapter 24:

\[\mathcal{L} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}, \theta)\]

where \(\mathcal{M}\) is the set of masked positions and \(x_{\setminus \mathcal{M}}\) is the unmasked context. Unlike protein models, where each position is one of 20 amino acids, each DNA position is one of 4 nucleotides: the per-position classification is easier, but the long-range dependencies the model must capture are far more challenging. Evo2, which uses the next-token objective rather than masking, was trained on 9.3 trillion nucleotides from 128,000 species across all domains of life.
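To make the loss concrete, here is a small sketch that masks 15% of a toy sequence and evaluates the sum over masked positions. The two "models" are placeholders introduced only for illustration (a uniform predictor and a unigram frequency predictor); a real DNA foundation model would replace them with a network conditioned on the full unmasked context. The mask symbol and all function names are assumptions.

```python
import math
import random

NUCLEOTIDES = "ACGT"
MASK = "N"  # hypothetical mask symbol for this sketch

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Replace round(mask_rate * len(seq)) randomly chosen positions with MASK."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(seq)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions

def uniform_model(masked_seq, position):
    """Placeholder predictor: ignores context, assigns P = 1/4 to every base."""
    return {b: 0.25 for b in NUCLEOTIDES}

def unigram_model(masked_seq, position):
    """Placeholder predictor: base frequencies of the unmasked context, add-one smoothed."""
    counts = {b: 1 for b in NUCLEOTIDES}
    for c in masked_seq:
        if c in counts:
            counts[c] += 1
    total = sum(counts.values())
    return {b: counts[b] / total for b in NUCLEOTIDES}

def mlm_loss(seq, masked_seq, positions, model):
    """Negative sum of log P(true base | unmasked context) over the masked positions."""
    return -sum(math.log(model(masked_seq, i)[seq[i]]) for i in positions)

seq = "ATGGCGTACGTTAGCCGATAGGCTTAACGGATCCGTAGCT"  # 40 bases, made up
masked_seq, positions = mask_sequence(seq)
loss_uniform = mlm_loss(seq, masked_seq, positions, uniform_model)
loss_unigram = mlm_loss(seq, masked_seq, positions, unigram_model)
```

The uniform predictor gives a loss of |M| · log 4, which is the baseline any trained model has to beat. With only four classes that baseline is low, so the useful signal lies in how far context pushes the loss below it.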

The training objective implicitly encodes: evolutionary conservation (conserved bases are easier to predict), mutation tolerance, regulatory grammar (transcription factor binding motifs), and codon structure in coding regions.

Phylogenetic breadth vs. species specificity. Models trained across more than a hundred thousand species (Evo2) learn deeply conserved regulatory grammar: the motifs, codon patterns, and structural constraints shared across domains of life. Species-specific models (for example, Nucleotide Transformer variants trained only on human genomes) capture human-specific regulatory syntax but risk overfitting to neutral variation with no functional constraint. The choice is a tradeoff: broad training improves zero-shot prediction on conserved elements (splice sites, essential genes), while species-specific models better capture taxon-restricted regulatory elements (primate-specific enhancers, Alu-derived regulatory sequences). In practice the two are complementary: a broad model for conservation-aware prioritization and a species-specific model for fine-grained regulatory annotation in the target genome.

Key Capabilities

Zero-shot variant effect prediction: given a reference sequence \(x_{\text{ref}}\) and an alternate allele \(x_{\text{alt}}\), the model’s log-likelihood ratio predicts functional impact:

\[\Delta\text{LL} = \log P(x_{\text{alt}} \mid \text{model}) - \log P(x_{\text{ref}} \mid \text{model})\]

A negative \(\Delta\text{LL}\) means the model considers the alternate sequence less “likely” — it deviates from the evolutionary grammar learned across millions of sequences. Evo2 achieves >90% accuracy distinguishing pathogenic from benign BRCA1 variants using this score alone, without any task-specific training — no labeled data, no fine-tuning.
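For an autoregressive model, \(\log P(x)\) is the sum of per-token log-probabilities \(\log P(x_t \mid x_{<t})\) over a window around the variant, so \(\Delta\text{LL}\) is simply a difference of two such sums. The sketch below illustrates the mechanics with a toy order-2 Markov model standing in for a real DNA language model. The class, the training strings, and the variant are all made up for illustration and carry no biological meaning.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

class MarkovLM:
    """Order-k autoregressive model with add-one smoothing (stand-in for a DNA LM)."""

    def __init__(self, k=2):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for s in sequences:
            for i in range(self.k, len(s)):
                self.counts[s[i - self.k:i]][s[i]] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of log P(x_t | previous k bases) over the sequence."""
        ll = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts.get(seq[i - self.k:i], {})
            total = sum(ctx.values()) + len(ALPHABET)
            ll += math.log((ctx.get(seq[i], 0) + 1) / total)
        return ll

def delta_ll(model, ref_seq, pos, alt_base):
    """log P(alt) - log P(ref) for a single-nucleotide substitution at pos (0-based)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model.log_likelihood(alt_seq) - model.log_likelihood(ref_seq)

# Toy corpus in which a TATAAA-like motif recurs; purely illustrative.
training = ["TATAAAGGCTATAAAGGC" * 3, "GGCTATAAAGGCTATAAA" * 3]
model = MarkovLM(k=2).fit(training)

ref = "GGCTATAAAGGC"
score_disrupting = delta_ll(model, ref, pos=5, alt_base="C")  # TATAAA -> TACAAA
score_identity = delta_ll(model, ref, pos=5, alt_base="T")    # alt equals ref
```

The substitution that breaks the recurring motif receives a negative score, and substituting the reference base gives exactly zero. With a real model the same function shape applies, but the window length and the handling of both strands become choices that affect the score.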

Sequence design: autoregressive models can generate novel regulatory sequences, protein-coding genes, and even small genomes (Evo2 generated functional bacteriophage genomes at 16% success rate).
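Autoregressive generation means sampling one nucleotide at a time from the model's conditional distribution and appending it to the context. The sketch below shows that loop with temperature scaling, using a toy k-mer transition table in place of a trained model. The corpus, the function names, and the fallback to a uniform draw for unseen contexts are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_transitions(sequences, k=3):
    """Count which base follows each k-mer context in the training sequences."""
    table = defaultdict(Counter)
    for s in sequences:
        for i in range(k, len(s)):
            table[s[i - k:i]][s[i]] += 1
    return table

def sample_sequence(table, prompt, length, k=3, temperature=1.0, seed=0):
    """Extend prompt to the requested length by sampling the next base repeatedly."""
    if len(prompt) < k:
        raise ValueError("prompt must contain at least k bases")
    rng = random.Random(seed)
    seq = prompt
    while len(seq) < length:
        counts = table.get(seq[-k:])
        if not counts:
            # Unseen context: fall back to a uniform draw (an assumption of this sketch).
            seq += rng.choice("ACGT")
            continue
        bases = list(counts)
        # p ** (1 / T) is equivalent to a softmax over log-probabilities divided by T.
        weights = [counts[b] ** (1.0 / temperature) for b in bases]
        seq += rng.choices(bases, weights=weights, k=1)[0]
    return seq

# Made-up toy corpus of short ORF-like strings.
corpus = [
    "ATGGCCAAGCTGACCGGTTAA",
    "ATGGCTAAACTGACTGGCTGA",
    "ATGGCCAAACTGACCGGCTAG",
]
table = fit_transitions(corpus, k=3)
generated = sample_sequence(table, prompt="ATG", length=30, temperature=0.8)
```

Lower temperatures concentrate sampling on the most frequent continuations, and higher temperatures increase diversity at the cost of plausibility. Generated candidates are hypotheses: they still need computational filtering and experimental testing before any claim of function.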

Contextual learning (2026 discovery): genomic next-token predictors exhibit in-context learning — the ability to perform tasks specified only by the input context, analogous to LLM in-context learning.

Biological Interpretation

A DNA foundation model that predicts a variant as “deleterious” has learned that this nucleotide is conserved across evolution and unlikely to appear by chance. This is a statement about evolutionary constraint, not necessarily about molecular mechanism. A high deleteriousness score means “evolution says this position is important” — it does not tell you which molecular function (binding, splicing, structure) is disrupted.

Zero-shot prediction is powerful but not perfect: models are biased by training data composition (species representation, GC content), and their predictions are less reliable in genomic regions that are poorly represented in the training corpus (centromeres, telomeres, recent duplications).

Current Landscape (Q2 2026)

  • Evo2 (Arc Institute / NVIDIA, 2026, Nature) is the largest fully open biological AI model: 40B parameters, 9.3 trillion tokens, 128K species. Open weights, open training data (OpenGenome2).
  • Nucleotide Transformer v2 (InstaDeep) provides efficient encoding of regulatory DNA with context lengths up to 12 kb.
  • Enformer fine-tuning achieves state-of-the-art eQTL and GWAS variant prioritization by predicting cell-type-specific regulatory effects.
  • Evo2HiC predicts 3D genome contact maps from sequence alone, 500x faster than experimental Hi-C.
  • A 2026 systematic benchmark showed that zero-shot variant effect prediction from DNA foundation models matches or exceeds supervised methods trained on thousands of labeled examples.
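
Benchmarks of this kind typically reduce to a ranking question: does the zero-shot score place pathogenic variants above benign ones? A common summary is the area under the ROC curve (AUROC). The sketch below computes it directly from its pairwise definition. The \(\Delta\text{LL}\) values and labels are invented for illustration and do not come from any model or benchmark.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example values: more negative delta-LL should indicate a more damaging variant.
delta_ll_scores = [-6.1, -4.8, -0.3, -3.9, -0.2, 0.1, -1.5, -0.4]
is_pathogenic = [1, 1, 1, 1, 0, 0, 0, 0]

# Flip the sign so that a higher score means "more likely pathogenic".
deleteriousness = [-d for d in delta_ll_scores]
auc = auroc(deleteriousness, is_pathogenic)
```

AUROC is threshold-free, which suits zero-shot scores that have no natural cutoff. An accuracy figure, by contrast, depends on the chosen threshold and on the class balance of the benchmark, so the two are not interchangeable when comparing methods.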

Summary and Required Reading

  1. DNA foundation models differ from protein models: the genome is 3 Gbp with about 98% non-coding DNA, and models must handle long-range regulatory interactions (enhancers up to ~1 Mb from promoters) and predict regulatory function, not 3D structure.
  2. Architectural innovations (Hyena long convolutions, Mamba state-space layers, Enformer-style convolutional downsampling before attention) sidestep the O(L²) cost of full-resolution attention, enabling context windows from ~200 kb (Enformer) to 1 Mb (Evo2).
  3. Self-supervised language modeling on genomic sequences (masked or next-token prediction) learns evolutionary conservation, mutation tolerance, regulatory grammar, and codon structure from unlabeled data, with training corpora spanning up to ~128,000 species.
  4. Zero-shot variant effect prediction distinguishes pathogenic from benign variants without task-specific training — a statement about evolutionary constraint, not molecular mechanism.
  5. Validation is essential — zero-shot predictions degrade in poorly-represented genomic regions (centromeres, telomeres, recent duplications), and generative DNA models produce functional sequences at low rates (~16% for synthetic bacteriophage genomes).

Required Reading

  • Brixi et al.: “Genome modeling and design across all domains of life with Evo 2” (Nature, 2026)
  • Avsec et al.: “Effective gene expression prediction from sequence by integrating long-range interactions” (Nature Methods, 2021) — Enformer
  • Nguyen et al.: “HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution” (NeurIPS, 2023)
  • Dalla-Torre et al.: “The Nucleotide Transformer: building and evaluating robust foundation models for human genomics” (Nature Methods, 2025)

Johnson’s Rule: A foundation model’s prediction is a hypothesis about evolutionary constraint, not a molecular assay. Validate before acting.