Chapter 24: Foundation Models and Generative AI in Biology
Johnson’s First Principle: Biology is a Language
DNA: a 4-letter alphabet, sequence → function. Proteins: a 20-letter alphabet, sequence → 3D structure → function. The same architectures that model human language (transformers) model biological sequences. The reason transformers work on biology is that biology is combinatorial, hierarchical, and contextual — just like language.
Chapter 23 showed how neural networks learn hierarchical features: first-layer CNN kernels detect short sequence motifs, and deeper layers detect regulatory combinations. Transformers generalize this hierarchical learning to arbitrary-length dependencies: a single attention head can connect a promoter to an enhancer 100 kb away in one step, whereas a CNN needs a very deep or heavily dilated stack before its receptive field spans that distance. This chapter extends deep learning from supervised neural networks to self-supervised foundation models that learn biological grammar from unlabeled data.
Core Concepts
The Transformer Architecture
Self-attention from first principles. Each position in the input projects its embedding into three spaces through learned weight matrices: \(Q = XW_Q\) (query), \(K = XW_K\) (key), \(V = XW_V\) (value). The attention score between position \(i\) and position \(j\) is the dot product \(Q_i \cdot K_j\) — how much should position \(i\) attend to position \(j\)? The full attention matrix is \(QK^T\), scaled and normalized:
\[Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
The softmax converts each row into a probability distribution over positions; multiplying by \(V\) produces a weighted sum of all values, weighted by attention. The Q, K, V projections are the learned parameters that make attention adapt to the task — they transform raw token representations into the spaces where compatibility is measured and information is extracted.
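The attention formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the sizes (`L`, `d_model`, `d_k`) and the random weight matrices are assumptions chosen only to make the shapes visible.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (L, L); each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
L, d_model, d_k = 6, 16, 8             # toy sizes (assumed for illustration)
X = rng.normal(size=(L, d_model))      # stand-in for token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_Q, W_K, W_V)
```

Row \(i\) of `weights` is the probability distribution that position \(i\) places over all positions, and row \(i\) of `output` is the corresponding weighted sum of value vectors.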
The \(\sqrt{d_k}\) scaling. Dot products grow with dimension. With \(d_k = 512\), the dot product of two random vectors whose components have zero mean and unit variance has a standard deviation of \(\sqrt{512} \approx 22.6\). Scores that large push softmax into a near-one-hot regime where gradients vanish. Dividing by \(\sqrt{d_k}\) brings the scores back to roughly unit variance and restores well-behaved gradients (the vanishing gradient problem from Chapter 23 reappears in attention if this scaling is omitted).
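A quick numerical check makes the scaling argument concrete. The sketch below assumes query and key components are independent with zero mean and unit variance; the sample sizes and the choice of 32 keys are arbitrary.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512

# Spread of raw vs. scaled dot products over many random vector pairs.
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)
raw_std = dots.std()                        # close to sqrt(512)
scaled_std = (dots / np.sqrt(d_k)).std()    # close to 1

# Effect on softmax: average of the largest attention weight per query.
queries = rng.normal(size=(200, d_k))
keys = rng.normal(size=(32, d_k))
scores = queries @ keys.T
peak_unscaled = softmax(scores).max(axis=1).mean()
peak_scaled = softmax(scores / np.sqrt(d_k)).max(axis=1).mean()
```

Without scaling, most of each query's attention mass collapses onto a single key; with scaling, the distribution stays spread out enough for gradients to flow to several positions.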
Permutation equivariance and positional encoding. Self-attention alone is order-agnostic: “the cat sat” and “sat cat the” produce the same attention scores, merely with the rows and columns of \(QK^T\) reordered, because the score between any two tokens depends only on their content, not their positions. Positional encoding breaks this symmetry by injecting position-dependent signals. The original transformer added fixed sine/cosine waves of different frequencies to each token embedding; later models learn positional embeddings from data or use relative schemes such as rotary embeddings (used by ESM-2). Any of these lets “cat” at position 2 and “cat” at position 10 attend to different neighbors.
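The order-blindness of raw self-attention, and how positional encoding removes it, can be verified directly. The sketch below uses the fixed sine/cosine scheme of the original transformer; the toy sizes, random embeddings, and the particular permutation are assumptions for illustration.

```python
import numpy as np

def attention_scores(X, W_Q, W_K):
    Q, K = X @ W_Q, X @ W_K
    return Q @ K.T / np.sqrt(K.shape[-1])

def sinusoidal_encoding(length, d_model):
    """Fixed sine/cosine positional encoding (d_model assumed even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
L, d = 5, 8
X = rng.normal(size=(L, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
perm = np.array([2, 0, 4, 1, 3])     # an arbitrary reordering of the tokens

# Without positions: shuffling tokens only shuffles rows and columns.
S = attention_scores(X, W_Q, W_K)
S_perm = attention_scores(X[perm], W_Q, W_K)
equivariant = np.allclose(S_perm, S[np.ix_(perm, perm)])

# With positions added: the same shuffle now changes the scores themselves.
pe = sinusoidal_encoding(L, d)
S_pos = attention_scores(X + pe, W_Q, W_K)
S_pos_perm = attention_scores(X[perm] + pe, W_Q, W_K)
order_matters = not np.allclose(S_pos_perm, S_pos[np.ix_(perm, perm)])
```

Both `equivariant` and `order_matters` evaluate to true: token order is invisible to the scores until a positional signal is mixed into the embeddings.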
Multi-head attention. A single attention head learns one relationship pattern. Eight heads in parallel learn complementary patterns — one head might attend to local syntax, another to long-range dependencies (an enhancer 100 kb from its promoter). Head outputs are concatenated and projected back to the model dimension: \(MultiHead(Q,K,V) = Concat(head_1, \ldots, head_h)W_O\).
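The split, attend, concatenate, project pattern can be sketched as follows. This is a simplified illustration with random weights and no masking, dropout, or biases; `n_heads = 8` and `d_model = 64` are assumed toy values.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """Project, split into heads, attend per head, concatenate, project."""
    L, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (L, d_model) -> (n_heads, L, d_head)
        return M.reshape(L, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, L, L)
    heads = weights @ V                                            # (h, L, d_head)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)          # Concat(head_1..head_h)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
L, d_model, n_heads = 10, 64, 8
X = rng.normal(size=(L, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
```

Each of the eight slices `weights[h]` is an independent \(L \times L\) attention pattern, which is what allows different heads to specialize.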
The residual stream. Every sub-layer is followed by \(x = LayerNorm(x + Sublayer(x))\): a residual connection plus layer normalization. Residual connections (Chapter 23) let gradients bypass layers during backpropagation, enabling 12-96 layer stacks. Layer normalization stabilizes each token’s hidden state to zero mean and unit variance across the hidden dimension.
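A minimal sketch of the residual-plus-normalization step follows. It omits the learned gain and bias that real layer normalization applies after standardizing, and the `tanh` sublayer is only a placeholder for attention or a feed-forward block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standardize each token's hidden vector (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """x = LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 32))   # 4 tokens, hidden size 32
W = rng.normal(size=(32, 32)) * 0.1                # placeholder sublayer weights
y = residual_block(x, lambda h: np.tanh(h @ W))
```

Whatever the scale of the incoming activations, each row of `y` has approximately zero mean and unit variance across the hidden dimension.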
Why transformers beat RNNs. RNNs process tokens sequentially: the hidden state at position \(t\) depends on position \(t-1\), so computation within a sequence cannot be parallelized across positions. Transformers process all tokens simultaneously: the attention matrix is a single tensor operation, fully parallelizable on GPUs. Long-range dependencies are also handled differently: self-attention connects position 1 to position 10,000 in one operation, whereas an RNN must carry the signal through roughly 10,000 sequential steps.
The cost of parallelism is quadratic time and memory. The attention matrix is \(L \times L\), so compute and memory grow as \(O(L^2)\). For a 400-residue protein this is negligible; for a 100 kb DNA sequence tokenized at single-nucleotide resolution it is prohibitive (\(10^{10}\) entries per head per layer). Solutions: linear attention (Performer approximates \(softmax(QK^T)\) with kernel feature maps), sparse attention (BigBird restricts each position to local + random + global positions), and convolution-based alternatives (Hyena replaces attention with implicit convolutions, achieving \(O(L \log L)\)).
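The arithmetic behind the quadratic cost is easy to reproduce. The helper below is a back-of-envelope sketch: it counts only one dense attention matrix, and the default of 2 bytes per entry (half precision) is an assumption.

```python
def attention_matrix_bytes(seq_len, bytes_per_entry=2, n_heads=1):
    """Memory for dense L x L attention matrices (half precision by default)."""
    return seq_len * seq_len * bytes_per_entry * n_heads

protein_bytes = attention_matrix_bytes(400)       # 400-residue protein
dna_bytes = attention_matrix_bytes(100_000)       # 100 kb DNA, one token per base

protein_mb = protein_bytes / 1e6   # 0.32 MB
dna_gb = dna_bytes / 1e9           # 20 GB for a single head in a single layer
```

A 250-fold increase in length produces a 62,500-fold increase in memory, which is why long-sequence DNA models turn to sub-quadratic alternatives.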
Encoder vs. decoder architectures. The original transformer has two stacks serving different purposes, and modern models in biology use three variants. Encoder-only models (ESM) use bidirectional self-attention, in which every token attends to every other token, producing contextualized embeddings ideal for representation learning and classification. Decoder-only models (GPT-style protein generators) use causally masked self-attention, in which each token attends only to itself and previous tokens, making them autoregressive and ideal for sequence generation. Encoder-decoder models use an encoder to produce representations and a decoder with cross-attention (decoder queries attend to encoder keys/values), enabling mapping from one modality to another. AlphaFold2 is only loosely analogous: its Evoformer trunk plays the encoder role and its structure module the decoder role for sequence-to-structure mapping, but it does not use a standard cross-attention decoder. Practical rule: encoder-only for embedding and classification, decoder-only for generation, encoder-decoder for mapping between modalities.
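The difference between bidirectional and decoder-style attention comes down to a mask applied before the softmax. A minimal sketch, using random scores in place of real \(QK^T/\sqrt{d_k}\) values:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(scores, causal=False):
    """Bidirectional (encoder-style) or causally masked (decoder-style) weights."""
    if causal:
        L = scores.shape[0]
        future = np.triu(np.ones((L, L), dtype=bool), k=1)  # True above the diagonal
        scores = np.where(future, -np.inf, scores)          # block attention to the future
    return softmax(scores)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))      # placeholder attention scores
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal=True)
```

In `bidirectional`, every entry is positive. In `causal`, everything above the diagonal is exactly zero, so position \(i\) can draw only on positions up to and including \(i\), which is what makes left-to-right generation possible.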
Self-Supervised Learning
Paradigm shift: traditional ML requires labeled data; self-supervised uses the data itself as the label.
Masked Language Modeling (BERT-style): mask 15% of amino acids in a protein sequence. Train the model to predict the masked residues. The model learns evolutionary constraints, structure, and function without labels.
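The masking step that creates the training signal can be sketched with the standard library alone. The `<mask>` token string and the toy sequence are assumptions for illustration, and the sketch omits BERT's refinement of sometimes substituting a random or unchanged token instead of the mask.

```python
import random

MASK = "<mask>"   # placeholder token name (assumed)

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return model input and hidden labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # the label comes from the data itself
        tokens[pos] = MASK
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"   # arbitrary 40-residue toy sequence
tokens, targets = mask_sequence(seq)
```

The model sees `tokens` and is trained to recover `targets`; no external annotation is involved at any point.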
Auto-Regressive Modeling (GPT-style): predict the next token given previous tokens. Generates novel sequences (protein design).
Why this changed bioinformatics: ESM (Evolutionary Scale Modeling) trained on 250M protein sequences learned the grammar of protein folding without seeing a single structure. ESM embeddings provide dense vector representations capturing structure and function.
Protein Language Models and AlphaFold
ESM-2: inputs amino acid sequence, outputs per-residue embedding vectors. Applications include predicting function from sequence, predicting structure from sequence (ESMFold), and guiding the design of new proteins. (ProteinMPNN, often used alongside these tools, is a separate structure-conditioned model that designs a sequence for a given backbone; it is not built on ESM.) Key result: ESM embeddings capture evolutionary information without a multiple sequence alignment.
AlphaFold2: inputs amino acid sequence + MSA. Architecture: the Evoformer maintains an MSA representation and a pair representation, a 2D tensor \(z_{ij}\) for every residue pair \((i,j)\) that encodes predicted distance and orientation relationships. The Evoformer applies attention along MSA rows (within one sequence, across residue positions: which positions are coupled?) and along MSA columns (at one position, across sequences: how is this position conserved or varied?), then updates the pair representation through triangular attention and triangular multiplicative updates that encourage geometric consistency: if residue A is close to B and B is close to C, then the A-C distance is bounded by the triangle inequality. The structure module converts the single (first-row) and pair representations to 3D coordinates using Invariant Point Attention (IPA), which operates on 3D points as well as abstract embeddings: the attention score between two residues depends in part on the distance between points placed in their predicted local frames. A recycling mechanism feeds the outputs back as inputs for several iterations (three by default in the published model, with more sometimes used) to progressively refine predictions. Accuracy: near-experimental for many single-domain proteins (median backbone RMSD around 1 Å in CASP14). Limitations: multimer complexes, conformational dynamics, disordered regions.
AlphaFold’s impact: 200M+ protein structures predicted, covering nearly all catalogued protein sequences in UniProt. Drug discovery applications in target structure identification and virtual screening.
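The core of a triangular multiplicative update can be sketched with a single `einsum`. This is a deliberately stripped-down illustration of the "outgoing edges" idea, in which the update to pair \((i,j)\) sums over every third residue \(k\); it omits the layer normalization, gating, and output projection of the published AlphaFold2 module, and the weight matrices `W_a` and `W_b` are random placeholders.

```python
import numpy as np

def triangle_update_outgoing(z, W_a, W_b):
    """Simplified update: z_ij += sum_k a_ik * b_jk (elementwise over channels)."""
    a = z @ W_a                                  # (L, L, c) projection of the pair tensor
    b = z @ W_b                                  # (L, L, c)
    update = np.einsum("ikc,jkc->ijc", a, b)     # combine edges i-k and j-k for every k
    return z + update

rng = np.random.default_rng(0)
L, c = 6, 4                                      # toy sizes (assumed)
z = rng.normal(size=(L, L, c))                   # pair representation z_ij
W_a = rng.normal(size=(c, c)) * 0.1
W_b = rng.normal(size=(c, c)) * 0.1
z_new = triangle_update_outgoing(z, W_a, W_b)
```

Because the information about pair \((i,j)\) is refreshed from the edges \((i,k)\) and \((j,k)\) of every triangle it participates in, the network has a natural route to making the three sides of each triangle mutually consistent.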
The Generative Frontier: Designing New Biology
Diffusion models (used by RFdiffusion) learn to reverse a noising process. The forward process is a fixed Markov chain: starting from a real protein backbone \(x_0\), noise is added at each timestep according to a predetermined schedule, until \(x_T\) is pure noise. This requires no learned parameters. The reverse process is learned: a neural network denoises step by step from random noise \(x_T\) back to a valid backbone \(x_0\). In the standard formulation the forward noise is Gaussian, the network is typically a U-Net or transformer, and the training objective is straightforward: given a noisy structure \(x_t\), predict the noise \(\epsilon\) that was added. RFdiffusion adapts this recipe to proteins: it noises residue frames (positions and orientations) rather than raw coordinates, and its denoiser is a fine-tuned RoseTTAFold structure-prediction network that predicts the denoised backbone at each step. Conditioning on pocket sequence, target structure, or desired function is achieved by providing these as additional inputs to the denoising network. RFdiffusion makes it possible to generate very large numbers of candidate protein backbones in silico; a subset pass computational filters (structural plausibility, binding energy) and proceed to wet-lab validation.
Generative models for drug discovery (REINVENT, MolDQN) use reinforcement learning for molecular generation, optimizing molecules against a scoring function such as predicted activity, drug-likeness, or a docking score against the binding site.
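The fixed forward process and the noise-prediction objective can be sketched in NumPy. This follows the generic Gaussian (DDPM-style) formulation on raw 3D coordinates, not RFdiffusion's own frame-based scheme; the linear schedule, its endpoints, and the random "backbone" are assumptions for illustration. For Gaussian noise the forward chain has the closed form

\[x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)\]

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed) and cumulative signal fraction alpha_bar."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to timestep t of the forward process; no learned parameters."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Training objective: mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 3))          # toy stand-in for 50 C-alpha coordinates
betas, alpha_bar = make_schedule()
x_early, _ = q_sample(x0, 10, alpha_bar, rng)     # still almost the original
x_late, eps = q_sample(x0, 999, alpha_bar, rng)   # essentially pure noise
```

A denoising network would be trained by feeding it `x_t` and `t` and minimizing `noise_prediction_loss` against the `eps` that was actually drawn; generation then runs the learned reverse steps starting from random noise.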
The hallucination problem: generative models can create sequences that look plausible but are not functional. Validation bottleneck: only a fraction of designed molecules pass assay tests. AI-designed drugs are entering clinical trials but none have passed Phase III yet.
Biological Interpretation
A foundation model prediction is a hypothesis, never a fact. The self-supervised training objective (predict masked tokens, predict next token) does not inherently learn causality — it learns statistical co-occurrence. A model that accurately predicts masked amino acids may encode evolutionarily permissible substitutions, but this does not mean it understands the biological function of those residues.
AlphaFold’s predicted structures are remarkably accurate for single-domain proteins, but the confidence metrics must be inspected locally: pLDDT per residue and PAE per residue pair. A high-confidence prediction in a well-structured domain is near-experimental. A low-confidence prediction in a disordered region is not a failure; low pLDDT often reflects genuine physical disorder, although it can also result from a shallow MSA, so it should be read as a flag for closer inspection rather than as proof of disorder.
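Per-residue inspection is simple to automate once the pLDDT values are in hand. The sketch below assumes the scores have already been extracted into a list (AlphaFold writes them to the B-factor column of its PDB output); the band cut-offs of 90, 70, and 50 follow the convention commonly used by the AlphaFold Protein Structure Database, while the band labels, `min_length`, and the example scores are illustrative assumptions.

```python
def confidence_band(plddt):
    """Map a pLDDT score (0-100) to a conventional confidence band."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"

def low_confidence_segments(plddt_scores, threshold=70.0, min_length=3):
    """Return inclusive 0-based (start, end) runs of residues below threshold."""
    segments, start = [], None
    for i, score in enumerate(plddt_scores):
        if score < threshold and start is None:
            start = i                                # a low-confidence run begins
        elif score >= threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))      # run ended at the previous residue
            start = None
    if start is not None and len(plddt_scores) - start >= min_length:
        segments.append((start, len(plddt_scores) - 1))
    return segments

scores = [95, 93, 91, 88, 62, 45, 38, 41, 72, 90, 94, 55, 48]   # made-up example
segments = low_confidence_segments(scores)
bands = [confidence_band(s) for s in scores]
```

Flagged segments are candidates for disorder or for poorly supported regions; they should be excluded from, or treated cautiously in, downstream analyses such as docking.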
Generative protein design faces a fundamental asymmetry: generating sequences that pass computational filters (structural plausibility, evolutionary likelihood) is easy. Generating sequences that actually function in a cell is hard. The validation bottleneck — not the generative model — is the rate-limiting step in AI-driven protein design.
The transformer and self-supervised learning principles in this chapter extend directly to DNA. Chapter 25 shows how the same architectures scale to whole-genome models — the 4-letter nucleotide code, the 3 Gbp genome, and a regulatory grammar that spans megabases. Where protein models predict structure, DNA models predict regulatory function; the architectural innovations (Hyena, Mamba, Enformer) are direct responses to the \(O(L^2)\) scaling problem introduced in this chapter.
Current Landscape (Q2 2026)
- AlphaFold3 (2024) replaces the Evoformer with a simpler Pairformer trunk and the structure module with a diffusion-based module, enabling unified prediction of proteins, nucleic acids, small molecules, and ions.
- ESM3 (preprint 2024, published 2025) from EvolutionaryScale introduces multi-modal generation (sequence + structure + function), enabling prompt-based protein design.
- RFdiffusion and ProteinMPNN have been combined into end-to-end pipelines for de novo binder design, with several candidates now in preclinical validation.
- Critical 2025 benchmark: blinded evaluations show that AI-designed proteins still fail at ~80-90% rate in wet-lab validation, emphasizing the hallucination gap.
Summary and Required Reading
- Self-attention computes \(softmax(QK^T/\sqrt{d_k})V\) where Q, K, V are learned projections of the input. The \(\sqrt{d_k}\) scaling prevents softmax saturation at high dimensions, and positional encoding breaks the order-blindness (permutation equivariance) of raw self-attention. Encoder-only, decoder-only, and encoder-decoder architectures serve different biological tasks (embedding, generation, mapping respectively).
- Self-supervised learning (masked language modeling, autoregressive prediction) enables models to learn evolutionary constraints, structure, and function from unlabeled sequence data alone.
- Protein language models (ESM) capture folding grammar without MSAs — ESM embeddings provide dense representations for structure and function prediction.
- AlphaFold achieves near-experimental accuracy for single-domain protein structures, but pLDDT (per residue) and PAE (per residue pair) must be inspected; low confidence in disordered regions usually reflects genuine disorder, not a failure.
- Generative models (RFdiffusion, ProteinMPNN) can design novel proteins — but the validation bottleneck limits success: ~80-90% of AI-designed proteins fail in wet-lab validation.
Required Reading
- Vaswani et al.: “Attention is All You Need” (NeurIPS, 2017) — the original transformer.
- Devlin et al.: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (NAACL, 2019) — masked language modeling.
- Jumper et al.: “Highly accurate protein structure prediction with AlphaFold” (Nature, 2021).
- Lin et al.: “Evolutionary-scale prediction of atomic-level protein structure with a language model” (Science, 2023) — ESM-2.
- Watson et al.: “De novo design of protein structure and function with RFdiffusion” (Nature, 2023).
Johnson’s Rule: A foundation model prediction is a hypothesis. Validate it experimentally before publishing.