Chapter 19: 3D Genome Architecture

Johnson’s First Principle: The Genome is Folded in Three Dimensions

Linear proximity on a chromosome is an unreliable proxy for regulatory interaction. A promoter and its enhancer may be separated by 500 kilobases of linear sequence but brought into physical contact through chromatin looping. The linear genome is a coordinate system; the 3D genome is the functional reality.

Chapter 12 introduced the epigenome as software running on the DNA hardware. But software execution depends not just on which proteins are bound and which regions are accessible — it depends on which genomic elements are physically brought together in 3D space. An enhancer cannot activate a promoter if it is in a different physical compartment of the nucleus, regardless of how accessible the chromatin is at each element individually. Three-dimensional genome architecture is the spatial constraint that regulatory logic operates within.


Core Concepts

Proximity Ligation and Hi-C Physics

Hi-C measures physical proximity across the genome. The biochemical logic is simple: crosslink spatially adjacent chromatin with formaldehyde (a short-range crosslinker that fixes protein-DNA and protein-protein contacts within a few ångströms, so loci bridged by the same protein complex are captured together), digest with a restriction enzyme (typically HindIII or MboI), ligate the crosslinked fragments back together (creating chimeric junctions between loci that were physically close in 3D space), reverse crosslinks, and sequence the junctions. Each sequencing read pair maps to two different genomic loci, and the closer two loci are in 3D space, the more frequently they are ligated together.

The output is a contact matrix \(M_{ij}\), where each entry records the number of ligation events observed between locus \(i\) and locus \(j\) across millions of cells. The matrix is symmetric (\(M_{ij} = M_{ji}\)) because proximity ligation is bidirectional — only the upper triangle contains unique information. For a genome binned at 1 Mb resolution, this produces a 3,000 \(\times\) 3,000 contact matrix. At the highest sequencing depth (5-10 billion reads), kilobase-resolution contact maps are possible in targeted regions.
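
The binning step that turns mapped read pairs into \(M_{ij}\) can be sketched as follows. This is a minimal illustration for a single chromosome, assuming read pairs are already mapped and filtered; the function name `build_contact_matrix` and the toy coordinates are hypothetical.

```python
import numpy as np

def build_contact_matrix(pairs, chrom_length, bin_size):
    """Bin (pos1, pos2) read-pair coordinates into a symmetric contact matrix."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1  # mirror the count so that M[i, j] == M[j, i]
    return matrix

# Toy input: three read pairs on a 3 Mb chromosome, binned at 1 Mb.
pairs = [(150_000, 2_400_000), (2_450_000, 120_000), (900_000, 950_000)]
M = build_contact_matrix(pairs, chrom_length=3_000_000, bin_size=1_000_000)
print(M)
```

The first two pairs fall in the same pair of bins regardless of read order, which is why only the upper triangle carries unique information.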

Resolution scales with sequencing depth. Resolution is defined by the bin size, and the number of matrix entries grows quadratically as bins shrink: halving the bin size doubles the number of bins along each axis and quadruples the total entries, most of which remain sparse at high resolution. Commonly quoted rules of thumb are ~20 million read pairs for 1 Mb bins, ~200 million for 100 kb bins, and ~5 billion for 10 kb bins. These figures grow more slowly than the entry count, largely because most contacts fall near the diagonal, so usable resolution is set mainly by coverage per bin rather than by the need to fill every entry. In practice, compartment analysis uses 100-500 kb bins, TAD boundaries are visible at 10-50 kb, and individual loop detection requires 5-10 kb bins with correspondingly deep sequencing.
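
The arithmetic behind the quadrupling can be checked directly. The sketch below assumes a genome length of about 3.1 Gb (an approximation for human, not a figure from this chapter) and counts the unique upper-triangle entries at several bin sizes; `matrix_dimensions` is a hypothetical helper.

```python
GENOME_SIZE = 3_100_000_000  # approximate human genome length in bp (assumption)

def matrix_dimensions(bin_size, genome_size=GENOME_SIZE):
    """Return (bins per axis, unique upper-triangle entries) for a bin size."""
    n_bins = -(-genome_size // bin_size)  # ceiling division
    unique_entries = n_bins * (n_bins + 1) // 2
    return n_bins, unique_entries

for bin_size in (1_000_000, 500_000, 100_000, 10_000):
    n_bins, entries = matrix_dimensions(bin_size)
    print(f"{bin_size:>9,} bp bins: {n_bins:>7,} per axis, {entries:.2e} unique entries")
```

Going from 1 Mb to 500 kb bins doubles the bins per axis and roughly quadruples the entries, while the number of read pairs available to fill them is unchanged.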

The decay law and normalization. Contact frequency decays with genomic distance following a polymer physics scaling law: approximately \(M_{ij} \sim |i - j|^{-1}\) for interphase chromatin at megabase scales (mitotic chromosomes follow a different, shallower decay, roughly \(|i - j|^{-0.5}\) up to about 10 Mb). This means that nearby loci (1-10 kb apart) always have high contact frequency regardless of biological loop formation. This distance-dependent background must be removed before biological signal can be interpreted. The standard approach is observed-over-expected (obs/exp) normalization: for each genomic distance \(s = |i - j|\), compute the expected contact frequency as the average over all locus pairs at that distance, then divide each entry by its distance-matched expectation. After obs/exp normalization, a value \(> 1\) means “more contacts than expected given the genomic distance between the two loci”. This is how TAD interiors and loops appear as enriched features in the normalized matrix, while TAD boundaries appear as the transitions where that enrichment drops.
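
The obs/exp procedure described above can be sketched in a few lines of NumPy. This is a minimal version for one dense intrachromosomal matrix, assuming the expectation at each distance is the plain mean of that diagonal; production tools typically smooth the expectation at large distances and mask low-coverage bins. The toy matrix is synthetic.

```python
import numpy as np

def observed_over_expected(matrix):
    """Divide each entry by the mean contact count at its genomic distance."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    expected = np.zeros_like(matrix)
    for s in range(n):
        mean_s = np.diagonal(matrix, offset=s).mean()  # average over pairs at distance s
        idx = np.arange(n - s)
        expected[idx, idx + s] = mean_s
        expected[idx + s, idx] = mean_s
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(expected > 0, matrix / expected, 0.0)

# Toy matrix: pure distance decay plus one pair with four times the usual contacts.
n = 6
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
toy = 100.0 / (dist + 1)
toy[1, 4] = toy[4, 1] = 4 * toy[1, 4]
oe = observed_over_expected(toy)
print(np.round(oe, 2))
```

In this toy case the enriched pair stands out with obs/exp of 2 rather than 4, because the enriched entry itself inflates the mean of its own diagonal; with only three pairs at that distance the effect is large, and it shrinks as the number of pairs per diagonal grows.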

Iterative Correction and Eigenvector decomposition (ICE) and Knight-Ruiz (KR) matrix balancing normalize the contact matrix by removing systematic biases: GC content bias (GC content affects amplification and sequencing efficiency), mappability (reads from repetitive regions cannot be placed uniquely), restriction site density bias (fragments with more restriction sites produce more ligation products), and fragment length effects (longer fragments have higher ligation probability). Both methods model biases as multiplicative factors: the observed contact count \(M_{ij}\) equals the true contact frequency \(T_{ij}\) multiplied by bias factors \(b_i\) and \(b_j\) for each locus (\(M_{ij} = T_{ij} \cdot b_i b_j\)). Under the assumption that all loci have equal “visibility” in the absence of biological structure, the correction solves for \(b_i\) such that each row sum becomes uniform. After dividing by the bias factors, the normalized matrix \(M_{ij} / (b_i b_j)\) reflects biological contact frequency independent of sequence-level biases. KR typically converges faster; ICE is more robust on sparse matrices, where KR can fail to converge.
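
The multiplicative model \(M_{ij} = T_{ij} \cdot b_i b_j\) suggests a simple iteration: repeatedly divide rows and columns by their relative coverage until the row sums are uniform. The sketch below follows that idea in the spirit of ICE; it is not the reference implementation, and it omits the filtering of low-coverage bins and near-diagonal entries that real pipelines apply. The synthetic matrix and bias vector are made up for the demonstration.

```python
import numpy as np

def ice_balance(matrix, max_iter=500, tol=1e-8):
    """Iteratively rescale a symmetric matrix until its row sums are uniform."""
    balanced = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(balanced.shape[0])
    for _ in range(max_iter):
        row_sums = balanced.sum(axis=1)
        covered = row_sums > 0  # leave empty rows untouched
        scale = np.ones_like(bias)
        scale[covered] = row_sums[covered] / row_sums[covered].mean()
        balanced /= np.outer(scale, scale)  # apply b_i * b_j correction
        bias *= scale                       # accumulate the total bias per bin
        if np.abs(scale - 1.0).max() < tol:
            break
    return balanced, bias

# Synthetic check: a "true" matrix with uniform row sums, distorted by known biases.
n = 6
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
T = 1.0 / (1.0 + np.minimum(gap, n - gap))  # circular distance decay
b = np.array([0.5, 1.0, 2.0, 1.0, 0.5, 1.5])
M = T * np.outer(b, b)
balanced, bias = ice_balance(M)
print(np.round(bias / b, 4))  # constant vector: biases recovered up to a scale factor
```

The recovered bias vector is defined only up to a global constant, which is why the check compares the ratio `bias / b` rather than the values themselves.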

Topological Structures

The normalized contact matrix reveals hierarchical 3D structures at three scales:

A/B Compartments are the largest scale (multimegabase, typically 5-100 Mb). The first principal component of the normalized contact matrix (or the first eigenvector of the correlation matrix) partitions the genome into two states. Because the sign of an eigenvector is arbitrary, it is oriented using an independent track such as gene density or GC content so that positive values mark the A compartment. A compartments are gene-rich, transcriptionally active, early replicating, and enriched for activating histone marks (H3K27ac, H3K4me3). B compartments are gene-poor, transcriptionally silent, late replicating, and enriched for repressive marks (H3K9me3, H3K27me3). The compartment signal is visible as a “plaid” pattern in the contact matrix: checkerboard enrichment indicating that A loci preferentially contact other A loci, and B loci preferentially contact other B loci.
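
A compartment call along these lines can be sketched as an eigendecomposition of the correlation matrix of an obs/exp map. The example assumes a single chromosome with no empty bins, and it orients the eigenvector with a GC-content track, which is one common convention rather than the only one. The plaid toy matrix and GC values are synthetic.

```python
import numpy as np

def compartment_eigenvector(oe_matrix, gc_content=None):
    """Leading eigenvector of the row-correlation matrix of an obs/exp map."""
    oe = np.asarray(oe_matrix, dtype=float)
    corr = np.corrcoef(oe)                    # correlate contact profiles of bins
    eigvals, eigvecs = np.linalg.eigh(corr)   # symmetric eigendecomposition
    pc1 = eigvecs[:, np.argmax(eigvals)]
    # The sign of an eigenvector is arbitrary; flip it so GC-rich bins are positive (A).
    if gc_content is not None and np.corrcoef(pc1, gc_content)[0, 1] < 0:
        pc1 = -pc1
    return pc1

# Toy plaid pattern: +1 bins (A) contact each other more, as do -1 bins (B).
state = np.array([1, 1, -1, -1, 1, -1, 1, -1])
oe = np.where(np.outer(state, state) > 0, 1.5, 0.5)
gc = np.where(state > 0, 0.55, 0.40)
pc1 = compartment_eigenvector(oe, gc_content=gc)
print(np.sign(pc1).astype(int))
```

On real data the leading eigenvector sometimes captures chromosome arms rather than compartments, so the result should be checked against the plaid pattern before it is interpreted.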

Topologically Associating Domains (TADs) are megabase-scale (typically 200 kb - 2 Mb) regions within which contact frequency is enriched relative to the background. TADs are visible as triangular blocks along the diagonal of the contact matrix. TAD boundaries are sharply demarcated by convergent CTCF motifs bound by CTCF and cohesin. Disruption of TAD boundaries (by deletion of the CTCF site or mutation of CTCF itself) causes ectopic enhancer-promoter contacts across the boundary, leading to misregulation of developmental genes. This mechanism is directly implicated in developmental disorders (such as limb malformations from boundary deletions in the EPHA4 locus) and cancer (TAD boundary disruptions causing oncogene activation by distal enhancers).
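
One way to locate the edges of the triangular blocks is an insulation-style score: slide a small square window along the diagonal and average the contacts that cross each bin. This heuristic is not described in the text above and is offered only as an illustration; the window size and the block-diagonal toy matrix are arbitrary choices.

```python
import numpy as np

def insulation_score(matrix, window):
    """Mean contact count crossing each bin, within `window` bins on either side."""
    matrix = np.asarray(matrix, dtype=float)
    n = matrix.shape[0]
    score = np.full(n, np.nan)  # edges lack a full window and stay NaN
    for i in range(window, n - window):
        # Upstream bins (rows) against downstream bins (columns) across bin i.
        score[i] = matrix[i - window:i, i + 1:i + window + 1].mean()
    return score

# Toy map: two 5-bin domains with strong internal contacts and weak contacts between them.
tads = np.ones((10, 10))
tads[:5, :5] = 10.0
tads[5:, 5:] = 10.0
score = insulation_score(tads, window=2)
print(score)
```

The score dips where few contacts cross, which in the toy map is at the junction between bins 4 and 5. On real data the score is usually computed on a balanced matrix and boundaries are taken as local minima.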

Enhancer-Promoter loops are sub-Mb contacts identified by significant contact enrichment relative to the local TAD background. These are the functional units of gene regulation: a physical connection between a distal enhancer and its target promoter, often formed or stabilized by the cohesin-CTCF extrusion machinery. Methods for loop identification (Fit-Hi-C, HiCCUPS, Mustache) differ primarily in how they model the background distribution: Fit-Hi-C fits a spline to the distance-decay curve and tests each contact against it with a binomial model; HiCCUPS searches for local enrichment relative to the surrounding region in the contact matrix, with FDR control; Mustache applies multi-scale (scale-space) blob detection to the contact map.
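
The local-enrichment idea can be illustrated with a single square "donut" around a candidate pixel. This is a deliberately simplified sketch and not the HiCCUPS algorithm, which uses several neighborhood shapes and a statistical test with multiple-testing correction; the function name, window sizes, and toy matrix are assumptions.

```python
import numpy as np

def donut_enrichment(matrix, i, j, inner=1, outer=3):
    """Ratio of pixel (i, j) to the mean of a surrounding square ring."""
    n_rows, n_cols = matrix.shape
    if i - outer < 0 or j - outer < 0 or i + outer >= n_rows or j + outer >= n_cols:
        raise ValueError("pixel too close to the matrix edge for this window")
    window = matrix[i - outer:i + outer + 1, j - outer:j + outer + 1].astype(float)
    ring = np.ones_like(window, dtype=bool)
    c = outer  # index of the candidate pixel inside the window
    ring[c - inner:c + inner + 1, c - inner:c + inner + 1] = False  # drop pixel and halo
    return matrix[i, j] / window[ring].mean()

# Toy map: flat background of 5 contacts with one focal peak of 25.
toy = np.full((20, 20), 5.0)
toy[5, 14] = toy[14, 5] = 25.0
print(donut_enrichment(toy, 5, 14))  # peak pixel
print(donut_enrichment(toy, 5, 10))  # background pixel
```

Excluding the inner halo keeps the blurred shoulders of a real peak from inflating the background estimate.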

Loop Extrusion Model

The dominant mechanistic model for loop and TAD formation: the cohesin complex loads onto chromatin and actively extrudes a loop, reeling the chromatin fiber through its ring structure until it encounters CTCF proteins bound in a specific orientation (convergent motifs). The CTCF-cohesin interaction stalls extrusion, creating a stable loop anchored at the two CTCF sites. This model explains three key observations:

  1. Directional bias: CTCF motifs at loop anchors are overwhelmingly in convergent orientation (→ ←) and only rarely divergent (← →). The extrusion model predicts this: cohesin is stalled by CTCF only when it encounters the motif from the correct direction.
  2. Nested hierarchy: TADs contain sub-TADs which contain finer loops. The extrusion model explains this as partial extrusion — a loop can be stalled at early CTCF sites or continue to more distant ones.
  3. Size distribution: Loops show a characteristic distribution peaking at 100-500 kb, consistent with the processivity of cohesin extrusion before stalling.
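
The orientation rule and the nesting behavior can both be seen in a deterministic caricature of extrusion on a 1D lattice. The sketch assumes that a "+" motif blocks only the leftward-moving side of cohesin and a "-" motif blocks only the rightward-moving side, that blocking is absolute, and that cohesin never unloads; real extrusion is stochastic on all three counts. Site positions are hypothetical.

```python
def extrude(load_site, ctcf_sites, n_bins):
    """Return (left, right) loop anchors for cohesin loaded at `load_site`.

    ctcf_sites maps bin index -> motif strand: "+" points right, "-" points left.
    """
    left = right = load_site
    # The left side moves toward bin 0 and stalls only at a motif pointing at it ("+").
    while left > 0 and ctcf_sites.get(left) != "+":
        left -= 1
    # The right side moves toward the last bin and stalls only at a "-" motif.
    while right < n_bins - 1 and ctcf_sites.get(right) != "-":
        right += 1
    return left, right

ctcf_sites = {10: "+", 20: "-", 30: "-", 40: "+", 60: "-"}
for load_site in (25, 35, 50):
    print(load_site, extrude(load_site, ctcf_sites, n_bins=100))
```

Every loop produced this way has a "+" motif at its left anchor and a "-" motif at its right anchor, which is the convergent arrangement. Loading at bin 35 skips the wrongly oriented sites at 30 and 40 and yields a larger loop (10, 60) that contains the loops formed from bins 25 and 50, a simple picture of nesting.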

Biological Interpretation

A contact matrix is a population average across millions of cells. A strong contact in bulk Hi-C may represent a constant loop in all cells, or a statistically significant contact that occurs in only 5% of cells but is enriched far above the background expectation. Single-cell Hi-C and imaging-based methods (DNA-MERFISH, Oligopaint) have shown that TADs and compartments are stochastic at the single-cell level: any given cell has only a fraction of the loops detected in bulk, but the population average converges to the structures we call TADs.
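
A short calculation shows why bulk data alone cannot distinguish these scenarios. The sketch treats the bulk signal as the population mean of a per-cell contact probability; all probabilities are invented for illustration and `bulk_enrichment` is a hypothetical helper.

```python
def bulk_enrichment(fraction_with_loop, contact_prob_looped, contact_prob_background):
    """Population-average contact probability relative to the background."""
    mean_prob = (fraction_with_loop * contact_prob_looped
                 + (1 - fraction_with_loop) * contact_prob_background)
    return mean_prob / contact_prob_background

# Scenario 1: a strong loop present in only 5% of cells.
rare_strong = bulk_enrichment(0.05, contact_prob_looped=0.50, contact_prob_background=0.01)
# Scenario 2: a weak contact preference present in every cell.
constitutive_weak = bulk_enrichment(1.0, contact_prob_looped=0.0345, contact_prob_background=0.01)
print(round(rare_strong, 2), round(constitutive_weak, 2))
```

Both scenarios give the same bulk enrichment of about 3.45, so separating them requires single-cell or imaging data.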

The biological interpretation question is not “does this loop exist?” but “in which cells, at what frequency, and under what conditions does this loop form?” Loops that are constitutive (present in most cells across conditions) are likely structural — they organize the genome into a stable architecture. Loops that form only in specific cell types or after stimulation are likely regulatory — they mediate cell-type-specific or signal-responsive gene expression.

A/B compartment status correlates with transcriptional activity, but causality runs in both directions. Active transcription can drive a locus into the A compartment (transcription opens the region), and compartment assignment can influence transcription (spatial proximity to active regions promotes expression). Disentangling the two requires perturbation experiments, for example tethering a locus to the nuclear lamina (forced B compartmentalization) and measuring the transcriptional effect.

The linear-to-3D mapping challenge is substantial. A GWAS variant 500 kb from the nearest gene may regulate that gene through a loop, or it may regulate a completely different gene brought into contact by the 3D architecture. Variant-to-function mapping that only considers linear distance to the nearest TSS will systematically misattribute non-coding variants to the wrong target gene.
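
The difference between the two assignment strategies can be made concrete with a toy locus. The gene names, coordinates, loop call, and the 5 kb anchor tolerance below are all hypothetical, and the loop-based lookup is a bare-bones sketch rather than an established variant-to-gene method.

```python
def nearest_tss_gene(variant_pos, tss_by_gene):
    """Assign the variant to the gene whose TSS is closest in linear distance."""
    return min(tss_by_gene, key=lambda gene: abs(tss_by_gene[gene] - variant_pos))

def loop_linked_genes(variant_pos, loops, tss_by_gene, anchor_halfwidth=5_000):
    """Genes whose TSS sits at the opposite anchor of a loop touching the variant."""
    linked = set()
    for anchor_a, anchor_b in loops:
        for near, far in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
            if abs(variant_pos - near) <= anchor_halfwidth:
                linked.update(gene for gene, tss in tss_by_gene.items()
                              if abs(tss - far) <= anchor_halfwidth)
    return sorted(linked)

tss_by_gene = {"GENE_A": 1_050_000, "GENE_B": 1_600_000}  # hypothetical genes
loops = [(1_100_000, 1_600_000)]                           # hypothetical loop call
variant = 1_100_000

print(nearest_tss_gene(variant, tss_by_gene))          # GENE_A (50 kb away)
print(loop_linked_genes(variant, loops, tss_by_gene))  # ['GENE_B'] (500 kb away)
```

Here the nearest-TSS rule picks GENE_A, while the contact data point to GENE_B, the situation in which linear-distance assignment goes wrong.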


Current Landscape (Q2 2026)

  • Pore-C and Pore-C-HiFi combine Hi-C with long-read sequencing to map multi-way contacts and haplotypes simultaneously, revealing simultaneous three-way and four-way chromatin interactions that bulk pairwise Hi-C cannot detect.
  • Evo2HiC (2026) predicts Hi-C contact matrices from DNA sequence alone, enabling 3D genome prediction in any cell type without experimental Hi-C data — but predictions degrade for cell-type-specific loops, suggesting they are driven by epigenomic state rather than sequence.
  • DNA-MERFISH and seqFISH+ now image dozens to hundreds of genomic loci simultaneously at single-cell resolution, revealing that bulk TAD boundaries are probabilistic: a given boundary is present in only 40-70% of individual cells.
  • AI-predicted 3D genome folding (Orca, DeepC, Akita) approaches experimental resolution for promoter-enhancer contact prediction, enabling in silico perturbation of TAD boundaries to predict the regulatory consequences of structural variants.

Summary and Required Reading

  1. Hi-C measures 3D proximity via proximity ligation — contact frequency decays as \(\sim 1/s\) with distance; ICE/KR normalization removes sequence biases to reveal biological structure.
  2. Three hierarchical structures — A/B compartments (multimegabase, active vs. silent), TADs (megabase, regulatory neighborhood), enhancer-promoter loops (sub-Mb, functional contacts).
  3. The loop extrusion model explains convergent CTCF motif orientation, nested TAD hierarchy, and loop size distribution through cohesin-mediated loop extrusion stalled by CTCF.
  4. Bulk Hi-C is a population average — single-cell methods reveal that TADs and loops are stochastic, not deterministic, structures.
  5. 3D genome interpretation requires perturbation — compartment status and transcription have bidirectional causality; static measurements cannot resolve direction.

Required Reading

  • Lieberman-Aiden et al.: “Comprehensive mapping of long-range interactions reveals folding principles of the human genome” (Science, 2009).
  • Rao et al.: “A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping” (Cell, 2014).
  • Fudenberg et al.: “Formation of chromosomal domains by loop extrusion” (Cell Reports, 2016).

Johnson’s Rule: Linear distance does not equal regulatory distance. If you are only looking at genes and their nearest neighbors, you are missing the regulatory architecture.