Chapter 12: Epigenomics and Hidden Signals

Johnson’s First Principle: DNA is Hardware, Epigenetics is Software

Nearly every cell in the body carries the same genome of roughly 3.2 billion base pairs. A neuron, a liver cell, and a skin cell differ not in their DNA sequence but in which genes are accessible and expressed. A mutation is a hardware defect; an epigenetic change is a software bug: reversible, context-dependent, and programmable.

This analogy is more literal than it appears. DNA sequence (hardware) constrains what is possible — a gene cannot be expressed if it does not exist, and a regulatory element cannot function if its sequence is deleted. But the epigenome (software) determines what is actually executed — which promoters are accessible, which enhancers are active, which genes are silenced. Two cells with identical genomes can execute entirely different programs.

The techniques in this chapter are complementary probes of the software stack. No single technique tells the full story: ChIP-seq identifies which proteins are bound to DNA, ATAC-seq reveals where chromatin is accessible, motif analysis reads the regulatory code embedded in the DNA sequence itself, and DNA methylation measures a persistent chemical mark that stabilizes gene expression programs across cell divisions. Together, they form a multi-layered picture of how the same genome produces different cellular outcomes.


Core Concepts

Layer 1: Where Proteins Bind — ChIP-seq and CUT&Tag

Chromatin Immunoprecipitation sequencing (ChIP-seq) maps genome-wide binding sites of a specific protein. The biochemical protocol is straightforward: crosslink proteins to DNA with formaldehyde, sonicate to 200-600 bp fragments, immunoprecipitate with an antibody targeting the protein of interest, reverse crosslinks, and sequence the enriched DNA fragments.

The signal interpretation depends entirely on the antibody target:

| Target | Biological meaning | Peak shape |
|---|---|---|
| H3K27ac | Active enhancers and promoters | Narrow peaks |
| H3K4me3 | Active promoters | Narrow peaks at TSS |
| H3K27me3 | Polycomb-repressed chromatin | Broad domains (kb-Mb) |
| H3K9me3 | Heterochromatin | Broad domains |
| CTCF | Insulator binding / chromatin looping | Narrow peaks |
| POLR2A | Active transcription | Narrow peaks at TSS plus gene-body signal |

Antibody specificity is the dominant factor in ChIP-seq quality. A poorly validated antibody pulls down off-target proteins, producing clean-looking peaks at biologically irrelevant sites that pass all computational filters. The essential quality metric is the FRiP score (Fraction of Reads in Peaks): for a well-characterized transcription factor, FRiP > 5% indicates enrichment above background, while FRiP < 1% suggests the antibody failed. There is no computational method that can rescue a poor antibody — the enrichment simply is not there.
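The FRiP calculation itself is simple enough to sketch. The version below is a minimal illustration, assuming reads have already been reduced to one position each (for example the 5' end) and that peaks are non-overlapping half-open intervals; the function name `frip` and the input layout are choices made here for illustration, not part of any particular pipeline.

```python
from bisect import bisect_right

def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, pos) tuples, one per read.
    peaks: list of (chrom, start, end) half-open, non-overlapping intervals.
    """
    # Group peak intervals by chromosome and sort them by start coordinate.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        if chrom not in by_chrom:
            continue
        # Find the last peak starting at or before this position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < by_chrom[chrom][i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0

reads = [("chr1", 120), ("chr1", 480), ("chr1", 5000), ("chr2", 50)]
peaks = [("chr1", 100, 200), ("chr1", 400, 500)]
print(frip(reads, peaks))  # two of the four toy reads fall in peaks
```

On real data the same quantity is usually computed from a BAM file and a peak file with interval tools; the toy inputs here only show what is being counted.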

Broad vs. narrow peak calling. ChIP-seq targets fall into two structurally distinct classes. Narrow targets (H3K27ac, H3K4me3, and sequence-specific factors such as CTCF) produce focal enrichment at specific regulatory elements, typically 200-1000 bp. Broad marks (H3K27me3, H3K9me3) produce diffuse enrichment across tens of kilobases to megabases. MACS2 handles both: its --broad flag links nearby enriched regions into contiguous domains, using a second, more lenient cutoff (--broad-cutoff), rather than reporting isolated peaks. Using narrow-peak mode on a broad histone mark cuts each domain into hundreds of fragmented “peaks”. These fragments are not independent biological observations, so downstream statistics that treat each peak as a separate binding event are invalid.
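To see why fragmented calls over one domain should not be counted as separate events, it helps to merge them back together. The sketch below joins intervals separated by at most `max_gap` bp. It is an illustration of the idea only, not MACS2's broad-peak algorithm, and both the function name and the 500 bp gap are assumptions chosen for the example.

```python
def merge_nearby(intervals, max_gap):
    """Merge (start, end) intervals separated by at most max_gap bp."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Three narrow "peaks" tiling one domain, plus one distant peak.
fragments = [(1000, 1400), (1600, 2100), (2300, 2900), (50000, 50400)]
domains = merge_nearby(fragments, max_gap=500)
print(domains)
```

Four narrow calls collapse to two regions, which is the number of independent observations a downstream test should see.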

CUT&Tag (Cleavage Under Targets and Tagmentation) has largely replaced ChIP-seq for histone modifications in many applications. A protein-A-Tn5 fusion is guided to the target protein by a primary antibody, and the Tn5 is activated by adding magnesium, which cuts and tags DNA locally. Advantages: lower cell input (100-1,000 cells vs. \(10^6\) for ChIP-seq), higher signal-to-noise ratio, native (non-crosslinked) conditions that avoid epitope masking. The tradeoff: CUT&Tag requires a high-quality primary antibody with high affinity, and the short tagmentation fragments (~150 bp) are less informative for broad domain calls.

Layer 2: Where Chromatin is Open — ATAC-seq

ATAC-seq maps open chromatin without targeting a specific protein. Intact nuclei are incubated under native conditions with Tn5 transposase, which simultaneously cuts DNA and inserts sequencing adapters, preferentially in nucleosome-free regions. The resulting fragments are PCR-amplified and sequenced. The protocol is remarkably simple: ~3 hours from cells to library, no crosslinking, no antibody.

| Property | ChIP-seq | ATAC-seq |
|---|---|---|
| Target | Specific protein | Open chromatin |
| Antibody | Required | None |
| Protocol duration | 2-3 days | ~3 hours |
| Cell requirement | \(10^6\) | 500-50,000 |
| Signal interpretation | Binding site locations | Accessibility (regulatory potential) |
| Known bias | Antibody specificity | Tn5 sequence preference (GC-rich) |

ATAC-seq cannot identify which transcription factor is bound at each accessible site. Open chromatin is a necessary condition for transcription factor binding but not sufficient — an accessible region may be occupied by any of dozens of factors with motifs in that region, or by none at all. Determining which factor is bound requires either:

  1. Motif analysis (below): which transcription factor binding motifs are enriched in the accessible regions
  2. Orthogonal ChIP-seq for the specific factor of interest
  3. Footprinting: Tn5 cutting is blocked where a transcription factor is bound, creating a depletion of cut sites at the motif position. The nucleotide-resolution cutting pattern reveals “footprints” of ~15-30 bp where the factor protects DNA from integration — detectable as local dips in the Tn5 cut frequency within open peaks.
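The footprinting idea in option 3 can be sketched as a ratio of cut frequency inside the motif to cut frequency in the flanks. This is a deliberately simplified illustration: it assumes a per-base list of Tn5 cut counts for one region and ignores Tn5 sequence-bias correction, which dedicated footprinting tools model. The function name and the 20 bp flank are hypothetical choices.

```python
def footprint_depth(cut_counts, motif_start, motif_end, flank=20):
    """Ratio of mean Tn5 cuts inside the motif to mean cuts in the flanks.

    cut_counts: per-base cut counts across one region (list of ints).
    motif_start, motif_end: half-open motif coordinates within that list.
    Values well below 1 suggest protection by a bound factor.
    """
    left = cut_counts[max(0, motif_start - flank):motif_start]
    right = cut_counts[motif_end:motif_end + flank]
    inside = cut_counts[motif_start:motif_end]
    flanks = left + right
    if not inside or not flanks or sum(flanks) == 0:
        return None  # not enough signal to compute a ratio
    return (sum(inside) / len(inside)) / (sum(flanks) / len(flanks))

# Toy profile: 10 cuts per base in the flanks, 2 per base across a 15 bp motif.
counts = [10] * 20 + [2] * 15 + [10] * 20
depth = footprint_depth(counts, 20, 35)
print(depth)
```

In practice the ratio is aggregated over many motif instances, because single sites rarely have enough cuts to show a clear dip.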

The nucleosome signal in ATAC-seq provides an internal quality metric. A successful assay shows a fragment-length ladder: sub-nucleosomal fragments from open chromatin (< 100 bp), a mononucleosomal peak near 200 bp (about 147 bp of nucleosome-wrapped DNA plus linker), and weaker di- and tri-nucleosomal peaks at roughly 200 bp intervals. Loss of the nucleosomal ladder suggests over-digestion; absence of sub-nucleosomal fragments suggests under-digestion or insufficient transposase activity.
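A quick way to inspect the fragment-length distribution is to bin fragment lengths into a nucleosome-free class and a mononucleosomal class. The sketch below assumes fragment lengths have already been extracted from properly paired reads; the bin boundaries (below 100 bp, and 180-247 bp) are illustrative assumptions and vary between pipelines.

```python
def fragment_classes(lengths, nfr_max=100, mono_range=(180, 247)):
    """Fractions of fragments that are nucleosome-free or mononucleosomal."""
    n = len(lengths)
    if n == 0:
        return {"nucleosome_free": 0.0, "mononucleosome": 0.0}
    nfr = sum(1 for x in lengths if x < nfr_max)
    mono = sum(1 for x in lengths if mono_range[0] <= x <= mono_range[1])
    return {"nucleosome_free": nfr / n, "mononucleosome": mono / n}

lengths = [50, 60, 80, 95, 190, 200, 210, 400]
classes = fragment_classes(lengths)
print(classes)
```

A library with almost no fragments in one of the two classes is worth examining before peak calling, although a histogram of all lengths remains the more informative diagnostic.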

MACS2 Peak Calling: Poisson Background and Signal Detection

MACS2 models the background read distribution as a Poisson process. As established in Chapter 4, the Poisson models sampling noise when events occur independently at a constant rate — which describes the null hypothesis for ChIP-seq: in the absence of binding, sonicated DNA fragments are randomly distributed across the genome. For a window of length \(w\), the expected number of reads under the null hypothesis is:

\[\lambda_{\text{local}} = \frac{\text{reads in local region}}{\text{length of local region}} \times w\]

A peak is called if the observed read count exceeds \(\lambda_{\text{local}}\) with a Poisson p-value below a threshold (typically \(p < 10^{-5}\) for narrow peaks). MACS2 applies double-peak correction — shifting each tag toward the 3’ direction by half the estimated fragment length — to reconstruct the true binding site position from the sonicated fragment ends.
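The tag-shifting step can be illustrated in a few lines. The sketch assumes each read is summarized by its 5' position and strand and that the fragment length has already been estimated; it moves plus-strand tags right and minus-strand tags left by half that length, so both strands pile up over the binding site. The function name and inputs are hypothetical, and MACS2's own implementation differs in detail.

```python
def shift_tags(tags, fragment_length):
    """Shift each read's 5' end by half the fragment length toward its 3' direction.

    tags: list of (five_prime_pos, strand) with strand '+' or '-'.
    """
    half = fragment_length // 2
    return [pos + half if strand == "+" else pos - half for pos, strand in tags]

# Plus-strand tags sit upstream of the site, minus-strand tags downstream.
tags = [(900, "+"), (905, "+"), (1100, "-"), (1095, "-")]
shifted = shift_tags(tags, fragment_length=200)
print(shifted)
```

After shifting, tags from both strands cluster around position 1000, which is where the bound protein would be in this toy example.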

Significance vs. effect size. The Poisson p-value quantifies statistical confidence, but a peak with a tiny p-value may still have negligible biological effect. In a high-background region, modest enrichment is enough: with an expected background of 200 reads in the window, observing 300 reads gives a Poisson p-value on the order of \(10^{-11}\) at only 1.5-fold enrichment, which is statistically significant but biologically questionable. MACS2 reports both the p-value and a fold-enrichment score (observed signal ÷ local background). A commonly used filter for transcription factor ChIP-seq is fold-enrichment > 2 in addition to \(p < 10^{-5}\). A peak that passes the p-value threshold but not the fold-enrichment threshold is more likely a high-coverage artifact than a genuine binding event.
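The interplay between p-value and fold enrichment is easy to check numerically. The sketch below computes the Poisson upper-tail probability directly and applies both thresholds; the function names and the two example windows are illustrative assumptions, and real MACS2 output also involves depth scaling and multiple-testing correction that are omitted here.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if k <= 0:
        return 1.0
    if k <= lam:
        # Not enriched: sum the lower tail and take the complement.
        lower = sum(math.exp(j * math.log(lam) - lam - math.lgamma(j + 1))
                    for j in range(k))
        return max(0.0, 1.0 - lower)
    # Enriched: sum pmf terms from k upward until they stop contributing.
    term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    total, i = 0.0, k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total

def call_peak(observed, lam, p_cut=1e-5, fold_cut=2.0):
    """Return (p-value, fold enrichment, keep?) for one candidate window."""
    p = poisson_upper_tail(observed, lam)
    fold = observed / lam
    return p, fold, (p < p_cut and fold > fold_cut)

for observed, lam in [(300, 200.0), (20, 5.0)]:
    p, fold, keep = call_peak(observed, lam)
    print(f"observed={observed} lambda={lam} p={p:.1e} fold={fold:.1f} keep={keep}")
```

The first window is highly significant but only 1.5-fold enriched, so the combined filter rejects it; the second has far fewer reads but 4-fold enrichment and passes both thresholds.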

The choice of \(\lambda\) estimator matters. MACS2 computes the background rate at several scales: a genome-wide rate \(\lambda_{\text{BG}}\) and local rates from windows centred on the candidate peak (by default 1 kb and 10 kb, set with --slocal and --llocal, and estimated from the control sample when one is available). The dynamic \(\lambda_{\text{local}}\) is the maximum of these, which makes it the most conservative choice and guards against local biases such as open chromatin, copy-number gains, and mappability artifacts. Using only the genome-wide rate (--nolambda) underestimates the expected background in such high-background regions, inflating significance there, and overestimates it in low-background regions, distorting the p-value distribution. The larger window matters most for broad histone marks, where enrichment spans several kilobases.
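A conservative background estimate of the kind described in the MACS paper takes the largest of several candidate rates. The sketch below is a simplified illustration: the function name, the dictionary layout, and the example counts are assumptions, and scaling between control and treatment depth is left out.

```python
def local_lambda(control_counts, window_bp, genome_rate_per_bp):
    """Expected reads in a candidate window: the max over several background scales.

    control_counts: {scale_bp: control reads within that many bp around the candidate}
    window_bp: width of the candidate window.
    genome_rate_per_bp: genome-wide background rate (reads per bp).
    """
    candidates = [genome_rate_per_bp * window_bp]
    for scale_bp, reads in control_counts.items():
        # Convert each regional count to an expected count for the window.
        candidates.append(reads / scale_bp * window_bp)
    return max(candidates)

# Hypothetical candidate: locally elevated background in the surrounding 1 kb.
lam = local_lambda({1_000: 30, 10_000: 120}, window_bp=200, genome_rate_per_bp=0.005)
print(lam)
```

Here the genome-wide rate alone would predict about 1 read in the window, while the 1 kb neighbourhood predicts about 6, so the test is run against the larger value.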

IDR: Irreproducible Discovery Rate. The IDR framework quantifies cross-replicate consistency. It models the joint distribution of peak rankings from two replicates as a mixture of reproducible signals and irreproducible noise, fitting a copula mixture model to separate the two components. For each peak, the local idr is the posterior probability that it belongs to the irreproducible component; the reported IDR for a peak is the expected fraction of irreproducible peaks among all peaks ranked at least as high, which makes it analogous to a false discovery rate.

From IDR to consensus peaks. IDR is computed on peaks matched between the two replicates, so each matched pair receives a single IDR value. A peak called in only one replicate has no partner to rank against and is excluded. Matched peaks then fall into two groups: those below the IDR threshold (e.g., IDR < 0.05), which are kept, and those at or above it, which passed the individual peak caller but are ranked inconsistently across replicates. Requiring a peak to pass a stringent significance threshold independently in each replicate before comparison is too strict, because genuine binding sites at the edge of detectability may just miss the cutoff in one replicate. For that reason peaks are called with a relaxed threshold and the IDR ranking decides which are retained. In the ENCODE pipeline, the same procedure applied to pseudoreplicates of the pooled data yields a second peak set, and the larger of the two is reported as the optimal set.

Critical limitation: IDR corrects for technical noise but not for systematic antibody bias. If the same poor antibody is used in both replicates, IDR will produce perfectly reproducible but biologically wrong results — replicate consistency is necessary but not sufficient for biological validity.

Layer 3: Reading the Regulatory Code — Motif Analysis

ChIP-seq tells you where a transcription factor binds; motif analysis tells you what sequence features recruit binding. These are complementary: ChIP-seq identifies binding sites genome-wide, and motif analysis reveals the DNA sequence preferences that drive binding.

Position weight matrices (PWMs) represent transcription factor binding preferences as a \(4 \times L\) matrix of nucleotide frequencies or weights at each position. For a transcription factor with binding motif of length \(L\), the PWM entry \(W(b,i)\) encodes the log-likelihood ratio of base \(b\) at position \(i\), and the score of a sequence is the sum of the entries selected by its bases:

\[W(b,i) = \log \frac{P(b \mid \text{PWM}, i)}{P(b \mid \text{background})}, \qquad S(\text{sequence}) = \sum_{i=1}^{L} W(\text{base}_i, i)\]

The score \(S\) is the log-likelihood ratio of the sequence being generated by the PWM vs. by a background model (typically mononucleotide frequencies). A score of zero means the sequence is equally likely under both models; positive scores indicate a match to the motif.
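The log-likelihood ratio score can be made concrete with a small example. The sketch below builds log2-odds weights from per-position base counts and scores a candidate site; the pseudocount of 1, the uniform background, and the toy three-position motif are assumptions made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(count_columns, background=None, pseudocount=1.0):
    """Turn per-position base counts into log2-odds weights, one dict per position."""
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for counts in count_columns:
        total = sum(counts[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def score(pwm, seq):
    """Sum the weight of the observed base at each motif position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(col[base] for col, base in zip(pwm, seq))

# Toy motif "ACG" observed in 8 aligned sites.
counts = [
    dict(A=8, C=0, G=0, T=0),
    dict(A=0, C=8, G=0, T=0),
    dict(A=0, C=0, G=8, T=0),
]
pwm = build_pwm(counts)
print(score(pwm, "ACG"), score(pwm, "TTT"))
```

The consensus site scores positive and the mismatched site scores negative. The pseudocount keeps unobserved bases from producing a weight of negative infinity.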

Two modes of motif analysis:

  1. De novo discovery (MEME, HOMER, STREME): Given a set of peak sequences, find overrepresented motifs without prior knowledge. MEME uses EM to fit a mixture model where some sequences contain the motif and others are background. The output is a set of PWMs ranked by statistical significance (\(E\)-value). De novo discovery is essential for studying poorly characterized transcription factors or species where known motif databases are incomplete.

  2. Known motif enrichment (FIMO, AME): Scan peak sequences against a database of known PWMs (JASPAR, CisBP, HOCOMOCO). For each PWM, compute whether sequences matching that motif are enriched in the peak set relative to a matched background. Known motif analysis can nominate which transcription factor families are most likely driving the ChIP signal; factors within a family often share near-identical motifs, so sequence alone rarely distinguishes them.
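The enrichment comparison in the second mode can be sketched as a one-sided hypergeometric test on counts of motif-containing sequences. This is one reasonable choice of test, not a description of how any particular tool scores enrichment; the function name and the example counts are assumptions, and deciding what counts as a motif hit (for example a PWM score threshold) is outside the sketch.

```python
from math import comb

def enrichment_pvalue(peak_hits, n_peaks, bg_hits, n_bg):
    """One-sided hypergeometric P(at least peak_hits motif-containing peaks).

    peak_hits / n_peaks: peak sequences with a motif match / all peak sequences.
    bg_hits / n_bg: the same counts for the matched background set.
    """
    total = n_peaks + n_bg
    hits = peak_hits + bg_hits
    denom = comb(total, n_peaks)
    upper = min(hits, n_peaks)
    numer = sum(
        comb(hits, k) * comb(total - hits, n_peaks - k)
        for k in range(peak_hits, upper + 1)
    )
    return numer / denom

# Hypothetical counts: motif found in 60 of 100 peaks vs 20 of 100 background sequences.
p_enriched = enrichment_pvalue(60, 100, 20, 100)
p_null = enrichment_pvalue(20, 100, 20, 100)
print(f"{p_enriched:.2e} {p_null:.2f}")
```

The background set matters as much as the test: if it is not matched to the peaks for GC content and length, the p-value reflects composition differences rather than motif enrichment.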

The relationship between motif strength and binding is not linear. A high-scoring motif match does not guarantee binding: the motif may be in closed chromatin, or the transcription factor may not be expressed in the cell type under study. Conversely, many genuine ChIP-seq peaks contain weak motif matches or no identifiable motif at all — transcription factors can bind through cooperative interactions with other factors that are not captured by the PWM model. The PWM assumes each position contributes independently to binding energy, which is a simplification: nucleotide dependencies, DNA shape, and cofactor interactions all modulate binding affinity. Roughly 30-50% of ChIP-seq peaks for a typical transcription factor lack a detectable canonical motif, reflecting indirect binding via protein-protein interactions rather than direct DNA contact.

Positional enrichment analysis adds signal. Rather than scanning peaks uniformly, compute the spatial distribution of motif occurrences relative to the summit (peak center). For direct-binding transcription factors, the canonical motif should be enriched within ±50 bp of the summit. Loss of this positional enrichment is a signature of indirect binding — the factor is present but not directly contacting DNA at that site.
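Positional enrichment reduces to comparing the observed fraction of motif hits near the summit with what a uniform spread would give. The sketch below assumes motif offsets have already been computed as signed distances from each peak summit; the function names, the ±50 bp window, and the ±250 bp scanned region are illustrative assumptions.

```python
def central_fraction(motif_offsets, window=50):
    """Fraction of motif hits whose offset from the peak summit is within ±window bp."""
    if not motif_offsets:
        return 0.0
    return sum(1 for d in motif_offsets if abs(d) <= window) / len(motif_offsets)

def expected_uniform(window, peak_half_width):
    """Fraction expected if hits were spread uniformly across the scanned region."""
    return min(1.0, (2 * window + 1) / (2 * peak_half_width + 1))

# Signed distances (bp) from summit to motif centre, one per motif hit.
offsets = [-12, 3, 40, -48, 7, 150, -180, 22]
observed = central_fraction(offsets, window=50)
expected = expected_uniform(window=50, peak_half_width=250)
print(observed, round(expected, 3))
```

An observed fraction far above the uniform expectation is consistent with direct binding at the motif; a fraction close to it is consistent with indirect recruitment.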

Layer 4: Chemical Memory — DNA Methylation

DNA methylation (5-methylcytosine at CpG dinucleotides) is the most stable epigenetic mark, persisting through cell division via maintenance methylation by DNMT1, which recognizes hemi-methylated CpGs after replication and methylates the daughter strand. This self-templating mechanism makes methylation a true epigenetic memory — it propagates information across cell generations without changing the DNA sequence.

Bisulfite sequencing (WGBS) treats DNA with sodium bisulfite, which deaminates unmethylated cytosines to uracil while leaving 5-methylcytosine intact. After PCR, unmethylated cytosines appear as thymines, while methylated cytosines remain as cytosines. Methylation level at a single CpG is estimated as:

\[M = \frac{C_{\text{unconverted}}}{C_{\text{unconverted}} + T_{\text{converted}}}\]

where \(C_{\text{unconverted}}\) is the count of C reads and \(T_{\text{converted}}\) is the count of T reads at that CpG position. The binomial confidence interval for \(M\) depends on coverage: at 10x coverage, a CpG with 8 C reads and 2 T reads has \(M = 80\%\) with a wide 95% confidence interval (roughly 45-95%); at 100x coverage, the same proportion has a much tighter interval (roughly 71-87%). This coverage dependence is why methylation analysis requires either deep coverage or aggregation across CpGs (tiling windows or regional averages).
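The coverage dependence can be checked with a short calculation. The sketch below estimates \(M\) and a 95% interval using the Wilson score method, which is one reasonable choice among several binomial intervals; because the exact bounds depend on the method, they differ slightly from the rounded figures quoted in the text. The function name is hypothetical.

```python
import math

def methylation_level(c_reads, t_reads, z=1.96):
    """Return (M, lower, upper) using the Wilson score interval."""
    n = c_reads + t_reads
    if n == 0:
        raise ValueError("no coverage at this CpG")
    m = c_reads / n
    denom = 1 + z * z / n
    centre = (m + z * z / (2 * n)) / denom
    half = z * math.sqrt(m * (1 - m) / n + z * z / (4 * n * n)) / denom
    return m, centre - half, centre + half

for c, t in [(8, 2), (80, 20)]:
    m, lo, hi = methylation_level(c, t)
    print(f"coverage={c + t:>3}  M={m:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

Both sites have the same point estimate, but the interval at 10x spans nearly half the possible range, which is why single-CpG calls at low coverage are usually aggregated over windows.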

Reduced representation bisulfite sequencing (RRBS) uses MspI digestion (cuts at CCGG sites) to enrich for CpG-rich regions before bisulfite conversion, reducing sequencing cost by focusing on promoters and CpG islands. RRBS covers ~5% of CpGs genome-wide but concentrates on the CpG islands most relevant for gene regulation. The tradeoff: RRBS misses distal regulatory elements and repetitive regions where methylation plays structural roles, and the MspI bias prevents interrogation of non-CCGG CpGs.

Genomic context determines functional interpretation. CpG islands (CGI, typically defined as >200 bp with >50% GC content and >0.6 observed/expected CpG ratio) in promoter regions are normally unmethylated. Methylation of promoter CGIs is robustly associated with transcriptional silencing. But the direction of causality is not resolved: methylation may silence the gene, or the silent gene may become methylated as a secondary consequence of losing active transcription. This ambiguity cannot be resolved from static methylation measurements — it requires perturbation experiments (dCas9-TET1 for demethylation or dCas9-DNMT3A for de novo methylation) and measurement of the transcriptional response.
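The three CpG island criteria quoted above can be evaluated directly on a sequence. The sketch below computes length, GC content, and the observed/expected CpG ratio, taking the expected count as (number of C × number of G) / length; the function names are hypothetical, and real annotation tools scan the genome with sliding windows rather than testing a single pre-cut sequence.

```python
def cpg_island_stats(seq):
    """Return (length, GC content, observed/expected CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return n, gc_content, obs_exp

def looks_like_cgi(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the >200 bp, >50% GC, >0.6 observed/expected thresholds from the text."""
    n, gc, oe = cpg_island_stats(seq)
    return n > min_len and gc > min_gc and oe > min_obs_exp

cpg_rich = "CG" * 150   # artificial 300 bp sequence, all CpG
at_rich = "AT" * 150    # artificial 300 bp sequence, no C or G
print(cpg_island_stats(cpg_rich), looks_like_cgi(cpg_rich), looks_like_cgi(at_rich))
```

The observed/expected ratio is the informative criterion: bulk genomic DNA is depleted of CpG because methylated cytosines deaminate over evolutionary time, so a ratio near or above 0.6 marks a region that has escaped that depletion.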

Gene body methylation follows a different pattern. Within transcribed regions, methylation is positively correlated with expression level, the opposite of the promoter relationship. One proposed explanation for this apparent paradox is that gene body methylation suppresses spurious transcription initiation from cryptic internal promoters. The functional significance of gene body methylation remains an area of active investigation.

Non-CpG methylation (CHG, CHH, where H is A, C, or T) is rare in most differentiated somatic cells but abundant in plants and in embryonic stem cells. Neurons are the notable exception among differentiated cells: in the human brain, non-CG methylation accumulates during postnatal development and is enriched in synaptic genes, suggesting a regulatory role in neuronal function that is independent of the canonical CpG paradigm.


Biological Interpretation

The four layers form an integrated picture. ChIP-seq identifies where specific proteins are bound. ATAC-seq reveals which regions of the genome are accessible for binding. Motif analysis reads the regulatory code written in the DNA sequence. DNA methylation records a persistent memory of gene expression state. No single layer is sufficient: accessibility without motif analysis cannot identify the regulator; motif analysis without accessibility finds matches that are never bound; ChIP-seq of one factor reveals its binding but not the broader regulatory landscape; methylation captures persistent states but misses dynamic regulatory changes.

Of these, ChIP-seq is the most fragile. Antibody quality is the dominant factor in data quality, and computational filtering cannot rescue a poor antibody. The IDR framework corrects for technical noise but not for systematic antibody bias: a peak that passes IDR with high statistical significance is reproducible but biologically wrong if the antibody targets the wrong protein. The FRiP score is the simplest quality diagnostic: for a well-characterized transcription factor, FRiP > 5% indicates enrichment; FRiP < 1% indicates failure. This should be checked before any downstream analysis.

ATAC-seq signal is more robust (fewer antibody-related artifacts) but less specific. A differential ATAC peak between conditions identifies a change in accessibility but not the regulator driving the change. ATAC-seq quality is primarily assay-dependent: fragment size distribution (nucleosomal ladder), TSS enrichment (ratio of signal at transcription start sites vs. flanking regions, typically >5 for good data), and the ratio of reads in peaks to total reads. Each metric captures a different failure mode: TSS enrichment indicates successful targeting of regulatory DNA, fragment periodicity indicates correct Tn5 activity, and FRiP-like metrics reflect signal-to-noise.
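TSS enrichment can be sketched as the aggregate signal at the TSS divided by the mean signal in the outer flanks. The version below follows one common convention (a ±2 kb aggregate profile normalized by its outermost 100 bp on each side); the function name, the flank width, and the toy profile are assumptions, and pipelines differ in smoothing and in whether they take the value at the centre or the maximum.

```python
def tss_enrichment(profile, flank_bp=100):
    """Signal at the profile centre divided by mean signal in the outer flanks.

    profile: aggregate per-base signal summed over all TSSs, centred on the TSS.
    """
    if len(profile) < 2 * flank_bp + 1:
        raise ValueError("profile is too short for the requested flank width")
    flanks = profile[:flank_bp] + profile[-flank_bp:]
    background = sum(flanks) / len(flanks)
    centre = profile[len(profile) // 2]
    return centre / background if background else float("inf")

# Toy ±2 kb profile: flat background of 1.0 with a spike of 8.0 at the TSS.
profile = [1.0] * 2000 + [8.0] + [1.0] * 2000
tss_score = tss_enrichment(profile)
print(tss_score)
```

Real profiles are broader and noisier than this spike, so a smoothed profile is normally used before reading off the value.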

DNA methylation analysis requires careful attention to context. Promoter CGI methylation correlates with repression, but causality is uncertain. Gene body methylation shows the opposite pattern (positive correlation with expression). Non-CG methylation, when present, is cell-type specific and developmentally regulated. The biological meaning of a methylation change cannot be interpreted without knowing the genomic context and the cell type.


Current Landscape (Q2 2026)

  • CUT&Tag has become the preferred method for histone modifications, particularly for low-input applications (100-1,000 cells). Multi-CUT&Tag enables simultaneous profiling of up to three histone marks in the same cells, resolving combinatorial chromatin states that bulk single-mark assays can only infer.
  • Single-cell epigenomics (scATAC-seq, scCUT&Tag, scMethylation) is revealing regulatory heterogeneity masked by bulk assays. scATAC-seq data is extremely sparse (typical coverage <1% of peaks per cell), driving development of imputation methods and latent semantic analysis adapted from natural language processing.
  • Multi-omic single-cell assays (10x Multiome: scRNA-seq + scATAC-seq from the same nucleus, Paired-Tag: scRNA-seq + scCUT&Tag) link chromatin state to gene expression in the same cell. This supports much stronger inference of regulatory relationships than separate assays, although co-measurement alone establishes correlation rather than causation.
  • Enformer (DeepMind, 2021) and similar sequence-to-function models predict functional genomic tracks (ChIP-seq, DNase-seq and ATAC-seq accessibility, CAGE expression) directly from DNA sequence, enabling in silico perturbation of regulatory elements without experimental assays. These models are transforming the interpretation of non-coding variants by predicting their impact on regulatory element activity.
  • Base-resolution methylation sequencing with long reads (Nanopore direct methylation detection, PacBio HiFi methylation calling from polymerase kinetics) is increasingly used in place of bisulfite sequencing. It detects methylation without the DNA degradation caused by bisulfite treatment, enabling methylation phasing across haplotypes and simultaneous detection of methylation and sequence variation.

Summary and Required Reading

  1. ChIP-seq / CUT&Tag measure protein-DNA binding — signal quality depends on antibody specificity. CUT&Tag requires fewer cells but both require careful antibody validation.
  2. ATAC-seq measures chromatin accessibility — detects open regions but cannot identify which transcription factor binds them. Requires motif analysis or orthogonal ChIP-seq for factor identification.
  3. MACS2 uses Poisson background modeling for peak detection; IDR assesses cross-replicate reproducibility but cannot correct for systematic antibody bias.
  4. Motif analysis (PWMs) reads the regulatory code — de novo discovery for unknown motifs, known motif scanning for factor identification. The PWM independence assumption limits accuracy for cooperative binding.
  5. DNA methylation (bisulfite sequencing) measures a self-propagating epigenetic mark: promoter CGI methylation correlates with repression, but the direction of causality is unresolved. Genomic context determines functional interpretation.

Required Reading

  • Zhang et al.: “Model-based Analysis of ChIP-Seq (MACS)” (Genome Biology, 2008).
  • Li et al.: “Measuring reproducibility of high-throughput experiments” (Annals of Applied Statistics, 2011).
  • Roadmap Epigenomics Consortium: “Integrative analysis of 111 reference human epigenomes” (Nature, 2015).

Johnson’s Rule: A mutation is a hardware defect. An epigenetic change is a software bug.