Chapter 14: Genome-Wide Association Studies (GWAS) and Statistical Genetics

Published: June 5, 2026

Modified: June 19, 2026

Johnson’s First Principle

Correlation between a genetic variant and a disease does not prove causation. Because of linkage disequilibrium, the associated (lead) variant is often not the causal variant; it may simply be correlated with it. Identifying the causal variant requires fine-mapping. Before you can interpret a GWAS result, you must understand that a “hit” is a chromosomal region of correlated alleles, not a confirmed biological mechanism.

Core Concepts

Linkage Disequilibrium and Haplotype Blocks

The human genome is not a collection of independent variants. Nearby SNPs are inherited together far more often than expected by chance, because recombination events are rare and localized to hotspots. This non-random co-inheritance is called linkage disequilibrium (LD).

Two LD statistics:

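\(r^2\): Squared Pearson correlation between the alleles carried at two loci, computed across haplotypes. \(r^2 = 1\) means the alleles at the two loci are perfectly correlated, so each fully predicts the other. Used in LD pruning, tag-SNP selection, and imputation accuracy assessment.
\(D'\): The covariance \(D\) between alleles at two loci, divided by its theoretical maximum given the allele frequencies. \(|D'| = 1\) means at least one of the four possible two-locus haplotypes is absent from the sample, which is consistent with no observed recombination between the loci.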
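As a minimal sketch of how the two statistics differ, the function below computes \(D\), \(|D'|\), and \(r^2\) from phased haplotypes. The function name `ld_stats`, the 0/1 allele coding, and the toy haplotype list are assumptions made for illustration; real pipelines typically estimate LD from genotype data with dedicated tools.

```python
def ld_stats(haplotypes):
    """Compute (D, |D'|, r^2) for two biallelic loci.

    `haplotypes` is a list of (a, b) tuples, one per phased chromosome,
    where a and b are 0/1 allele codes at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # 1-1 haplotype

    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    r2 = d * d / denom

    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


# Toy sample (hypothetical): the (1, 0) haplotype is never observed.
haps = [(1, 1)] * 4 + [(0, 1)] * 2 + [(0, 0)] * 4
d, d_prime, r2 = ld_stats(haps)
print(f"D={d:.3f}  |D'|={d_prime:.3f}  r^2={r2:.3f}")
```

In this toy sample one of the four haplotypes is missing, so \(|D'|\) reaches 1 even though \(r^2\) is well below 1: the loci show no evidence of recombination, yet one is an imperfect proxy for the other because their allele frequencies differ.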

LD structure defines haplotype blocks: chromosomal intervals within which little historical recombination has occurred. A single lead SNP in a GWAS may tag an entire block of hundreds of correlated variants. Fine-mapping is required to resolve which variant within the block is functional.

GWAS Study Design and Association Testing

A GWAS tests the statistical association of each of 500,000 to 10,000,000 SNPs with a phenotype of interest.

Case-control design: Binary trait (disease/healthy). Logistic regression: \(\text{logit}(P(\text{disease})) = \beta_0 + \beta_1 \cdot \text{genotype}_i + \boldsymbol{\gamma} \cdot \mathbf{C}\), where \(\mathbf{C}\) is a matrix of covariates (age, sex, ancestry PCs).
Quantitative trait: Continuous phenotype (e.g., BMI, LDL, gene expression). Linear regression: \(Y = \beta_0 + \beta_1 \cdot \text{genotype}_i + \boldsymbol{\gamma} \cdot \mathbf{C} + \epsilon\).

Genotype is coded as dosage (0, 1, 2) under an additive model, assuming the effect of two copies is twice the effect of one.
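To make the additive model concrete, here is a minimal sketch of a single-SNP test for a quantitative trait using ordinary least squares. It is a simplification under stated assumptions: covariates \(\mathbf{C}\) are omitted, the p-value uses a normal approximation rather than the t distribution, and the function name `additive_linear_test` and the simulated data are hypothetical. Production analyses use dedicated software that handles covariates, relatedness, and millions of variants.

```python
import math
import random
from statistics import mean


def additive_linear_test(dosages, phenotypes):
    """Regress a quantitative phenotype on allele dosage (no covariates).

    Returns (beta1, standard_error, p_value). The p-value uses a normal
    approximation to the test statistic, which is reasonable for large n.
    """
    n = len(dosages)
    g_bar, y_bar = mean(dosages), mean(phenotypes)
    sxx = sum((g - g_bar) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(dosages, phenotypes))

    beta1 = sxy / sxx                     # per-allele effect estimate
    beta0 = y_bar - beta1 * g_bar         # intercept
    rss = sum((y - beta0 - beta1 * g) ** 2 for g, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta1
    z = beta1 / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return beta1, se, p_value


# Simulated data (hypothetical): allele frequency 0.3, true per-allele effect 0.5.
rng = random.Random(0)
dosages = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(2000)]
phenotypes = [0.5 * g + rng.gauss(0, 1) for g in dosages]
beta1, se, p_value = additive_linear_test(dosages, phenotypes)
print(f"beta1={beta1:.3f}  se={se:.3f}  p={p_value:.2e}")
```

Because dosage enters the model as a single numeric predictor, \(\beta_1\) is the expected change in phenotype per additional copy of the coded allele, which is exactly the additive assumption described above.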

Population Stratification

Genetic ancestry is a confounder in GWAS: it correlates simultaneously with disease prevalence (due to environmental and lifestyle factors differing across populations) and with allele frequencies. Without correction, ancestry will drive spurious associations; the statistical signal of “this allele is associated with disease” is actually “this allele is more common in the ancestry group with higher disease prevalence.”

Correction strategy:

Compute genotype PCA from LD-pruned SNPs. The top 10 principal components capture the major ancestry axes.
Include PC1–PC10 as covariates in all regression models.
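Assess residual inflation using the genomic control inflation factor \(\lambda_{GC} = \text{median}(\chi^2_{\text{observed}}) / 0.455\), where 0.455 is the median of a \(\chi^2\) distribution with 1 degree of freedom. \(\lambda_{GC} > 1.1\) for a study of standard size suggests residual confounding, although in very large studies genuine polygenic signal can also inflate \(\lambda_{GC}\).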
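The inflation check in the last step can be sketched in a few lines. This example assumes the input is a list of two-sided association p-values from 1-degree-of-freedom tests, each strictly between 0 and 1; the function name `lambda_gc` and the synthetic p-values are illustrative, not part of any particular package.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df, about 0.4549.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def lambda_gc(p_values):
    """Genomic control inflation factor from two-sided 1-df p-values.

    Each p-value is converted back to a chi-square statistic via
    chi2 = z^2, where z is the normal quantile at p / 2.
    """
    chi2 = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Synthetic p-values (hypothetical): evenly spaced, as expected under the null.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
# Shrinking every p-value mimics genome-wide inflation of test statistics.
inflated_p = [p ** 1.5 for p in null_p]

print(f"null:     lambda_GC = {lambda_gc(null_p):.3f}")
print(f"inflated: lambda_GC = {lambda_gc(inflated_p):.3f}")
```

Using the median rather than the mean keeps the statistic insensitive to the handful of true strong associations in the tail, so it mainly reflects the bulk of the test statistics.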

Multiple Testing and Genome-Wide Significance

Testing \(\sim10^6\) independent SNPs requires a Bonferroni-corrected significance threshold of:

\[P < \frac{0.05}{10^6} = 5 \times 10^{-8}\]

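The standard \(P = 5 \times 10^{-8}\) threshold reflects roughly one million effectively independent common variants. It can be too liberal for studies with more independent tests (for example, sequencing-based studies that include many rare variants) and unnecessarily conservative for studies restricted to targeted arrays that test far fewer variants. A Manhattan plot visualizes the \(-\log_{10}(P)\) landscape across all chromosomes; a Q-Q plot compares observed vs. expected \(-\log_{10}(P)\) values to diagnose inflation.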
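The arithmetic behind the threshold and the Q-Q plot can be sketched as follows. The helper names `bonferroni_threshold` and `qq_points` are hypothetical, and the expected quantiles use the common \((i - 0.5)/n\) plotting position for the \(i\)-th smallest of \(n\) p-values, which is one convention among several.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests


def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs for a Q-Q plot.

    Under the null, sorted p-values follow uniform order statistics, so the
    i-th smallest of n is expected near (i - 0.5) / n.
    """
    n = len(p_values)
    ordered = sorted(p_values)
    return [
        (-math.log10((rank + 0.5) / n), -math.log10(p))
        for rank, p in enumerate(ordered)   # rank is 0-based here
    ]


threshold = bonferroni_threshold(0.05, 1_000_000)
print(f"threshold = {threshold:.1e}, -log10 = {-math.log10(threshold):.2f}")

# A handful of made-up p-values; points far above the diagonal
# (observed much larger than expected) indicate excess signal or inflation.
for expected, observed in qq_points([0.5, 0.01, 0.2, 0.8, 3e-9]):
    print(f"expected {expected:.2f}   observed {observed:.2f}")
```

On a Manhattan plot the genome-wide significance line is drawn at the \(-\log_{10}\) of this threshold, and the same transformed p-values supply the observed axis of the Q-Q plot.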

Fine-Mapping and Causal Variant Identification

A GWAS lead SNP identifies a genomic region, not a causal variant. Fine-mapping uses Bayesian methods (SuSiE, FINEMAP) to compute a credible set: the smallest set of variants that contains the causal variant with 95% posterior probability. LocusZoom visualizes the LD structure and association statistics around a locus.
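The credible-set idea can be illustrated with a deliberately simplified sketch. It assumes exactly one causal variant in the region and takes per-variant posterior probabilities as given; it is not the SuSiE or FINEMAP algorithm, both of which model multiple causal variants and estimate these probabilities from summary statistics and LD. The function name and the variant identifiers below are hypothetical.

```python
def credible_set(posterior_probs, coverage=0.95):
    """Smallest set of variants whose posterior probabilities sum to >= coverage.

    `posterior_probs` maps variant ID -> posterior probability of being the
    causal variant, assumed to sum to about 1 (single-causal-variant model).
    Returns (variant_ids, cumulative_probability).
    """
    ranked = sorted(posterior_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        total += prob
        if total >= coverage:
            break
    return chosen, total


# Made-up posterior probabilities for five variants in one LD block.
posteriors = {"varA": 0.60, "varB": 0.25, "varC": 0.08, "varD": 0.04, "varE": 0.03}
variants, mass = credible_set(posteriors)
print(variants, round(mass, 2))
```

Taking variants in descending order of posterior probability is what makes the set the smallest one reaching the target coverage. When LD is tight, probability mass is spread over many variants and the credible set grows, which is why larger samples and functional annotation are often needed to narrow it.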

The causal variant is often not a protein-coding variant. Most GWAS signals map to non-coding regulatory elements (promoters, enhancers) and affect gene expression rather than protein function.

Expression Quantitative Trait Loci (eQTL)

eQTL mapping identifies genetic variants that affect gene expression levels in specific tissues. GTEx (Genotype-Tissue Expression) provides tissue-specific eQTL maps; its v8 release sampled 54 tissue sites and mapped eQTLs in 49 of them.

Colocalization methods (e.g., coloc) test whether a GWAS signal and a tissue-specific eQTL share the same causal variant, providing mechanistic insight into how the risk allele alters biology. Related approaches such as SMR use eQTLs as instruments to test whether expression of a gene is associated with the trait. A GWAS hit that colocalizes with a brain cortex eQTL for a specific gene supports, but does not prove, the hypothesis that the gene mediates the disease association through regulatory effects in that tissue.

Polygenic Risk Scores

Much of the heritability of complex diseases is distributed across thousands of common variants, each with a tiny individual effect. Polygenic Risk Scores (PRS) aggregate this distributed signal:

\[\text{PRS} = \sum_{i=1}^{M} \hat{\beta}_i \cdot g_i\]

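where \(\hat{\beta}_i\) is the GWAS effect size estimate for the effect allele at SNP \(i\), \(g_i\) is the individual's dosage of that allele, and \(M\) is the number of SNPs in the score. PRS have been shown to stratify risk for coronary artery disease, breast cancer, and type 2 diabetes, although their clinical utility is still being evaluated.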
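The weighted sum translates directly into code. This sketch assumes effect sizes and dosages are keyed by the same SNP identifiers and that dosages count the same effect allele the GWAS reported; allele alignment, LD-aware weight adjustment, and variant selection are all outside its scope. The function name and the SNP identifiers are hypothetical.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Weighted sum of effect-allele dosages for one individual.

    `effect_sizes` maps SNP ID -> estimated per-allele effect (beta-hat).
    `dosages` maps SNP ID -> dosage of the effect allele (0 to 2, possibly
    fractional for imputed genotypes). SNPs absent from `dosages` are skipped.
    """
    return sum(
        beta * dosages[snp]
        for snp, beta in effect_sizes.items()
        if snp in dosages
    )


# Made-up weights and genotypes for three SNPs.
weights = {"snp1": 0.10, "snp2": -0.05, "snp3": 0.20}
person = {"snp1": 2, "snp2": 1, "snp3": 0}
print(round(polygenic_risk_score(weights, person), 3))
```

A raw score is only meaningful relative to a reference distribution, so in practice it is standardized against scores from a comparable population before being interpreted as high or low risk.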

Biological Interpretation

A GWAS hit is a chromosomal region, not a gene. The closest protein-coding gene to the lead SNP is not necessarily the causal gene; enhancer-promoter chromatin loops can bridge a variant to a target gene 500 kb or more away. Interpreting a GWAS hit requires integration with tissue-specific eQTL databases, chromatin accessibility data (ATAC-seq, open chromatin enrichment), and 3D genome contact maps (Hi-C, TAD boundaries).

The Reviewer Mindset: Population stratification is the primary confounding threat in GWAS. The Q-Q plot and genomic inflation factor (\(\lambda_{GC}\)) are the first quality checkpoints. \(\lambda_{GC} > 1.1\) for a study of standard size points to residual stratification or cryptic relatedness; in very large studies, genuine polygenic signal can also inflate \(\lambda_{GC}\), so inflation alone does not prove confounding. Polygenic Risk Scores are most predictive within the ancestry group in which they were trained. Applying a European-ancestry PRS to African-ancestry populations can lose an estimated 30–80% of predictive power because LD structure and allele frequencies differ systematically across ancestry groups. This is a critical equity limitation of current PRS approaches, not a minor technical footnote: it represents a structural health disparity, because African-ancestry individuals are simultaneously most underrepresented in GWAS discovery cohorts and least well-served by the resulting risk scores.

Summary and Required Reading

GWAS identifies genomic regions associated with complex traits by testing millions of variants simultaneously. The primary analytical challenges are population stratification, multiple testing correction, and interpreting statistical associations, most of which reflect regulatory mechanisms rather than protein-coding mutations. Fine-mapping and eQTL colocalization are required to move from statistical signal to biological mechanism.

Required Reading:

Visscher et al.: “10 years of GWAS discovery: biology, function, and translation” (American Journal of Human Genetics, 2017).
Price et al.: “Principal components analysis corrects for stratification in genome-wide association studies” (Nature Genetics, 2006).
Martin et al.: “Clinical use of current polygenic risk scores may exacerbate health disparities” (Nature Genetics, 2019).