Chapter 20: Multi-Omics Integration

Published

June 5, 2026

Modified

June 19, 2026

Johnson’s First Principle: Integrating Bad Data from Two Assays Yields Twice as Much Bad Data

Intersecting lists of statistically significant genes from RNA-seq and proteomics does not integrate data; it compounds error rates. The overlap is dominated by the most highly expressed, easiest-to-measure genes, not by the biologically most relevant ones. True integration requires dimensionality reduction algorithms that mathematically identify shared variance between molecular layers, independent of arbitrary significance thresholds.

A gene with p-value \(10^{-10}\) in RNA-seq and p-value \(10^{-10}\) in proteomics may appear to be a biologically validated hit. But if the top 100 genes from each dataset overlap by only 20 genes, and those 20 are the most abundant housekeeping genes, the “validated” list is dominated by measurement convenience rather than biological importance. This chapter covers the methods that avoid this trap.

Core Concepts

The Three Axes of Integration

Horizontal integration: Same data modality, different batches or cohorts (e.g., RNA-seq dataset A + RNA-seq dataset B). Challenge: batch effects (Chapter 6). Correction: ComBat, Harmony, scVI for single-cell data. This is the simplest integration axis because all data share the same distributional assumptions.
Vertical integration: Different modalities, same biological samples (e.g., RNA-seq + ATAC-seq + proteomics on the same cells). Challenge: different data distributions (Poisson for RNA-seq, Gaussian for log-processed proteomics, binomial for ATAC-seq) and different missingness mechanisms (dropout in scRNA-seq vs. detection limits in proteomics). This is the axis where MOFA and CCA are most useful, because the shared sample structure constrains the solution.
Diagonal integration: Different modalities, different samples (e.g., scRNA-seq from one study + spatial transcriptomics from another). Challenge: no paired measurements exist, so integration relies on shared features (genes) and assumed biological similarity between datasets. Methods must infer a joint latent space without any direct correspondence between samples. The most challenging and most common real-world integration scenario; no experiment measures everything in every cell.

Selecting the Integration Method

The choice of integration method depends primarily on whether you have paired samples and whether you seek shared or complementary variation:

If you have paired samples and want shared variation: MOFA+ factorizes the data into shared and modality-specific factors, with interpretable variance decomposition per factor per modality. The output tells you “how much of RNA variance and how much of protein variance is explained by this shared factor.”
If you have paired samples and want correlational structure: CCA finds linear combinations that are maximally correlated across modalities. The output tells you “which RNA and protein measurements co-vary across samples.”
If you have unpaired samples (diagonal integration): Seurat’s anchor-based CCA or optimal transport methods (FGOT, MultiVIT) infer a matching between datasets based on shared features. These do not require any paired observations.
If you need causal inference: eQTL or pQTL analysis (below) uses genetic variants as instrumental variables.

Canonical Correlation Analysis (CCA)

CCA finds linear combinations of variables from two datasets that are maximally correlated with each other. For two data matrices \(X\) and \(Y\) (e.g., RNA expression of 20,000 genes and ATAC peaks at 100,000 regions across the same 100 samples), CCA finds vectors \(w_x\) and \(w_y\) such that:

\[\rho = \max_{w_x, w_y} \frac{w_x^T X^T Y w_y}{\sqrt{w_x^T X^T X w_x} \sqrt{w_y^T Y^T Y w_y}}\]

The numerator \(w_x^T X^T Y w_y\) measures the cross-covariance between the two linear combinations; the denominator normalizes by the variance of each combination. The result \(\rho\) is the canonical correlation; the Pearson correlation between the two projections.

The \(p \gg n\) problem. The CCA objective requires inverting \(X^T X\) and \(Y^T Y\), which are \(p \times p\) covariance matrices. With 20,000 genes and 100 samples, these matrices are rank-deficient (singular); the optimization has no unique solution because the sample covariance cannot reliably estimate \(p(p+1)/2 \approx 200\) million unique entries from only 100 observations. In practice, this is resolved either by applying PCA to each modality first (reducing each to 30-50 PCs, as established in Chapter 5, then running CCA on the PC scores) or by using regularized CCA with a ridge penalty to stabilize the covariance estimates. The Seurat integration pipeline uses the PCA-then-CCA approach: it decomposes each dataset via SVD, projects into PC space, and finds canonical correlates in the compressed latent space where \(n > p\).

The housekeeping trap. CCA components with near-perfect cross-modality correlation (\(\rho > 0.9\)) often reflect technical artifacts rather than biological coordination. If both modalities are dominated by abundant housekeeping genes (GAPDH, ACTB), the first CCA component may simply capture “both measurements are working correctly” rather than a biological signal. The more informative components are typically the ones with moderate correlation (\(\rho \sim 0.3-0.7\)) that capture specific biological processes.

Multi-Omics Factor Analysis (MOFA)

MOFA (Argelaguet et al., 2020) extends factor analysis to multiple data modalities:

\[Y_m = ZW_m^T + \epsilon_m\]

where \(Y_m\) is the data matrix for modality \(m\), \(Z\) is the shared factor matrix (across all modalities), \(W_m\) is the modality-specific weight matrix, and \(\epsilon_m\) is modality-specific noise. The factors \(Z\) capture the sources of variation shared across data types. Each factor is a latent variable that explains a certain amount of variance in each modality.

The sparsity prior that makes MOFA interpretable. MOFA places an automatic relevance determination (ARD) prior on each weight column \(W_{:, k}^{(m)}\), which shrinks irrelevant factors toward zero in a modality-specific manner. A factor that captures biological variation present in RNA but absent in ATAC will have large weights in the RNA modality and near-zero weights in the ATAC modality; the ARD prior automatically prunes it from ATAC. This is what produces the interpretable variance decomposition that distinguishes MOFA from standard factor analysis, where every factor loads on every modality with no built-in sparsity mechanism.

Key outputs: 1. Variance decomposition per factor per modality: “Factor 1 explains 30% of RNA variance and 50% of ATAC variance.” This tells you which modalities contribute to each biological process. 2. Factor values per sample: Each sample gets a score for each factor, enabling clustering, time-series analysis, or association with clinical outcomes. 3. Feature weights per factor: For each factor, the genes, peaks, or proteins with the highest absolute weights are the ones driving that factor.

The dominant-factor caveat. The first factor in any MOFA analysis is almost always a technical confound (batch, age, sequencing depth, sample quality). This is not a bug; it reflects the reality that technical variation often dominates biological variation across modalities. Before interpreting factors biologically, verify that they correlate with the variable of interest and not with known confounders. The number of truly informative biological factors is typically small (3-10), regardless of the number of input modalities.

eQTL and pQTL: Genetic Anchors for Causal Inference

Expression quantitative trait loci (eQTL) use genetic variants as causal anchors to infer that gene expression changes are causal rather than confounded. The logic: if a SNP is associated with both gene expression (as an eQTL) and disease (as a GWAS hit), and the eQTL effect is plausibly causal (the variant affects a regulatory element), then the gene is a candidate causal mediator of the disease association. This is Mendelian randomization; the genetic variant is randomized at conception, independent of confounders like lifestyle or environment.

cis-eQTL vs. trans-eQTL. A cis-eQTL affects expression of a nearby gene (typically within 1 Mb of the variant). These are the most common and interpretable; the variant likely lies in a promoter, enhancer, or other regulatory element of that gene. A trans-eQTL affects expression of a distant gene (different chromosome or >1 Mb away). These are rarer, harder to validate, and typically have smaller effect sizes, but they reveal regulatory networks that cross chromosomal boundaries.

The GTEx consortium provides the most comprehensive eQTL atlas: ~50 tissues, >800 donors, millions of cis-eQTL associations. The key finding: most GWAS variants are enriched in non-coding regulatory regions and act through gene expression regulation, not protein-coding changes. For a typical complex disease, ~60-80% of GWAS loci have a nearby eQTL in the relevant tissue, implicating gene expression dysregulation as the causal mechanism.

pQTL extends this logic to proteins. Protein QTLs are more directly relevant to phenotype (proteins are the functional molecules) but are harder to identify due to smaller sample sizes in proteomics studies and higher measurement variability. The overlap between eQTL and pQTL signals is only ~20-40%; consistent with the mRNA-protein decoupling established in Chapter 17. A variant that affects protein abundance without affecting mRNA is evidence of post-transcriptional regulation.

Mendelian randomization assumptions. The causal interpretation of eQTL-mediated GWAS colocalization rests on three assumptions that are violated more often than satisfied: (1) no horizontal pleiotropy: the genetic variant affects the disease only through the gene, not through independent pathways; (2) no linkage disequilibrium confounding: the true causal variant is the eQTL, not a correlated variant in LD that affects disease through a different gene; and (3) reliable exposure prediction: the variant is a strong predictor of gene expression (the \(F\)-statistic for the variant-expression association exceeds 10). When these assumptions fail, colocalization produces a false causal inference: the variant and the gene expression appear causally linked to the disease when they are independent correlates of a hidden third factor.

Biological Interpretation

Before integrating, ask: does this integration solve a biological question that single-omics cannot? The most successful multi-omics studies ask questions that are inherently multi-modal: “Which chromatin accessibility changes drive expression changes?” (requires ATAC + RNA), “Which post-translational modifications regulate protein activity independently of expression?” (requires proteomics + phosphoproteomics), or “Which genetic variants regulate gene expression in disease-relevant tissues?” (requires GWAS + eQTL).

CCA components with near-perfect cross-modality correlation should be examined critically; they may reflect technical artifacts (sample quality, batch, sequencing depth) rather than biological coordination. The number of truly informative factors is typically small (3-10), regardless of the number of modalities.

The dominant factor in any multi-omics analysis is almost always a technical confound (batch, age, sequencing depth). Before interpreting factors biologically, verify that they correlate with the variable of interest and not with known confounders. A MOFA factor that strongly separates cases from controls but also separates batch 1 from batch 2 is dominated by batch; the biological and technical signals are confounded, and the factor cannot be interpreted as biological.

Summary and Required Reading

Three integration axes: horizontal (same modality, different batches) uses batch correction (ComBat, Harmony); vertical (different modalities, same samples) uses MOFA/CCA; diagonal (different modalities, different samples) uses optimal transport or anchor-based methods.
CCA finds maximally correlated linear combinations; the foundation of Seurat integration; caution: first components often capture technical artifacts, not biology.
MOFA factorizes variance into shared and modality-specific components with interpretable variance decomposition per factor per modality.
eQTL/pQTL use genetic variants as causal anchors linking genotype to molecular phenotype; the ~20-40% overlap between eQTL and pQTL signals reflects post-transcriptional regulation.
The dominant factor is almost always technical: verify factor-biological associations are not driven by batch, depth, or sample quality before biological interpretation.

Required Reading

Argelaguet et al.: “MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data” (Genome Biology, 2020).
GTEx Consortium: “The GTEx Consortium atlas of genetic regulatory effects across human tissues” (Science, 2020).

Johnson’s Rule: If this integration does not solve a question that single-omics cannot, you are adding complexity, not insight.