Chapter 16: Systems Biology and Network Graphing

Johnson’s First Principle: No Gene Acts Alone

A protein that physically interacts with 100 other proteins is biologically more important than a protein that interacts with 3, regardless of fold-change. A transcription factor at the junction of DNA damage sensing and apoptosis is central to chemotherapy response, even if its expression never changes. Phenotype emerges from network interactions, not isolated gene lists.

The differential expression analysis from Chapters 4 and 6 produces lists of genes ordered by p-value. But biology does not operate through lists — it operates through networks. A gene with moderate fold-change but high network centrality may be more biologically relevant than the most differentially expressed gene, which could be a downstream bystander. This chapter shifts the unit of analysis from individual genes to the relationships between them.

Core Concepts

Network Theory: Nodes, Edges, and Centrality

A biological network \(G = (V, E)\) consists of nodes \(V\) (genes, proteins) and edges \(E\) (interactions). Edge types vary by data modality: physical binding (protein-protein interaction), co-expression correlation (transcriptomics), genetic interaction (CRISPR screens), or phosphorylation (kinase-substrate).

Three centrality metrics quantify node importance in fundamentally different ways, and each has a distinct biological interpretation:

Degree: The number of edges incident to a node. High-degree nodes (hubs) are connected to many partners. Biological meaning: Essential genes tend to have higher degree. Knockout of a hub is more likely to be lethal because it affects more pathways simultaneously. Degree is the simplest metric but highly sensitive to literature bias — well-studied proteins have more reported interactions regardless of biological centrality.
Betweenness: The fraction of shortest paths passing through a node. High-betweenness nodes are bottlenecks — their removal fragments the network into disconnected components. Biological meaning: Bottlenecks are often signaling integrators (e.g., TP53, AKT1) that sit at the convergence of multiple upstream pathways and diverge to multiple downstream effectors. A bottleneck protein is a particularly effective drug target because perturbing it disrupts the information flow through the entire pathway.
Eigenvector centrality: A recursive definition — a node is important if it is connected to other important nodes (the PageRank analogue). Biological meaning: Eigenvector centrality captures position within the network hierarchy. A transcription factor connected to highly central signaling proteins ranks higher than one connected to isolated peripheral proteins, even if their degrees are equal.

Weighted Gene Co-expression Network Analysis (WGCNA)

WGCNA (Langfelder & Horvath, 2008) constructs a gene co-expression network from expression data. Unlike PPI networks which use curated physical interactions, WGCNA infers networks from correlations in expression data across samples:

Correlation matrix: Compute pairwise Pearson correlations between all gene pairs across samples. For 20,000 genes, this produces a 20,000 \(\times\) 20,000 correlation matrix — 200 million correlations.
Soft thresholding: Raise the absolute correlation to a power \(\beta\) (typically 6-12): \(a_{ij} = |\text{cor}(x_i, x_j)|^\beta\). This suppresses weak correlations toward zero while preserving the rank order of moderate-to-strong correlations. Why not use a hard threshold (keep only correlations > 0.8, set others to zero)? A hard threshold creates a discontinuity: a correlation of 0.79 is zero while 0.81 is full strength, which is biologically arbitrary and makes the network unstable to small perturbations. Soft thresholding preserves the continuous nature of co-expression.

The power \(\beta\) is not an arbitrary parameter. WGCNA evaluates a range of powers (typically 1-20) and selects the lowest \(\beta\) where the network approximately follows scale-free topology, assessed by the \(R^2\) of the linear regression \(\log_{10} P(k) \sim \log_{10} k\) — the degree distribution should follow a power law \(P(k) \sim k^{-\gamma}\) with \(R^2 > 0.8\). Powers below this threshold produce networks that are too dense (all genes appear connected) to reveal modular structure. Higher \(\beta\) punishes weaker correlations more aggressively, producing smaller, tighter modules at the cost of potentially fragmenting genuine biological pathways.

The degree distribution diagnostic plot (log-log of degree vs. frequency) is the first quality check in any WGCNA analysis. A downward-curving rather than linear plot indicates the data lacks scale-free structure, and WGCNA modules may be artifactual — this is common in datasets with strong batch effects or low sample sizes (typically \(n < 15\)).

Topological overlap matrix (TOM): Transform the weighted adjacency matrix \(A = [a_{ij}]\) to TOM, where the similarity between two genes accounts for their shared neighbors. Formally:

\[w_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}, \quad l_{ij} = \sum_{u} a_{iu} a_{uj}, \quad k_i = \sum_{u} a_{iu}\]

where \(a_{ij} = |\text{cor}(x_i, x_j)|^\beta\) is the soft-thresholded adjacency, \(l_{ij}\) sums the products of adjacency weights through all shared neighbors \(u\) (indirect connectivity), and \(k_i\) is the total connectivity of node \(i\) (the sum of all edge weights incident to it). The numerator combines direct adjacency (\(a_{ij}\)) with indirect connectivity through shared neighbors (\(l_{ij}\)). The denominator normalizes by the minimum connectivity of the two nodes, preventing hub genes from achieving artificially high TOM with all network members simply because they are highly connected.

Biologically, TOM captures the idea that genes participating in the same pathway are connected through a chain of intermediate genes — the direct correlation may be weak, but if they share most of their network neighbors, they are functionally related. A concrete example: two genes with direct correlation of only 0.3 but each co-expressed with the same 15 intermediate pathway genes will have a high TOM, correctly placing them in the same module despite the weak direct correlation.

Hierarchical clustering: Cluster TOM dissimilarity (\(1 - \text{TOM}\)) into modules. Each module is a group of genes with high topological overlap, interpreted as a pathway or functional unit.
Module-trait correlation: Reduce each module to its eigengene (first principal component of all module genes’ expression), then correlate the eigengene with external traits (disease status, treatment response). This identifies which modules are disease-relevant.

The eigengene is the first principal component (PC1) of the module’s expression matrix, directly applying the SVD framework from Chapter 5. By the Eckart-Young theorem, PC1 captures the maximum possible variance explained by a single vector — it is the optimal one-dimensional summary of the module’s coordinated expression pattern. The fraction of total variance explained by the eigengene (the first singular value squared divided by the sum of all squared singular values) quantifies module coherence: a module whose eigengene explains >50% of its total variance is highly coordinated, while one explaining <30% is loosely associated and warrants examination for spurious clustering of unrelated genes.

The scale-free assumption. WGCNA assumes biological networks follow a scale-free topology: \(P(k) \sim k^{-\gamma}\), where a few hubs dominate. This is approximately true for many but not all biological networks. A network with uniform degree distribution (e.g., a metabolic pathway where every enzyme interacts equally with its immediate neighbors) is not scale-free, and WGCNA’s soft thresholding will produce spurious modules in such cases. The scale-free fit (\(R^2 > 0.8\)) should be verified before interpreting modules.

Protein-Protein Interaction Networks

PPI networks (STRING, BioGRID) catalog physical protein-protein interactions from experimental and computational sources. STRING scores interactions from 0-1, integrating evidence from co-expression, co-occurrence in the literature, text mining, and high-throughput screens. The score is not a confidence in a specific interaction but an aggregate of heterogeneous evidence — two proteins may have a STRING score of 0.9 because they are consistently co-expressed, not because they physically bind.

Network propagation (HotNet2). Given a set of seed genes (e.g., significantly mutated genes from a cancer study), propagate signal through the network to identify additional genes that are functionally connected:

\[F^{t+1} = \alpha A F^t + (1 - \alpha) F^0\]

where \(F\) is the heat vector (node relevance), \(A\) is the normalized adjacency matrix, and \(\alpha\) is the restart probability (typically 0.5-0.8). At each step, a fraction \(\alpha\) of the heat diffuses to neighboring nodes, and \((1 - \alpha)\) returns to the seed nodes. After convergence, nodes with high heat are prioritized as network-influential, even if they were not in the seed set.

The biological rationale: mutated cancer genes form subnetworks of interacting proteins that collectively drive tumorigenesis. A gene that is not itself mutated but is highly connected to mutated genes may be functionally important — its pathway is disrupted even if its sequence is not. Network propagation identifies such “passenger-by-proximity” genes.

Guilt-by-Association

Uncharacterized genes are assigned function based on their network neighbors. If a gene of unknown function is connected to three ribosomal proteins, it is likely involved in translation. This principle is the basis for most automated functional annotation pipelines (GO term prediction, gene function databases).

The limitation is circularity: well-studied genes have more known interactions and will appear as higher-priority nodes regardless of biological relevance. A gene connected to TP53 is not necessarily functionally related to TP53 — it may be connected because TP53 is the most studied protein in biology and interacts with many partners. Network propagation amplifies this bias: seeds are typically well-studied genes, and heat propagates to their well-studied neighbors. Comparing results against degree-matched controls (randomly selecting hubs as seeds) quantifies the extent of literature bias in the propagation result.

Biological Interpretation

The hub gene of a disease module is a better drug target than the most statistically significant gene — this is the central insight of systems biology. A gene with moderate fold-change but high network centrality may be more biologically relevant than the most differentially expressed gene, which could be a downstream bystander. But this principle presupposes that the network reflects functional biology, not literature bias.

WGCNA modules should be validated for biological coherence (enrichment of known pathways, consistency across independent datasets) before interpretation. A module that correlates with disease but contains genes from multiple unrelated pathways may be driven by a confound (batch effect, sample quality) rather than biology. The module eigengene-trait correlation is a single number with no inherent biological meaning — it must be interpreted in the context of the module’s functional annotation.

Network propagation amplifies signal but also amplifies bias: well-studied genes accumulate more known interactions and consistently rank higher in propagation analyses. This literature bias is systematic and must be accounted for by comparing against degree-matched controls. A propagation result that ranks only well-studied genes should be interpreted as a validation of known biology, not a discovery.

The contrast with Chapter 6 is instructive. Differential expression identifies individual genes by statistical significance; network analysis identifies functional modules by connectivity. These approaches are complementary: differential expression identifies the components that change, and network analysis reveals the system architecture within which those changes operate. A differentially expressed gene that is a network hub is more biologically interesting than one that is isolated.

Current Landscape (Q2 2026)

Multi-omic network integration (MOFA+, WGCNA on multi-omic data) identifies conserved modules across transcriptomic, proteomic, and metabolomic layers, revealing pathways that are coherently regulated at multiple molecular levels — a gene whose RNA, protein, and metabolite levels all co-vary is more likely to be functionally important.
Single-cell network inference (SCENIC+, Pando) infers transcription factor regulatory networks from scRNA-seq data using cis-regulatory motif information and co-accessibility, moving beyond bulk co-expression to cell-type-specific regulatory architectures.
LLM-based network curation (GPT-4-annotated interactions from literature) is augmenting STRING but introduces hallucination risks — a literature-mining model that invents interactions based on plausible text patterns will produce statistically clean but biologically wrong networks.
Causal network inference (Perturb-seq data + DAG learning) is replacing correlation-based network methods by using genetic perturbations to identify causal regulatory relationships, moving from “gene A correlates with gene B” to “gene A regulates gene B.”

Summary and Required Reading

Network centrality metrics (degree, betweenness, eigenvector) quantify node importance in fundamentally different ways — each has a distinct biological interpretation.
WGCNA constructs weighted co-expression networks via soft thresholding (preserving continuous edge weights) and TOM (capturing shared-neighbor relationships), identifying disease-relevant modules.
Network propagation prioritizes genes connected to known drivers via heat diffusion, but amplifies literature bias — well-studied genes consistently rank higher.
Guilt-by-association assigns function from network neighbors — powerful for novel genes but circular for well-studied ones.
Differential expression and network analysis are complementary — DE identifies the components that change; networks identify the system architecture they operate within.

Required Reading

Langfelder & Horvath: “WGCNA: an R package for weighted correlation network analysis” (BMC Bioinformatics, 2008).
Szklarczyk et al.: “STRING v11: protein-protein association networks” (Nucleic Acids Research, 2019).

Johnson’s Rule: The hub gene of a disease module is a better drug target than the most statistically significant gene.