Chapter 23: Neural Networks and Deep Learning in Biology
Johnson’s First Principle: The Illusion of Magic
A neural network does not “think” — it slides down a gradient of a loss function. The “hidden layers” are not hidden knowledge — they are hierarchical feature transformations. A neural network is a universal function approximator. It will learn the function you train it on — even if that function is noise.
The appeal of deep learning in biology is the promise of automatic feature discovery — letting the model learn relevant patterns from raw data rather than hand-engineering features. Chapter 21 established the risks of \(p > n\) and data leakage. Chapter 22 showed that gradient boosting handles tabular omics data better than neural networks for most biological prediction tasks. So where do neural networks add value? For sequence data (DNA, RNA, protein, images), where patterns are hierarchical and spatially structured, convolutional networks and transformers uncover features that tree-based models cannot see. The key is knowing when deep learning is the right tool and when it is overkill.
Core Concepts
Architecture: Perceptrons, Weights, and Biases
The single neuron (perceptron) computes a weighted sum of its inputs followed by a non-linear activation: \(z = w^T x + b\), \(a = \sigma(z)\). The input vector \(x\) has \(p\) elements (one per feature), \(w\) is the weight vector, \(b\) is the bias, and \(\sigma\) is the activation function.
A layer is \(h\) such neurons, each with its own weight vector and bias. Stacking them creates the matrix form: \(Z = XW + b\). Here \(X\) is the \(n \times p\) input matrix, \(W\) is the \(p \times h\) weight matrix (column \(j\) contains the weights for neuron \(j\)), and \(b\) is the bias vector of length \(h\). Each column of \(Z\) is one neuron’s output across all \(n\) samples; each row is all \(h\) neurons’ outputs for one sample. A fully connected (dense) layer applies the activation element-wise to \(Z\): \(A = \sigma(Z)\).
A deep network stacks these layers. The output of layer \(\ell\) becomes the input to layer \(\ell+1\):
\[H_1 = \sigma_1(XW_1 + b_1) \quad \rightarrow \quad H_2 = \sigma_2(H_1W_2 + b_2) \quad \rightarrow \quad \cdots \quad \rightarrow \quad \hat{y} = \sigma_L(H_{L-1}W_L + b_L)\]
Each layer learns increasingly abstract features. The first hidden layer might detect individual transcription factor binding motifs from raw sequence; the second layer detects combinations of motifs that form cis-regulatory modules; deeper layers detect higher-order regulatory logic. The depth of the network determines the hierarchy of features it can represent.
Activation functions provide the non-linearity without which stacked linear layers collapse to a single linear transformation:
- Sigmoid: \(\sigma(z) = 1/(1+e^{-z})\) outputs probabilities in (0,1), used for binary classification output layers. Suffers from vanishing gradient in hidden layers.
- Tanh: outputs in (-1,1), zero-centered — improves optimization compared to sigmoid but still saturates.
- ReLU: \(\max(0, z)\) — the default hidden-layer activation. Solves vanishing gradient for positive values. The “dying ReLU” problem (neurons that never activate) can occur with poor initialization.
- GELU: a smooth ReLU variant used in transformers — approximates a Gaussian cumulative distribution, providing a non-linear but differentiable activation with better gradient flow.
The Universal Approximation Theorem: a network with one hidden layer and enough neurons can approximate any continuous function on a compact domain to arbitrary precision. This is a statement about existence, not learnability — gradient descent may not find the right weights, and the required number of neurons may be exponentially large.
Backpropagation: The Engine of Learning
A network learns by minimizing a loss function \(L\) (mean squared error for regression, cross-entropy for classification) with respect to its weights. Backpropagation applies the chain rule of calculus to compute the gradient of \(L\) with respect to every weight in the network.
Forward pass — compute predictions by propagating input through the network:
Consider a minimal 2-2-1 network with a single training example: input \(x = [1, 2]\), target \(y = 1\). Let layer 1 weight matrix \(W_1 = [[0.1, 0.3], [0.2, 0.4]]\), biases \(b_1 = [0.1, 0.2]\). Output layer weights \(W_2 = [0.5, 0.6]\), bias \(b_2 = 0.1\). Use sigmoid activation \(\sigma(z) = 1/(1+e^{-z})\) for both layers.
\[z_1 = W_1^T x + b_1 = \begin{bmatrix}0.1(1) + 0.2(2) + 0.1 \\ 0.3(1) + 0.4(2) + 0.2\end{bmatrix} = \begin{bmatrix}0.6 \\ 1.3\end{bmatrix}\]
\[a_1 = \sigma(z_1) = \begin{bmatrix}0.646 \\ 0.786\end{bmatrix}\]
\[z_2 = W_2^T a_1 + b_2 = 0.5(0.646) + 0.6(0.786) + 0.1 = 0.895\]
\[a_2 = \sigma(z_2) = 0.710\]
The network predicts 0.710; the true value is 1. The loss (squared error) is \(L = \frac{1}{2}(1 - 0.710)^2 = 0.042\).
Backward pass — compute how much each weight contributed to the error, using the chain rule from output back to input:
\[\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}\]
Working backward from the output:
Output layer gradient: \(\frac{\partial L}{\partial a_2} = -(y - a_2) = -0.290\). The derivative of sigmoid is \(\frac{\partial a_2}{\partial z_2} = a_2(1-a_2) = 0.710 \times 0.290 = 0.206\). So \(\frac{\partial L}{\partial z_2} = -0.290 \times 0.206 = -0.060\).
Gradient for output weights: \(\frac{\partial z_2}{\partial W_{2,1}} = a_{1,1} = 0.646\). So \(\frac{\partial L}{\partial W_{2,1}} = \frac{\partial L}{\partial z_2} \times a_{1,1} = -0.060 \times 0.646 = -0.039\). Similarly \(\frac{\partial L}{\partial W_{2,2}} = -0.060 \times 0.786 = -0.047\).
First layer gradient: \(\frac{\partial z_2}{\partial a_{1,1}} = W_{2,1} = 0.5\). So the error backpropagated to the first hidden neuron is \(\frac{\partial L}{\partial a_{1,1}} = \frac{\partial L}{\partial z_2} \times 0.5 = -0.030\). This error then passes through the sigmoid derivative of that neuron (\(a_{1,1}(1-a_{1,1}) = 0.229\)) to give \(\frac{\partial L}{\partial z_{1,1}} = -0.030 \times 0.229 = -0.007\). Finally, \(\frac{\partial L}{\partial W_{1,11}} = \frac{\partial L}{\partial z_{1,1}} \times x_1 = -0.007 \times 1 = -0.007\).
The pattern: each step multiplies the upstream gradient by a local gradient — the derivative of the local operation with respect to its input. This is the chain rule in action: errors propagate backward, scaling at each layer by the activation function’s derivative.
Gradient descent update: \(w = w - \eta \frac{\partial L}{\partial w}\), where \(\eta\) is the learning rate. After this update, the forward pass is recomputed with the new weights, and the process repeats.
The vanishing gradient problem: with sigmoid or tanh activations, the derivative is at most 0.25. Multiplied across 10+ layers, the gradient approaches zero exponentially — early layers receive near-zero updates and stop learning. Solutions: ReLU (derivative is 1 for positive inputs, preserving gradient magnitude), Batch Normalization (keeps activations in a well-behaved range), and residual connections (allow gradients to bypass layers entirely via skip connections).
Optimizers: How Gradients Drive Updates
The gradient \(\partial L/\partial w\) tells you the direction of steepest ascent — but not how far to step. The learning rate \(\eta\) controls step size. Using a single \(\eta\) for all weights assumes all parameters need updates of the same scale, which is false: a poorly initialized weight in one layer may need large corrections while a well-initialized weight in another layer needs fine-tuning.
SGD with momentum accumulates a running average of past gradients (\(v_t = \gamma v_{t-1} + \eta \nabla L\)), smoothing oscillations and accelerating progress through ravines — common in loss landscapes with long, narrow valleys.
RMSProp and AdaGrad normalize each parameter’s learning rate by the root-mean-square of its recent gradients. Parameters with consistently large gradients (steep directions) get smaller steps; parameters with small gradients get larger steps. This prevents oscillations in steep directions while allowing progress in flat directions.
Adam combines both ideas by tracking two running averages:
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment — momentum)}\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment — per-parameter scaling)}\] \[\theta_{t+1} = \theta_t - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}\]
where \(g_t\) is the gradient at step \(t\). Each parameter effectively has its own adaptive learning rate while maintaining momentum.
AdamW (the recommended default) decouples weight decay from the adaptive learning rate. In standard Adam, applying L2 weight decay through the gradient (\(g_t = g_t + \lambda \theta\)) interacts with the adaptive scaling in complex, unintended ways. AdamW applies weight decay directly to the weights after the update: \(\theta_{t+1} = \theta_{t+1} - \eta \lambda \theta_t\), fixing the interaction and improving generalization.
Regularization for Deep Networks
- L2 weight decay: adds \(\lambda \|w\|^2\) to the loss, shrinking weights toward zero — the primary regularizer for network scale.
- Dropout: randomly sets a fraction of neuron activations to zero during training (typically 0.2-0.5). Prevents co-adaptation of neurons — each neuron must learn independently useful features. At test time, all neurons are used but their outputs are scaled by the dropout rate (inverted dropout).
- Early stopping: monitor validation loss during training; stop when it stops decreasing for a patience window (typically 5-10 epochs). Effectively limits the number of useful training iterations before overfitting.
- Data augmentation: for biological sequences, introducing random mutations (within evolutionary constraints), reverse complement flips, or dinucleotide shuffling during training improves robustness by preventing the network from memorizing exact sequence identities.
Convolutional Neural Networks for Biological Sequences
A dense layer connects every input to every output — no structure is built in. A convolutional layer exploits the fact that nearby elements (adjacent nucleotides, adjacent pixels) are more related than distant ones. The operation is simple: a small weight matrix (the kernel or filter) slides across the input, computing a dot product at each position.
The convolution operation. For a 1D input (sequence) of length \(L\) and a kernel of width \(k\) (typically 12-24 bp for DNA), the output at position \(i\) is:
\[y_i = \sum_{j=0}^{k-1} w_j \cdot x_{i+j}\]
This is a sliding dot product: the kernel aligns with positions \(i\) through \(i+k-1\) of the input, multiplies element-wise, and sums to a single value. The kernel then slides to position \(i+1\) and repeats.
Three properties emerge from this operation:
- Weight sharing: the same kernel is applied at every position. A network with a kernel of width 12 has only 12 weight parameters per kernel, regardless of whether the DNA sequence is 100 bp or 100 Mb. This is the source of CNN’s parameter efficiency.
- Locality: output \(y_i\) depends only on input positions \(i\) through \(i+k-1\). This builds in the prior that relevant patterns are local — a transcription factor binding site is ~6-20 bp, not spanning megabases.
- Translation equivariance: shifting the input by one position shifts the output by one position. The same motif is detected regardless of where it appears in the sequence.
In genomics, a 1D CNN with kernels of width 12-24 bp acts as a motif detector. Each kernel learns a weight matrix that matches a specific sequence pattern (a transcription factor binding motif). After training on ChIP-seq data, the kernels can be visualized as PWMs — the same position weight matrices described in Chapter 12. The key difference is that the CNN learns the motifs from data rather than requiring them to be specified as input. A well-trained first-layer kernel should be interpretable: it should have clear sequence preference at each position, and its consensus sequence should match a known transcription factor motif.
The hierarchy of CNNs is biologically intuitive: first-layer kernels detect short sequence motifs (6-12 bp), second-layer kernels detect combinations of motifs (e.g., “motif A near motif B” — a cis-regulatory module), and deeper layers detect higher-order regulatory logic. This hierarchical feature learning is why CNNs outperformed previous methods for predicting regulatory effects from sequence.
Autoencoders for scRNA-seq
Autoencoders compress high-dimensional data through a bottleneck: encoder (compression) → bottleneck (latent representation) → decoder (reconstruction). The bottleneck forces the model to learn the most informative low-dimensional features. A plain autoencoder learns a deterministic mapping: given input \(x\), the encoder produces a single latent vector \(z\), and the decoder reconstructs \(\hat{x}\). The loss is reconstruction error \(\|x - \hat{x}\|^2\).
The problem with a plain autoencoder: the latent space is not smooth. Points that are close in latent space may not decode to similar outputs because there is no constraint on how the space is organized. Interpolation between two encoded cells may produce biologically meaningless intermediate states.
scVI (single-cell Variational Inference) solves this by using a variational autoencoder (VAE). Instead of encoding each cell to a single latent point, the encoder outputs a probability distribution — a mean \(\mu(x)\) and variance \(\sigma^2(x)\) for each latent dimension. The latent vector \(z\) is sampled from this distribution (not computed deterministically), and the decoder must learn to reconstruct from any sample near \(\mu(x)\), not just from the exact point. This stochasticity forces the latent space to be continuous and structured: nearby latent points must decode to similar expression profiles, enabling meaningful interpolation and batch correction.
The VAE loss function has two terms:
\[\mathcal{L} = \underbrace{\mathbb{E}_{z \sim q}[\log p(x|z)]}_{\text{reconstruction loss}} - \beta \cdot \underbrace{D_{KL}(q(z|x) \| p(z))}_{\text{KL divergence}}\]
The reconstruction loss measures how well the decoder reproduces the input from the sampled latent vector. scVI’s innovation is modeling expression as a negative binomial distribution (Chapter 4) rather than assuming Gaussian-distributed data — counts are overdispersed, and the NB captures the \(\text{Var}(Y) = \mu + \alpha\mu^2\) relationship that Gaussian cannot. The KL divergence measures how closely the encoder’s distribution \(q(z|x)\) matches a standard normal prior \(p(z) = \mathcal{N}(0, I)\). This term penalizes the encoder for deviating from the prior, which keeps the latent space compact and continuous.
The critical caveat: autoencoder compression does not guarantee biological relevance. The bottleneck dimension is a hyperparameter that encodes a strong prior about the data’s intrinsic dimensionality. Too small a bottleneck (e.g., 2 dimensions) discards rare but important cell populations; too large a bottleneck (e.g., 100 dimensions) retains noise. The variational formulation helps (the KL penalty discourages using extra dimensions for noise) but does not eliminate the problem.
scVI corrects for batch effects, library size, and dropout by including these as conditional variables in both the encoder and decoder. The batch effect becomes an input to the model rather than a preprocessing step, enabling principled integration of datasets without the data leakage risks described in Chapter 21.
Biological Interpretation
The critical distinction in biological deep learning: a network with 99.9% training accuracy may have learned nothing about biology — it may have memorized batch effects, patient-specific noise, or technical artifacts. Always validate latent representations by checking if they preserve known biological structure (cell types, known pathways) before trusting them for discovery.
The \(p \gg n\) problem is extreme in drug response prediction: 20,000 genes \(\times\) 100 drugs, but only hundreds of cell lines with measured responses. A model that performs well on cross-validation within a single dataset almost never generalizes to an independent cohort. The 2025 reproducibility crisis in DL pharmacogenomics found that most published deep learning biomarkers for drug response do not replicate in independent cohorts. External validation is not optional — it is the only honest assessment.
When CNNs are applied to biological sequences, the learned kernels should be interpretable as sequence motifs. If the first-layer kernels do not resemble known transcription factor binding motifs (or show no clear sequence preference), the network may have learned artifactual patterns. Visualizing and validating learned features is the biological analogue of feature importance analysis in tree-based models (Chapter 22).
Current Landscape (Q2 2026)
- scVI-tools ecosystem continues to expand (scArches for model transfer across datasets, totalVI for CITE-seq protein + RNA). The variational inference paradigm dominates single-cell deep learning for batch correction and integration.
- New architectures for single-cell: scTab (2025) uses tabular transformers for cell type annotation, challenging the VAE dominance by treating cells as rows in a table rather than compressing through a bottleneck.
- Drug response prediction: CELLxGENE Discover Census now provides millions of single-cell profiles for pretraining, enabling transfer learning approaches for pharmacogenomics that outperform per-dataset training by learning generalizable drug-response features.
- Critical 2025 result: a systematic benchmark of deep learning methods for drug response prediction found that most published methods fail to generalize to held-out cell lines or drugs, with simple baselines (elastic net) often matching or exceeding complex neural network architectures.
Summary and Required Reading
- Neural networks are universal function approximators — but approximating a function and learning the correct one from limited data are different things.
- Backpropagation applies the chain rule — each step multiplies the upstream gradient by a local gradient; numerical examples make the pattern concrete.
- Optimizers adapt the learning process — Adam (momentum + per-parameter scaling) and AdamW (decoupled weight decay) are the default choices because different parameters need different update scales.
- CNNs detect motifs via sliding dot product convolution — weight sharing, locality, and translation equivariance make them ideal for biological sequences; first-layer kernels should be interpretable as PWMs (Chapter 12).
- VAEs (scVI) structure the latent space for single-cell data — the variational formulation with NB output enables principled batch correction and interpolation, but compression does not guarantee biological relevance.
Required Reading
- Goodfellow, Bengio & Courville: Deep Learning (Ch 6-8 — the fundamentals).
- Lopez et al.: “Deep generative modeling for single-cell transcriptomics” (Nature Methods, 2018) — scVI.
Johnson’s Rule: If your data has fewer than 1000 samples, use a tree model. Neural networks need scale to beat trees on tabular data.