Chapter 14: Trajectory Inference and RNA Velocity

Johnson’s First Principle: Biology is a Continuum

A hematopoietic stem cell does not instantaneously transform into a red blood cell. It passes through thousands of intermediate transcriptomic states. If you force developing cells into discrete clusters, you destroy the timeline of their evolution. Chapter 13 showed how to identify cell types; this chapter addresses a deeper question: within a cell type, how do cells transition from one state to another?

The challenge is observational: you have a single snapshot — tissue from a single time point — but you need temporal information. Pseudotime and RNA velocity recover this temporal dimension from static data by exploiting two different signals: transcriptional similarity between cells (pseudotime) and the kinetics of RNA splicing (velocity).

Core Concepts

Pseudotime: Ordering Cells Without a Clock

Pseudotime algorithms order cells along a developmental trajectory using transcriptional similarity, without knowledge of real time. The fundamental assumption is that cells captured at different points along a continuous process can be ordered by their expression profiles: a cell halfway through differentiation expresses a characteristic set of intermediate genes that places it between the starting and ending states.

Minimum Spanning Trees (Monocle). Builds a KNN graph in reduced-dimension space and finds its minimum spanning tree (MST): the set of edges that connects all cells with the smallest total edge length. With user-specified root cells (the “start”), pseudotime is the distance along the MST from the root to each cell. Changing the root changes all downstream interpretations, so the root assumption is the critical user decision. Monocle also learns the “ordering” genes (those that change along the trajectory) as part of the algorithm, which can create circularity: the selected genes determine the trajectory, and the trajectory determines which genes are selected. This is not a bug; it is the algorithm’s best inference. It does mean that different initial gene selections can produce different trajectories from the same data.
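The MST idea can be sketched in a few lines. This is a minimal illustration, not Monocle's implementation: it assumes cells are already embedded as 2-D coordinates, uses a fully connected Euclidean graph instead of a KNN graph, and the names `mst_pseudotime` and `cells` are hypothetical.

```python
import heapq
import math

def mst_pseudotime(points, root):
    """Pseudotime as path length along the Euclidean minimum spanning tree.

    points: list of coordinate tuples (cells in a reduced-dimension space).
    root:   index of the cell chosen as the start of the trajectory.
    """
    n = len(points)
    in_tree = [False] * n
    pseudotime = [0.0] * n
    # Prim's algorithm: grow the tree from the root, always adding the
    # shortest edge that reaches a cell not yet in the tree.
    heap = [(0.0, root, root)]  # (edge length, parent, child)
    while heap:
        length, parent, child = heapq.heappop(heap)
        if in_tree[child]:
            continue
        in_tree[child] = True
        # The tree path to the child is the path to its parent plus one edge.
        pseudotime[child] = pseudotime[parent] + length
        for other in range(n):
            if not in_tree[other]:
                edge = math.dist(points[child], points[other])
                heapq.heappush(heap, (edge, child, other))
    return pseudotime

# Toy data (assumed): four cells on a line plus one cell branching off the third.
cells = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (2.0, 1.0)]
from_left = mst_pseudotime(cells, root=0)
from_right = mst_pseudotime(cells, root=3)
print(from_left)
print(from_right)
```

The tree is the same in both calls; only the root differs, and the ordering of every cell changes with it.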

Diffusion maps (destiny). Models cell-to-cell transitions as a random walk. Cells connected by many short paths are close in diffusion pseudotime. This approach is more robust to noise than MST-based methods because it considers all paths, not just the single shortest. Diffusion maps are particularly effective for datasets with continuous gradients (e.g., cell cycle) rather than discrete branching (e.g., differentiation with two daughter fates).
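The random-walk view can be illustrated with a small diffusion-distance calculation. This is a sketch under stated assumptions, not the algorithm used by destiny: it builds a Gaussian-kernel random walk over all cells, takes a fixed number of steps, and measures how different the resulting transition distributions are for two cells. The function names, `sigma`, and `steps` are hypothetical choices.

```python
import math

def random_walk(points, sigma=1.0):
    """Row-normalized Gaussian-kernel transition matrix and its stationary distribution."""
    kernel = [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
              for p in points]
    degree = [sum(row) for row in kernel]
    T = [[k / d for k in row] for row, d in zip(kernel, degree)]
    # For a symmetric kernel the stationary distribution is proportional to degree.
    total = sum(degree)
    stationary = [d / total for d in degree]
    return T, stationary

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_distance_from(points, root, steps=3, sigma=1.0):
    """Distance between the t-step transition distribution of the root and of every cell."""
    T, stationary = random_walk(points, sigma)
    Tt = T
    for _ in range(steps - 1):
        Tt = matmul(Tt, T)
    return [math.sqrt(sum((a - b) ** 2 / pi
                          for a, b, pi in zip(Tt[root], row, stationary)))
            for row in Tt]

# Toy data (assumed): six cells evenly spaced along a single gradient.
line = [(float(i), 0.0) for i in range(6)]
distances = diffusion_distance_from(line, root=0)
print(distances)
```

Because every entry of the multi-step transition matrix sums over all paths of that length, a single noisy edge moves the result less than it would move a shortest-path distance.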

Simultaneous principal curves (Slingshot). Identifies multiple trajectories and their branch points together. A principal curve is a smooth curve that passes through the middle of the data; each cell is projected onto the curve, and its position along the curve is its pseudotime. For branching lineages, Slingshot takes cluster labels as input, constructs a minimum spanning tree between cluster centers to define the lineages, and then fits a principal curve along each lineage. This hybrid approach (clustering followed by continuous trajectory) bridges the discrete-to-continuous gap that Chapter 13’s clustering framework leaves open.

The root assumption. Every pseudotime method requires a root — a cell or set of cells designated as the starting point. The root is typically identified from prior biological knowledge (e.g., the earliest developmental stage in the dataset) or from expression of known marker genes. The critical caveat: different root choices produce different pseudotime orderings, and there is no computational way to validate the correct root from the expression data alone. RNA velocity (below) removes this requirement by inferring directionality from splicing kinetics.

RNA Velocity: The Physics of Splicing

Pseudotime orders cells but cannot determine direction — it requires the user to specify a root. RNA velocity solves this by adding a directional signal from RNA splicing kinetics.

RNA velocity infers transcriptional direction from the balance of unspliced pre-mRNA and spliced mRNA. When a gene is activated, transcription produces unspliced pre-mRNA, which is then spliced into mature mRNA and eventually degraded. A high ratio of unspliced to spliced mRNA for a given gene means transcription has recently increased — unspliced molecules have accumulated and have not yet been fully spliced. A low ratio means transcription has slowed and the remaining spliced mRNA is decaying.

The kinetics follow a coupled pair of ordinary differential equations:

\[\frac{du}{dt} = \alpha - \beta u \quad \text{(unspliced)}\]

\[\frac{ds}{dt} = \beta u - \gamma s \quad \text{(spliced)}\]

where \(u\) is unspliced counts, \(s\) is spliced counts, \(\alpha\) is the transcription rate, \(\beta\) the splicing rate, and \(\gamma\) the degradation rate.

Steady state. When \(\alpha\) is constant and enough time has passed, both \(u\) and \(s\) reach equilibrium (\(du/dt = 0\), \(ds/dt = 0\)):

\[u^* = \frac{\alpha}{\beta}, \quad s^* = \frac{\alpha}{\gamma} = \frac{\beta}{\gamma}\,u^*\]

At steady state, cells lie along the line \(s = (\beta/\gamma)u\) in the phase portrait (unspliced vs. spliced counts per gene, each point is a cell). The ratio \(s/u = \beta/\gamma\) is constant — splicing and degradation are balanced.
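The two equations and their steady state can be checked numerically. The sketch below integrates the system with a simple forward-Euler step and compares it with the closed-form solution for constant \(\alpha\) (valid when \(\beta \neq \gamma\)). The rate values and function names are arbitrary assumptions chosen for illustration.

```python
import math

def simulate(alpha, beta, gamma, u0=0.0, s0=0.0, t_end=10.0, dt=0.001):
    """Forward-Euler integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s."""
    u, s = u0, s0
    for _ in range(round(t_end / dt)):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u, s = u + dt * du, s + dt * ds
    return u, s

def analytic(alpha, beta, gamma, u0, s0, t):
    """Closed-form solution for constant alpha; requires beta != gamma."""
    eb, eg = math.exp(-beta * t), math.exp(-gamma * t)
    u = u0 * eb + alpha / beta * (1 - eb)
    s = (s0 * eg + alpha / gamma * (1 - eg)
         + (alpha - beta * u0) / (gamma - beta) * (eg - eb))
    return u, s

# Assumed rates: a gene switched on at t = 0 from an empty state.
alpha, beta, gamma = 2.0, 1.0, 0.5

u_early, s_early = simulate(alpha, beta, gamma, t_end=1.0)
u_late, s_late = simulate(alpha, beta, gamma, t_end=40.0)

print("early s/u:", s_early / u_early, "steady-state ratio:", beta / gamma)
print("late (u, s):", (u_late, s_late), "expected:", (alpha / beta, alpha / gamma))
```

Shortly after induction the ratio \(s/u\) is well below \(\beta/\gamma\); after many multiples of \(1/\gamma\) the simulated cell sits on the steady-state line.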

Interpreting the phase portrait. The descriptions here place unspliced counts on the horizontal axis and spliced counts on the vertical axis, matching the line \(s = (\beta/\gamma)u\). velocyto and scVelo plot the axes the other way round (spliced horizontal, unspliced vertical), so "above" and "below" swap in their figures. When transcription changes, cells deviate from the steady-state line:
- Cells above the line (\(s/u > \beta/\gamma\)) have excess spliced mRNA: transcription was recently downregulated, and spliced mRNA persists while the unspliced pool is depleted by splicing.
- Cells below the line (\(s/u < \beta/\gamma\)) have excess unspliced mRNA: transcription was recently upregulated, and unspliced mRNA is accumulating while splicing and degradation lag.
- Over time, the system relaxes back toward steady state, tracing a loop in the phase portrait. The direction of this loop (upregulation → below line → recovery, or downregulation → above line → recovery) encodes the velocity signal.

The original RNA velocity model (La Manno et al., 2018) estimated \(\beta/\gamma\) from cells assumed to be at steady state, then computed each cell’s velocity as its residual from the steady-state line. scVelo (Bergen et al., 2020) replaces this steady-state assumption with a dynamical model that estimates \(\alpha\), \(\beta\), and \(\gamma\) per gene by fitting the full ODE system across all cells. This is essential for transient states — in a developing tissue, few genes are at equilibrium, and the steady-state assumption systematically underestimates velocity for recently induced genes.
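A minimal sketch of the steady-state estimate for one gene follows. It is written in the spirit of the original model rather than as a reproduction of velocyto: it assumes the splicing rate is rescaled to 1, fits the slope of unspliced on spliced counts (an estimate of \(\gamma/\beta\), the reciprocal of the ratio above) through the origin using only the cells with the lowest and highest spliced counts, and reports each cell's residual as its velocity. The function names, the quantile, and the toy counts are assumptions.

```python
def fit_steady_state_slope(u, s, quantile=0.34):
    """Least-squares slope through the origin of u on s, using extreme cells only.

    The extreme cells (lowest and highest spliced counts) are assumed to be
    near steady state: fully off or fully on.
    """
    n = len(s)
    k = max(1, int(n * quantile))
    order = sorted(range(n), key=lambda i: s[i])
    extreme = order[:k] + order[-k:]
    numerator = sum(u[i] * s[i] for i in extreme)
    denominator = sum(s[i] ** 2 for i in extreme)
    return numerator / denominator

def steady_state_velocity(u, s, slope):
    """Residual from the steady-state line, proportional to ds/dt when beta = 1."""
    return [ui - slope * si for ui, si in zip(u, s)]

# Toy counts for one gene (assumed): four steady-state cells and two in transition.
spliced = [0.0, 0.0, 4.0, 4.0, 1.0, 3.0]
unspliced = [0.0, 0.0, 2.0, 2.0, 1.5, 1.0]

slope = fit_steady_state_slope(unspliced, spliced)
velocity = steady_state_velocity(unspliced, spliced, slope)
print(slope)
print(velocity)
```

Positive residuals mark cells with more unspliced mRNA than steady state predicts (induction); negative residuals mark repression. If few cells are actually at steady state, the fitted slope is biased, which is the weakness the dynamical model addresses.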

Preprocessing. Standard scRNA-seq quantification does not report unspliced and spliced counts separately. A dedicated preprocessing step (velocyto, or the spliced/unspliced quantification modes of alevin-fry and kallisto-bustools) classifies each read as unspliced (overlapping intronic sequence) or spliced (mapping only to exonic sequence, including reads that span exon-exon junctions) against an annotated reference.
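The classification logic can be illustrated with a toy function. This is a deliberately simplified sketch, not the rule set of velocyto, alevin-fry, or kallisto-bustools: it assumes a single gene with one annotated transcript, ignores strand and UMI handling, and represents a read as a list of aligned blocks in half-open coordinates. All names are hypothetical.

```python
def classify_read(blocks, exons):
    """Label a read as 'spliced', 'unspliced', or 'ambiguous' for one gene.

    blocks: aligned segments of the read as (start, end) pairs, half-open.
    exons:  exon intervals of the gene as (start, end) pairs, half-open.
    """
    exons = sorted(exons)
    # Introns are the gaps between consecutive exons.
    introns = [(prev_end, next_start)
               for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]

    def overlaps_intron(block):
        return any(block[0] < i_end and block[1] > i_start
                   for i_start, i_end in introns)

    def inside_exon(block):
        return any(e_start <= block[0] and block[1] <= e_end
                   for e_start, e_end in exons)

    if any(overlaps_intron(b) for b in blocks):
        return "unspliced"   # at least one base falls in an intron
    if all(inside_exon(b) for b in blocks):
        return "spliced"     # every aligned block is purely exonic
    return "ambiguous"       # e.g. extends beyond the annotated gene

# Toy gene (assumed): two exons separated by one intron at 200-300.
gene_exons = [(100, 200), (300, 400)]
print(classify_read([(150, 200), (300, 350)], gene_exons))  # junction-spanning read
print(classify_read([(180, 230)], gene_exons))              # runs from exon into intron
```

Real tools face additional decisions that this sketch omits, such as genes with several isoforms, where a region can be exonic in one transcript and intronic in another.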

CellRank: Integrating Velocity with Connectivity

CellRank (Lange et al., 2022) combines RNA velocity with transcriptomic similarity to model cell fate probabilities. It constructs a transition matrix where each entry \(T_{ij}\) is the probability that cell \(i\) transitions to cell \(j\), combining:
- Transcriptomic similarity: cells with similar expression profiles are more likely to be adjacent in developmental time.
- Velocity direction: cells should transition in the direction of the velocity field, not against it.
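One way to picture such a matrix is as a weighted mix of two row-stochastic kernels. The sketch below is an illustration of the idea, not CellRank's implementation: the similarity kernel is a Gaussian of distance, the velocity kernel is a softmax over the cosine between a cell's velocity and the step to each other cell, and the `weight`, `sigma`, and `scale` parameters are assumed values.

```python
import math

def similarity_kernel(X, sigma=1.0):
    """Row-normalized Gaussian similarity between cells, without self-transitions."""
    n = len(X)
    K = [[0.0 if i == j else math.exp(-math.dist(X[i], X[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    return [[k / sum(row) for k in row] for row in K]

def velocity_kernel(X, V, scale=5.0):
    """Softmax over the cosine between each cell's velocity and the step to each neighbor."""
    n = len(X)
    T = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)
                continue
            step = [b - a for a, b in zip(X[i], X[j])]
            norm = math.hypot(*V[i]) * math.hypot(*step)
            cosine = sum(v * d for v, d in zip(V[i], step)) / norm if norm > 0 else 0.0
            weights.append(math.exp(scale * cosine))
        total = sum(weights)
        T.append([w / total for w in weights])
    return T

def combine(T_vel, T_sim, weight=0.8):
    """Convex combination of two row-stochastic matrices is again row-stochastic."""
    return [[weight * a + (1 - weight) * b for a, b in zip(row_v, row_s)]
            for row_v, row_s in zip(T_vel, T_sim)]

# Toy data (assumed): three cells on a line, all with velocity pointing to the right.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
velocities = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
T = combine(velocity_kernel(positions, velocities), similarity_kernel(positions))
print(T[1])
```

For the middle cell, similarity alone treats both neighbors equally; the velocity term breaks the tie in favor of the cell downstream.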

The combined Markov chain is analyzed to identify absorbing states (terminal cell types) and compute the probability of reaching each terminal state from every starting cell. This identifies:

Root cells: initial states that the chain tends to flow away from
Terminal states: states that the chain tends to flow toward and remain in
Driver genes: genes whose expression correlates with the probability of reaching a given terminal state

Unlike pseudotime alone, CellRank does not require an explicit root assumption: directionality is inferred from the velocity field. The Markov chain formulation also provides a natural measure of fate commitment. Cells whose fate probabilities have high entropy (spread across several terminal states) are multipotent progenitors; cells with low entropy are committed to a specific fate.
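Fate probabilities and their entropy can be computed directly on a small absorbing Markov chain. The sketch below is a toy calculation, not CellRank's estimator: the four-state transition matrix is invented for illustration, terminal states are assumed to be known and absorbing, and the probabilities are obtained by repeatedly applying the transition matrix rather than by solving a linear system.

```python
import math

def fate_probabilities(T, terminal, n_iter=500):
    """Probability of ending in each terminal state, from every state.

    T:        row-stochastic transition matrix (list of lists).
    terminal: indices of absorbing states (their rows must map to themselves).
    """
    n = len(T)
    # Start from the indicator of each terminal state and propagate backwards.
    F = [[1.0 if i == t else 0.0 for t in terminal] for i in range(n)]
    for _ in range(n_iter):
        F = [[sum(T[i][j] * F[j][k] for j in range(n)) for k in range(len(terminal))]
             for i in range(n)]
    return F

def entropy(p):
    """Shannon entropy in nats; zero-probability fates contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy chain (assumed): 0 = progenitor, 1 = intermediate, 2 = fate A, 3 = fate B.
chain = [
    [0.50, 0.25, 0.00, 0.25],  # progenitor: stays, moves toward A, or commits to B
    [0.00, 0.50, 0.50, 0.00],  # intermediate: stays or commits to A
    [0.00, 0.00, 1.00, 0.00],  # fate A is absorbing
    [0.00, 0.00, 0.00, 1.00],  # fate B is absorbing
]
fates = fate_probabilities(chain, terminal=[2, 3])
print("progenitor:", fates[0], "entropy:", entropy(fates[0]))
print("intermediate:", fates[1], "entropy:", entropy(fates[1]))
```

The progenitor splits evenly between the two fates and has the maximum entropy for two outcomes; the intermediate state can only reach fate A, so its entropy is essentially zero.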

Validation: When the Trajectory is Wrong

Trajectory inference produces a compelling visual output — cells arranged along a smooth path from start to end — but this ordering is an inference, not an observation. The trajectory can be wrong in several ways that are invisible from the output alone:

False trajectories: Pseudotime algorithms will order any dataset, even one with no biological trajectory. Clustering followed by trajectory inference on random noise produces a smooth path through the noise.
Cell cycle as dominant signal: A velocity field that points in circles instead of along a trajectory suggests the dominant signal is the cell cycle, not differentiation. Before interpreting velocity as differentiation direction, validate against cell cycle phase markers — if the top velocity-driven genes are histone genes and cyclins, the signal is likely cycling, not development.
Splicing rate assumptions: RNA velocity assumes constant splicing rates across cells. This assumption fails in cell cycle, stress response, and metabolic transitions, where splicing rates change rapidly. The phase portrait (unspliced vs. spliced per gene) is the most informative diagnostic: genes with clear loops are the best candidates for driving the trajectory; genes with diffuse clouds are consistent with steady-state noise or cell cycle variation.
Root sensitivity: Pseudotime results depend on the root choice. A trajectory that appears robust to different root choices is more trustworthy than one that rearranges completely.

Velocity and pseudotime should be validated against known time points or orthogonal markers before trusting the inferred ordering. A trajectory that looks biologically plausible but disagrees with well-established developmental markers should be regarded with skepticism.
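A simple first check against known time points is a rank correlation between inferred pseudotime and collection time. The sketch below implements Spearman's correlation with average ranks for ties, since many cells share the same collection day. The example values are invented for illustration, and a high correlation is only consistent with a correct ordering; it does not prove one.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy data (assumed): inferred pseudotime and the day each cell was collected.
pseudotime = [0.05, 0.20, 0.35, 0.60, 0.90]
collection_day = [0, 0, 2, 2, 4]
print(spearman(pseudotime, collection_day))
```

The same function can compare pseudotime orderings obtained from different root choices: a correlation near 1 or near -1 indicates the same ordering up to direction, while a value near 0 indicates that the trajectory rearranges.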

Biological Interpretation

Pseudotime requires an explicit root assumption; changing the root changes all downstream interpretations. RNA velocity removes the root assumption but assumes constant splicing rates across cells, an assumption that fails in cell cycle, stress response, and metabolic transitions, where splicing rates change rapidly.

The trajectory is only as good as the clustering it builds on. If the underlying cell types are incorrectly identified (Chapter 13), the trajectory connecting them is meaningless. Conversely, a good trajectory can reveal that what appeared as two distinct clusters are actually endpoints of a continuous gradient — a common case in immune activation, where resting and activated states of the same cell type are separated by a continuum of intermediate states.

Trajectory inference from a single time point is fundamentally underdetermined. Multiple distinct biological processes can produce the same observed expression patterns. The trajectory is a hypothesis about the developmental process, not a measurement of it. Independent validation — time series experiments, genetic perturbations, or orthogonal assays — is required to confirm the inferred ordering.

The phase portrait visualization (unspliced vs. spliced per gene) is the most informative diagnostic: genes with clear loops (hysteresis) are the best candidates for driving the trajectory; genes with diffuse clouds are consistent with steady-state noise or cell cycle variation. A trajectory driven entirely by genes with weak phase portraits should be treated as a tentative hypothesis.

Current Landscape (Q2 2026)

Multi-omic velocity (MultiVelo, ChromVelo) extends RNA velocity to chromatin accessibility, predicting regulatory state transitions by modeling ATAC signal alongside unspliced and spliced RNA and linking chromatin opening to transcriptional output.
Dynamo (2022) uses metabolic labeling to estimate absolute RNA transcription and degradation rates without assuming steady state, replacing relative velocity estimation with absolute flux measurements that enable quantitative comparison of kinetic rates across genes.
Generative trajectory inference: TrajectoryNet uses neural ODEs (continuous normalizing flows) to interpolate between time points and predict unobserved intermediate states. scGen, a variational autoencoder, predicts expression responses to perturbation and so supports in silico perturbation of cell states.
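Spatial trajectory inference incorporates tissue coordinates alongside transcriptomic similarity, linking developmental transitions to anatomical position. stLearn reconstructs trajectories across tissue regions; SpaGCN identifies spatial domains and spatially variable genes rather than trajectories, and is typically used to define the regions a trajectory connects.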
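Contrastive embedding (CEBRA, 2023) uses contrastive learning to embed observations conditioned on auxiliary variables. It was developed for joint neural and behavioral recordings; whether such embeddings capture causal relationships between perturbations and cell states, moving beyond correlation-based pseudotime, remains an open question.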

Summary and Required Reading

Pseudotime orders cells along a developmental trajectory using transcriptional similarity. It requires explicit root specification, and changing the root changes all results.
RNA velocity uses unspliced/spliced ratios to predict future expression states from a single snapshot, removing the root assumption.
scVelo’s dynamical model relaxes the steady-state assumption for more accurate velocity in transient states, fitting splicing kinetics per gene.
CellRank integrates velocity with connectivity to identify root and terminal states and driver genes without requiring an explicit root.
Trajectory inference is fundamentally underdetermined — validate against known time points, orthogonal markers, or perturbation data before trusting the inferred ordering.

Required Reading

Trapnell et al.: “The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells” (Nature Biotechnology, 2014).
Bergen et al.: “Generalizing RNA velocity to transient cell states through dynamical modeling” (Nature Biotechnology, 2020).

Johnson’s Rule: Biology is a fluid continuum. If you force developing cells into discrete clusters, you destroy the timeline of their evolution.