Chapter 20: Perturb-seq and Functional Genomics
Johnson’s First Principle: Correlation is Not Causation
Observational multi-omics identifies correlation. A gene co-expressed with a phenotype across 500 patients may be a driver, a bystander, or a compensatory response. Only direct physical perturbation — knocking out the gene and measuring the consequence — can definitively map causal biological dependency networks.
The preceding chapters of this tier have built a sophisticated toolkit for observing molecular states. Chapter 16 taught you to build networks from correlations; Chapter 17 to measure proteins; Chapter 18 to integrate across modalities. But every one of these methods is fundamentally observational. You can correlate expression with phenotype, build co-expression networks, and integrate across omics layers — and still not know which genes are causal. Perturbation resolves this ambiguity by changing the system and measuring the response.
Core Concepts
CRISPR-Cas9 Biology
The CRISPR-Cas9 system. A guide RNA (gRNA) directs the Cas9 nuclease to a specific genomic locus via sequence complementarity. Cas9 creates a double-strand break ~3 bp upstream of the protospacer adjacent motif (PAM; NGG for SpCas9). The break is repaired by one of two pathways:
- Non-homologous end joining (NHEJ): Error-prone repair that introduces small insertions or deletions (indels). When the indel length is not a multiple of three, it shifts the reading frame (frameshift), and the result is a gene knockout via premature stop codon and nonsense-mediated decay. ~65-80% of indels are frameshifts.
- Homology-directed repair (HDR): High-fidelity repair using a template. Requires co-delivery of a repair template (ssODN or plasmid). HDR efficiency is low (1-10% in most cell types) and cell-cycle dependent (largely restricted to S and G2 phases).
gRNA design principles. The gRNA must be uniquely complementary to the target gene and should have no close matches elsewhere in the genome. Design rules: target early constitutive exons (frameshifts in early exons are more likely to produce complete loss of function), avoid exons with alternative splicing or partial functional redundancy, and design multiple gRNAs per gene (3-5) to distinguish on-target effects from off-target edits. Tools: CRISPick (Broad Institute) for guide selection, Cas-OFFinder for off-target prediction.
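The PAM and cut-site geometry described above can be sketched in a few lines. This is an illustrative scan only, not a replacement for CRISPick: the function name `find_spcas9_sites` and the 20-nt protospacer default are assumptions, only the forward strand is searched, and no on-target or off-target scoring is attempted.

```python
import re


def find_spcas9_sites(seq, protospacer_len=20):
    """Return candidate SpCas9 sites on the forward strand of `seq`.

    Each hit is (protospacer, pam, cut_index). cut_index is the 0-based
    index of the first base 3' of the predicted blunt cut, i.e. 3 bp
    upstream of the PAM.
    """
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping NGG motifs are all reported.
    for match in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = match.start()
        if pam_start < protospacer_len:
            continue  # not enough sequence 5' of the PAM for a full protospacer
        protospacer = seq[pam_start - protospacer_len:pam_start]
        sites.append((protospacer, match.group(1), pam_start - 3))
    return sites


if __name__ == "__main__":
    toy_exon = "GATTACAGATTACAGATTACATGGCC"  # hypothetical sequence
    for protospacer, pam, cut in find_spcas9_sites(toy_exon):
        print(protospacer, pam, cut)
```

A real design pass would also scan the reverse complement and filter candidates by exon position and predicted off-target matches.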
Pooled CRISPR Screens
A pooled screen scales perturbation to the entire genome. A library of ~100,000 gRNAs targeting all ~20,000 genes (5 gRNAs per gene, plus negative controls) is synthesized as an oligonucleotide pool, cloned into a lentiviral vector, packaged into lentivirus, and transduced into cells at low multiplicity of infection (MOI < 0.3) to ensure most transduced cells receive exactly one gRNA. The number of gRNAs per cell follows a Poisson distribution (Chapter 4) with parameter \(\lambda = \text{MOI}\). At MOI = 0.3, the probability that any given cell receives two or more gRNAs is \(P(k \geq 2 \mid \lambda = 0.3) \approx 0.04\). Because untransduced cells are removed by selection, the more relevant figure is the fraction of transduced cells carrying multiple gRNAs, \(P(k \geq 2 \mid k \geq 1) \approx 0.14\). This is acceptable for most screens, but it is the reason to stay at or below this MOI when assigning phenotype to a single perturbation is essential.
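The Poisson arithmetic can be checked directly. The sketch below assumes gRNA integrations per cell are Poisson-distributed with mean equal to the MOI, as stated above; the function names are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def multiplet_rates(moi):
    """Fractions of cells with 0, 1, and >=2 gRNAs at a given MOI."""
    p0 = poisson_pmf(0, moi)
    p1 = poisson_pmf(1, moi)
    p_multi = 1.0 - p0 - p1
    return {
        "untransduced": p0,
        "single": p1,
        "multiple": p_multi,
        # Fraction of cells surviving selection that carry >1 gRNA.
        "multiple_given_transduced": p_multi / (1.0 - p0),
    }


if __name__ == "__main__":
    for moi in (0.1, 0.3, 0.5, 1.0):
        rates = multiplet_rates(moi)
        print(moi, {name: round(value, 3) for name, value in rates.items()})
```

The conditional rate is the one that matters after antibiotic selection, and it rises quickly with MOI, which is the practical argument for low-MOI transduction.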
Screen types:
- Positive selection (enrichment): gRNAs that confer a growth advantage become enriched over time. Used to identify drug resistance genes (transduce cells, apply drug, sequence gRNAs from survivors).
- Negative selection (dropout): gRNAs targeting essential genes become depleted over time. Used to identify genes required for cell survival under specific conditions. Dropout screens require careful normalization: gRNA abundance at day 0 vs. day 14 across control and treatment conditions.
- Fluorescence-activated screens: Cells are sorted by marker expression (e.g., a surface protein for a phenotype of interest), and gRNA abundance in the high vs. low bins identifies genes that regulate that marker.
gRNA quantification from sequencing. After the selection period, genomic DNA is extracted from the cell population, and the integrated gRNA cassette is amplified by PCR. The resulting amplicon is sequenced, and reads are mapped back to the original gRNA library sequences. Each gRNA’s read count is proportional to its representation in the surviving cell population — depletion relative to the initial time point (day 0) indicates that cells carrying that gRNA were lost (negative selection), and enrichment indicates a growth advantage (positive selection). The output is a count matrix (gRNAs \(\times\) time points or conditions) that feeds into the statistical models below.
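A minimal sketch of the counting step follows. It assumes the protospacer sits at a fixed offset in every read and that only exact matches are counted; real pipelines typically also handle staggered primers, mismatches, and quality filtering. The names `count_guides`, `start`, and `length` are hypothetical.

```python
from collections import Counter


def count_guides(reads, library, start=0, length=20):
    """Count reads per gRNA by exact match of the protospacer.

    reads   : iterable of read sequences (strings)
    library : dict mapping protospacer sequence -> gRNA identifier
    Returns (counts, unmatched), where counts includes zero-count gRNAs.
    """
    counts = Counter({guide_id: 0 for guide_id in library.values()})
    unmatched = 0
    for read in reads:
        spacer = read[start:start + length].upper()
        guide_id = library.get(spacer)
        if guide_id is None:
            unmatched += 1
        else:
            counts[guide_id] += 1
    return counts, unmatched


if __name__ == "__main__":
    toy_library = {"A" * 20: "GENE_A_g1", "C" * 20: "GENE_B_g1"}  # hypothetical
    toy_reads = ["A" * 20 + "GTTT", "A" * 20 + "GTTT", "G" * 24]
    print(count_guides(toy_reads, toy_library))
```

Running this once per sample (day 0, day 14, each condition) yields the columns of the count matrix. The unmatched fraction is worth tracking, since a high value usually signals a wrong offset or a library mismatch.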
Statistical analysis. The Negative Binomial distribution (Chapter 4) models gRNA count data, where the mean-variance relationship \(\text{Var}(Y) = \mu + \alpha \mu^2\) captures both sequencing noise and biological variation across cells. Tools (MAGeCK, BAGEL) test each gRNA’s abundance change against the negative control gRNAs (non-targeting sequences) to estimate a per-gene effect size and false discovery rate. MAGeCK uses a modified NB model with mean-variance estimation pooling across all gRNAs; BAGEL uses a Bayesian framework with a prior derived from known essential and non-essential gene sets.
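To make the idea of testing against non-targeting controls concrete, here is a deliberately simplified sketch. It is not the MAGeCK or BAGEL algorithm: it uses total-count scaling, a pseudocount, and a z-score against the control gRNA distribution, then takes the median z-score per gene. All names, including the `"NTC"` control label, are assumptions.

```python
import math
from statistics import mean, median, stdev


def log2_fold_changes(day0, day14, pseudocount=1.0):
    """Per-gRNA log2 fold change after scaling day 14 to the day 0 library size."""
    scale = sum(day0.values()) / sum(day14.values())
    return {
        guide: math.log2((day14[guide] * scale + pseudocount)
                         / (day0[guide] + pseudocount))
        for guide in day0
    }


def gene_scores(lfc, guide_to_gene, control_label="NTC"):
    """Median z-score per gene, standardized against non-targeting controls."""
    control = [lfc[g] for g, gene in guide_to_gene.items() if gene == control_label]
    mu, sd = mean(control), stdev(control)  # needs at least two control gRNAs
    z_by_gene = {}
    for guide, gene in guide_to_gene.items():
        if gene != control_label:
            z_by_gene.setdefault(gene, []).append((lfc[guide] - mu) / sd)
    return {gene: median(z) for gene, z in z_by_gene.items()}


if __name__ == "__main__":
    guide_to_gene = {"n1": "NTC", "n2": "NTC", "n3": "NTC",
                     "a1": "GENE_A", "a2": "GENE_A", "b1": "GENE_B"}
    day0 = {g: 100 for g in guide_to_gene}  # toy counts
    day14 = {"n1": 100, "n2": 110, "n3": 90, "a1": 10, "a2": 20, "b1": 100}
    print(gene_scores(log2_fold_changes(day0, day14), guide_to_gene))
```

A strongly negative score indicates dropout relative to the controls. A production analysis would replace the z-score with a Negative Binomial test and a proper false discovery rate procedure, as the tools named above do.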
Copy number bias correction. In cancer cell lines with extensive chromosomal amplifications, a gRNA targeting an amplified region has additional target copies per cell, so Cas9 introduces many double-strand breaks at once. The resulting DNA damage response slows proliferation regardless of the target gene's function, so these gRNAs drop out and mimic an essential-gene signature: the depletion reflects how many copies are cut, not whether the gene is required for survival. The CERES algorithm (Meyers et al., 2017) models each gRNA's depletion as a function of both gene essentiality and local copy number, using copy number data (from WGS or SNP arrays) as a measured covariate to decouple the two effects. Without this correction, genes in amplified regions show false positive essentiality rates of 30-50%, a systematic bias that invalidates uncorrected essentiality screens in aneuploid cell lines. This is among the most common computational artifacts in cancer CRISPR screens and should be checked before interpreting any essentiality hits.
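The idea of treating copy number as a measured covariate can be illustrated with a one-variable regression. This is not CERES, which fits a richer model across many cell lines. The sketch assumes a linear relationship between log fold change and copy number, and assumes most gRNAs target non-essential genes so that the fitted slope reflects the cutting effect. Function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # requires at least two distinct copy-number values
    return slope, mean_y - slope * mean_x


def correct_for_copy_number(lfc, copy_number, reference_cn=2):
    """Remove the fitted copy-number trend from per-gRNA log fold changes.

    Values are adjusted to what would be expected at `reference_cn` copies.
    """
    guides = list(lfc)
    slope, _ = fit_line([copy_number[g] for g in guides],
                        [lfc[g] for g in guides])
    return {g: lfc[g] - slope * (copy_number[g] - reference_cn) for g in guides}


if __name__ == "__main__":
    copy_number = {"g1": 2, "g2": 4, "g3": 6, "g4": 8}       # toy values
    lfc = {"g1": 0.0, "g2": -0.4, "g3": -0.8, "g4": -1.2}    # pure cutting effect
    print(correct_for_copy_number(lfc, copy_number))
```

In the toy data the depletion is entirely explained by copy number, so the corrected values collapse to approximately zero. A gene that remains depleted after correction is a better essentiality candidate.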
Perturb-seq: CRISPR + Single-Cell Readout
Pooled screens measure only a single phenotype (survival, marker expression). Perturb-seq marries CRISPR perturbation with scRNA-seq readout: each cell receives a gRNA, and the full transcriptome is measured simultaneously. This enables measurement of the complete transcriptional consequence of each knockout — not a single phenotype but thousands of expression features per perturbation.
The experimental pipeline. Cells are transduced with a gRNA library, then processed through the 10x Chromium (Chapter 13). The gRNA is detected as an additional feature (the gRNA transcript is captured by a specific primer during reverse transcription and sequenced alongside the cell barcode). Each cell’s transcriptome is linked to its gRNA, identifying which gene was knocked out in that cell.
The computational challenge is detecting the perturbation signature against background variation. A knockout of a transcription factor may affect hundreds of genes, but each gene’s individual expression change is small and noisy. The signal is in the coordinated response across multiple genes, not in any single gene’s fold change.
- Mixscape (Seurat) classifies each cell carrying a targeting gRNA as “perturbed” or “non-perturbed” by fitting a mixture model to per-cell perturbation scores. A knockout with 50% cutting efficiency means half the cells carrying a gRNA have no functional knockout, and classifying them correctly is essential for detecting the perturbation signature. Mixscape computes each cell's perturbation signature relative to similar non-targeting control cells, then projects it onto the genes that respond to that perturbation to obtain a “perturbation score” per cell.
- scMAGeCK links perturbations to expression changes through two modules. scMAGeCK-RRA uses robust rank aggregation to test whether cells carrying a given perturbation are concentrated at the extremes of a gene's expression ranking. scMAGeCK-LR uses linear regression to estimate the effect of each perturbation on each gene, which also accommodates cells that carry more than one gRNA.
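The perturbed versus non-perturbed classification can be illustrated with a toy projection score. This is loosely inspired by the Mixscape idea and is not its implementation: it uses a global control mean rather than nearest control neighbors, all genes rather than selected responders, and a fixed threshold rather than a mixture model. Every name and the 0.5 threshold are assumptions.

```python
def mean_vector(cells):
    """Column-wise mean of a list of equal-length expression vectors."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]


def perturbation_scores(control_cells, guide_cells):
    """Project each gRNA-bearing cell onto the mean perturbation signature.

    The signature is mean(guide cells) - mean(control cells). Scores are
    scaled so that their average over the guide cells is 1.
    """
    control_mean = mean_vector(control_cells)
    guide_mean = mean_vector(guide_cells)
    signature = [g - c for g, c in zip(guide_mean, control_mean)]
    norm_sq = sum(s * s for s in signature)
    if norm_sq == 0:
        return [0.0] * len(guide_cells)  # no detectable shift at all
    scores = []
    for cell in guide_cells:
        delta = [x - c for x, c in zip(cell, control_mean)]
        scores.append(sum(d * s for d, s in zip(delta, signature)) / norm_sq)
    return scores


def classify(scores, threshold=0.5):
    """Label cells by a fixed score cutoff (an illustrative simplification)."""
    return ["perturbed" if s > threshold else "non-perturbed" for s in scores]


if __name__ == "__main__":
    controls = [[10.0, 10.0], [10.0, 10.0]]                      # toy expression
    guide_cells = [[10.0, 10.0], [2.0, 18.0], [2.0, 18.0], [10.0, 10.0]]
    print(classify(perturbation_scores(controls, guide_cells)))
```

Because escaping cells dilute the mean signature, responders score well above 1 and escapers near 0, which is the bimodality that a mixture model exploits.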
Interpreting Screen Results
Essential genes: gRNAs that deplete over time identify genes required for cell survival. These are potential cancer drug targets — but a gene essential in one cell type may be dispensable in another. The “essentialome” is context-dependent, not universal.
Contextual essentiality and synthetic lethality. A gene may be essential only in specific genetic backgrounds. The classic example: BRCA1/2 mutations sensitize cells to PARP inhibition because the two DNA repair pathways involved (homologous recombination and PARP-dependent base excision repair) compensate for each other. Loss of both is lethal; loss of either alone is viable. This is the synthetic lethality concept: two genes are synthetically lethal if mutation in either alone is viable but mutation in both is lethal. Cancer cells accumulate hundreds of mutations, and each one is a potential vulnerability to synthetic lethal targeting.
The computational challenge: distinguishing true synthetic lethality from the combined effect of two independent perturbations. If each knockout alone causes a 20% growth reduction and the double knockout causes roughly 36%, the result is what independence predicts, not synergy. Separating the two cases requires modeling the expected combined effect under independence (typically Bliss independence: \(E_{AB} = E_A + E_B - E_A E_B\), which gives \(0.2 + 0.2 - 0.04 = 0.36\) here) and testing for significant deviation from it.
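The Bliss calculation is short enough to write out. The sketch assumes effects are expressed as fractional growth reductions between 0 and 1; the function names are illustrative, and a real analysis would test the excess against replicate noise rather than read it off directly.

```python
def bliss_expected(effect_a, effect_b):
    """Expected combined effect of two independent perturbations."""
    return effect_a + effect_b - effect_a * effect_b


def bliss_excess(effect_a, effect_b, observed_ab):
    """Observed minus expected effect; positive values suggest synergy."""
    return observed_ab - bliss_expected(effect_a, effect_b)


if __name__ == "__main__":
    # Two knockouts that each reduce growth by 20%.
    print(bliss_expected(0.2, 0.2))
    # A double knockout that nearly abolishes growth deviates strongly.
    print(bliss_excess(0.2, 0.2, 0.90))
```

With two 20% single-knockout effects, independence predicts a 36% combined effect, slightly below the simple sum of 40%. A synthetic lethal pair shows a large positive excess over that expectation.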
Biological Interpretation
A Perturb-seq result must be interpreted through the lens of specificity and penetrance. A knockout that causes 50% of cells to change their expression state and 50% to remain unchanged suggests either: incomplete editing (Cas9 did not cut in all cells), compensatory mechanisms (paralogous genes buffer the loss), or a probabilistic threshold effect (the expression change requires additional signals present in only a fraction of cells). The response distribution — not just the mean effect — contains the biological information. A gene with a large mean effect driven by a small fraction of strongly responding cells and a majority of non-responders is fundamentally different from one with a moderate but uniform response across all cells.
Cutting efficiency is not knockout efficiency. A gRNA that cuts the genome with 90% efficiency may produce only 60% functional knockout — the remaining 30% of cuts were in-frame indels that preserve protein function. This discrepancy between molecular editing (detected by sequencing) and functional perturbation (detected by phenotype) is the most common source of false negatives in CRISPR screens. Measuring protein abundance by FACS or western blot is the gold standard for confirming functional knockout.
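The arithmetic behind the 90% versus 60% example can be reproduced from a list of observed indel sizes. The sketch assumes that only frameshift indels abolish function and that in-frame indels fully preserve it, which is a simplification (an in-frame deletion in a critical domain can still inactivate the protein). The indel sizes are toy values.

```python
def frameshift_fraction(indel_sizes):
    """Fraction of indels whose length is not a multiple of three.

    Sizes are signed: negative for deletions, positive for insertions.
    Unedited alleles (size 0) should be excluded before calling.
    """
    if not indel_sizes:
        raise ValueError("need at least one indel")
    shifted = sum(1 for size in indel_sizes if size % 3 != 0)
    return shifted / len(indel_sizes)


def functional_knockout_estimate(cutting_efficiency, indel_sizes):
    """Estimated knockout rate = cutting efficiency x frameshift fraction."""
    return cutting_efficiency * frameshift_fraction(indel_sizes)


if __name__ == "__main__":
    observed_indels = [-1, 1, -3, -2, 2, 6]  # toy amplicon-sequencing calls
    print(frameshift_fraction(observed_indels))
    print(functional_knockout_estimate(0.9, observed_indels))
```

Four of the six toy indels shift the frame, so 90% cutting translates to roughly 60% functional knockout, with the remaining 30% of edited alleles carrying in-frame indels.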
Synthetic lethality is the most clinically actionable perturbation concept, but the gap between in vitro synthetic lethal hits and in vivo therapeutic efficacy is large. A gene pair that is synthetically lethal in a 2D culture may not be synthetically lethal in a 3D tumor microenvironment with hypoxia, immune pressure, and stromal interactions. In vivo CRISPR screens using barcoded tumor libraries in mice are bridging this gap, but at substantially higher cost and complexity.
Current Landscape (Q2 2026)
- Prime editing screens enable single-nucleotide resolution perturbation, moving beyond full knockouts to model specific patient variants — enabling functional characterization of every missense mutation in a tumor suppressor gene.
- In vivo CRISPR screens using barcoded tumor libraries in mice are identifying genes required for metastasis, immune evasion, and drug resistance in physiologically relevant microenvironments, revealing that many in vitro essential genes are dispensable in vivo.
- Perturb-seq at scale: 10x Genomics now offers CRISPR screening as an integrated product with 100,000+ cells per experiment, making Perturb-seq accessible to individual labs without custom microfluidics.
- Base editing and CRISPRa/i screens (CRISPR activation and interference) enable perturbation of non-coding regulatory elements at high throughput, addressing the ~98% of the genome that does not code for protein and is inaccessible to knockout-based screens.
- Multiplexed perturbation (Perturb-seq with 2-3 gRNAs per cell) maps genetic interactions at scale, measuring how the effect of one knockout depends on the presence of another — the functional genomics equivalent of epistasis mapping.
Summary and Required Reading
- CRISPR-Cas9 creates double-strand breaks repaired by NHEJ (frameshift knockout) or HDR (precise editing). gRNA design requires early constitutive exons and multiple guides per gene.
- Pooled screens measure enrichment or depletion of gRNAs under selection — the Negative Binomial model (Chapter 4) provides the statistical framework for identifying significant hits.
- Perturb-seq combines CRISPR with scRNA-seq readout, measuring the complete transcriptional consequence of each knockout — the signal is in coordinated gene responses, not individual fold changes.
- Synthetic lethality identifies cancer-specific vulnerabilities that drugs can target; distinguishing synergy from the combined effect of independent perturbations requires modeling the expected combined effect under independence.
- Cutting efficiency ≠ knockout efficiency — in-frame indels preserve function; measure protein to confirm knockout.
Required Reading
- Shalem et al.: “Genome-scale CRISPR-Cas9 knockout screening in human cells” (Science, 2014).
- Dixit et al.: “Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens” (Cell, 2016).
- Meyers et al.: “Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells” (Nature Genetics, 2017).
Johnson’s Rule: If you want to know whether a gene matters, don’t look at its expression — knock it out.