Chapter 17: Mass Spectrometry and Proteomics

Johnson’s First Principle: mRNA is a Blueprint, Not a Machine

The correlation between mRNA abundance and protein abundance is modest: reported correlation coefficients are typically around 0.4-0.6 (often quoted as 40-60%). Translation rate, protein half-life, post-translational modifications, and active secretion decouple protein levels from transcript levels. If you want to understand what the cell is actually doing, measure proteins. RNA is just noisy intent.

This gap between transcript and protein is not a measurement error — it is a biological reality. A strongly regulated mRNA may have no detectable protein change due to long protein half-life (days for histones, hours for metabolic enzymes). Conversely, a protein may be activated by phosphorylation or degraded by ubiquitination without any change in its mRNA. Chapters 4 and 13 taught you to measure RNA; this chapter teaches you to measure the actual molecular machines.

Core Concepts

The Physics of Mass Spectrometry

Mass spectrometry measures the mass-to-charge ratio ($m/z$) of ionized molecules. The entire pipeline — from peptide separation to ion detection — is designed around this single measurement, and each component introduces its own constraints on what can be detected.

Ionization (ESI). Electrospray ionization creates charged droplets from the liquid chromatography (LC) effluent. As solvent evaporates, droplets undergo Coulombic fission, releasing gas-phase ions with multiple charges. The key consequence: a peptide of neutral mass 2,000 Da carrying 2 protons appears at $m/z \approx 1001$, because each charge adds one proton (about 1.007 Da); the same peptide with 3 charges appears at $m/z \approx 668$. Multiple charge states mean the same peptide appears at multiple $m/z$ values, complicating quantification but also enabling detection of high-mass peptides on instruments with limited $m/z$ range.
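The charge-state arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes positive-mode ESI where every charge is an added proton; the function name `mz` is illustrative only.

```python
PROTON_MASS = 1.007276  # Da, mass of the proton that carries each positive charge

def mz(neutral_mass, charge):
    """Return m/z for a peptide of the given neutral mass carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# The 2,000 Da peptide from the text at charge states 1 to 3.
for z in (1, 2, 3):
    print(f"z={z}: m/z = {mz(2000.0, z):.3f}")
```

The proton term is small but matters at high resolution: ignoring it shifts the doubly charged ion by about 1 $m/z$ unit, far more than a typical precursor mass tolerance.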

Mass analyzers. The choice of analyzer determines resolution, speed, and accuracy:

Quadrupole: Four parallel rods with alternating RF/DC voltages. Acts as a mass filter: only ions within a specific $m/z$ window pass through. Low resolution ($\sim$1,000 FWHM) but fast scanning. Used for precursor selection in MS/MS (isolate the ion of interest for fragmentation) and for DIA isolation windows.
Orbitrap: Ions oscillate around a central electrode. The frequency of axial oscillation is measured and converted to $m/z$ via Fourier transform. High resolution (up to 1,000,000 FWHM) capable of resolving isotopic fine structure. The Orbitrap is the dominant high-resolution analyzer in modern proteomics because it combines high resolution with reasonable scan speed.
TOF (Time-of-Flight): Ions are accelerated by a known voltage and their flight time to the detector is measured. Lighter ions arrive first. TOF has effectively unlimited $m/z$ range and fast acquisition speeds, making it ideal for DIA and for detecting intact proteins (top-down proteomics).
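Why lighter ions arrive first in a TOF follows from energy conservation: an ion accelerated through a potential $U$ gains kinetic energy $zeU = \tfrac{1}{2}mv^2$, so its flight time over a path of length $L$ scales with $\sqrt{m/z}$. The sketch below assumes an idealized linear flight tube (no reflectron, no initial energy spread); the 20 kV voltage and 1 m path length are hypothetical values chosen for illustration.

```python
import math

DALTON_KG = 1.66053907e-27           # kilograms per dalton
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs

def tof_flight_time(mz, accel_voltage, path_length):
    """Flight time in seconds for an ion of the given m/z in an idealized linear TOF."""
    mass_per_charge = mz * DALTON_KG / ELEMENTARY_CHARGE  # kg per coulomb
    return path_length * math.sqrt(mass_per_charge / (2.0 * accel_voltage))

# Hypothetical instrument: 20 kV acceleration, 1 m flight path.
for ion_mz in (668.0, 1001.0, 4004.0):
    t_us = tof_flight_time(ion_mz, accel_voltage=20_000.0, path_length=1.0) * 1e6
    print(f"m/z {ion_mz:7.1f}: {t_us:.2f} microseconds")
```

Because time grows only with the square root of $m/z$, quadrupling $m/z$ merely doubles the flight time, which is one reason the analyzer has no hard upper mass limit.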

MS/MS and acquisition modes. In data-dependent acquisition (DDA), the instrument selects the top $N$ most abundant precursor ions from each MS1 scan, fragments them by collision-induced dissociation (CID) or higher-energy collisional dissociation (HCD), and records the fragment ion spectrum (MS2). DDA is stochastic: the same peptide may be selected in one run but missed in the next due to minor abundance fluctuations. This “missing data problem” is the primary limitation of DDA — across multiple runs, only ~70% of detectable peptides are consistently identified.
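The stochastic behaviour of top-$N$ selection can be illustrated with a toy simulation. This is a sketch, not vendor acquisition logic: the peptide names, intensities, and the log-normal noise model are all assumptions made for the example.

```python
import random

def select_top_n(ms1_peaks, n):
    """Return the n precursors with the highest MS1 intensity."""
    ranked = sorted(ms1_peaks, key=ms1_peaks.get, reverse=True)
    return set(ranked[:n])

def simulate_dda(true_intensity, n, n_runs, noise_sd=0.2, seed=0):
    """Repeat top-N selection over several runs with log-normal intensity noise."""
    rng = random.Random(seed)
    selections = []
    for _ in range(n_runs):
        observed = {pep: inten * rng.lognormvariate(0.0, noise_sd)
                    for pep, inten in true_intensity.items()}
        selections.append(select_top_n(observed, n))
    return selections

# Hypothetical precursors: two clearly abundant, three close to the selection boundary.
peaks = {"pepA": 1e8, "pepB": 5e7, "pepC": 1.0e6, "pepD": 0.95e6, "pepE": 0.9e6}
runs = simulate_dda(peaks, n=3, n_runs=5)
print("selected in every run:", sorted(set.intersection(*runs)))
print("selected in at least one run:", sorted(set.union(*runs)))
```

Abundant precursors are picked every time, while the peptides near the cutoff trade places from run to run, which is the origin of run-to-run missing values.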

Data-independent acquisition (DIA) solves the stochasticity problem by fragmenting all precursor ions within sequential isolation windows (typically 25-50 Da) across the full $m/z$ range. Every peptide is fragmented regardless of abundance. The cost is spectral complexity: each DIA spectrum contains fragments from thousands of co-eluting peptides, requiring computational decoding against a spectral library, that is, a collection of known peptides with their fragmentation patterns and retention times, built from prior DDA runs or synthetic peptides. The library provides a template for each peptide’s MS2 spectrum; matching algorithms (Spectronaut, DIA-NN, OpenSWATH) decode the multiplexed DIA spectra by finding the combination of library entries that best explains the observed fragment ion intensities. This library-based decoding is the key conceptual innovation that makes DIA work: it replaces stochastic precursor selection with deterministic, comprehensive fragmentation. DIA achieves >10,000 protein identifications per run with only a few percent missing values (<5%) across replicates, making it the current standard for quantitative clinical proteomics.
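The deterministic nature of DIA comes from the fixed window scheme: every precursor falls into exactly one isolation window regardless of its intensity. A minimal sketch, assuming contiguous fixed-width windows over a hypothetical 400-1000 $m/z$ range (real methods often use overlapping or variable-width windows):

```python
def make_windows(mz_start, mz_end, width):
    """Build contiguous isolation windows [lo, hi) covering mz_start..mz_end."""
    windows = []
    lo = mz_start
    while lo < mz_end:
        hi = min(lo + width, mz_end)
        windows.append((lo, hi))
        lo = hi
    return windows

def assign_window(precursor_mz, windows):
    """Return the index of the window containing precursor_mz, or None if out of range."""
    for i, (lo, hi) in enumerate(windows):
        if lo <= precursor_mz < hi:
            return i
    return None

windows = make_windows(400.0, 1000.0, 25.0)
idx = assign_window(667.67, windows)
print(len(windows), "windows; m/z 667.67 falls in", windows[idx])
```

Nothing in `assign_window` depends on intensity, so a low-abundance peptide is fragmented in every cycle; the price is that all other precursors in the same window are fragmented with it.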

Bottom-Up Shotgun Proteomics

Proteins are digested with trypsin, which cleaves after lysine (K) and arginine (R) — unless followed by proline. The choice is deliberate: K and R are abundant and evenly distributed across most proteomes, producing peptides of 5-30 amino acids — the ideal length for MS/MS sequencing. Trypsin also leaves a basic residue at the C-terminus of each peptide, which promotes positive-charge retention during ESI and produces predictable fragment ion series (y-ions) in MS/MS.
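The cleavage rule is simple enough to express as a regular expression. The following sketch performs an in silico tryptic digest under the rule stated above (cleave after K or R unless the next residue is P); the length limits and the example sequence are illustrative assumptions, and the function is not taken from any particular search engine.

```python
import re

def trypsin_digest(protein, missed_cleavages=0, min_len=5, max_len=30):
    """Return tryptic peptides, optionally allowing missed cleavage sites."""
    # Zero-width split: position preceded by K or R and not followed by P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides

# Hypothetical sequence: the K before P is not cleaved.
print(trypsin_digest("AAAAAKPGGGGRCCCCCKDDDDD"))
```

Allowing one or two missed cleavages is common in practice because digestion is never complete, but each additional missed cleavage enlarges the search space.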

The digested peptides are separated by reverse-phase liquid chromatography on a C18 column with a gradient from aqueous (0.1% formic acid in water) to organic (0.1% formic acid in acetonitrile). Peptides elute in order of increasing hydrophobicity — short, hydrophilic peptides elute early; long, hydrophobic peptides elute late. This elution order provides an orthogonal dimension to $m/z$: two peptides with similar mass can be resolved if their hydrophobicity differs. A typical 2-hour gradient resolves ~50,000 peptide species, though peak capacity scales with gradient length — longer gradients (4+ hours) are used for deep discovery proteomics. After separation, peptides are ionized by ESI and analyzed by MS/MS. Each peptide yields a fragmentation spectrum that is searched against a protein sequence database.

Database searching (SEQUEST, Andromeda) compares the observed MS2 spectrum to theoretical spectra derived from in silico digestion of the protein database. The SEQUEST XCorr score measures the cross-correlation between observed and theoretical spectra, accounting for the fact that fragment ion intensities are not perfectly predicted — the correlation is computed across multiple alignment offsets to find the best match. Andromeda (the MaxQuant search engine) uses a probabilistic scoring model based on the probability of matching fragment ions by chance, given the precursor mass tolerance and fragment mass tolerance.
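To make "theoretical spectrum" concrete, the sketch below computes singly charged b- and y-ion $m/z$ values from monoisotopic residue masses and counts how many are found in an observed peak list. The shared-peak count is a deliberately simplified stand-in for a real score; it is not XCorr and not the Andromeda model, and the 0.02 tolerance is an assumed value.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions

def matched_peaks(observed_mz, peptide, tol=0.02):
    """Count theoretical b/y ions that have an observed peak within +/- tol."""
    b_ions, y_ions = fragment_ions(peptide)
    return sum(any(abs(obs - theo) <= tol for obs in observed_mz)
               for theo in b_ions + y_ions)

b, y = fragment_ions("PEPTIDEK")
print("y1 =", round(y[-1], 4), " b2 =", round(b[1], 4))
```

A candidate peptide that explains many observed peaks is a better match than one that explains few; production search engines refine this idea with intensity information and a model of chance matches.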

The fundamental limitation: database search can only identify peptides present in the reference database. Novel peptides from unannotated genes, sample-specific variants, or unexpected post-translational modifications will not match and remain unidentified. Open modification searches and spectral library searching partially address this.

The PSM-to-protein hierarchy. The path from a raw spectrum to a reported protein involves three levels of inference, each with its own error structure. At the bottom, a peptide-spectrum match (PSM) links one observed spectrum to one peptide sequence — this is the atom of proteomics identification. Multiple PSMs for the same peptide (from replicate scans, different charge states, or modified variants) are aggregated to a peptide identification. Finally, one or more peptides support a protein identification, typically requiring at least two unique peptides per protein (the two-peptide rule). Errors propagate upward: a single false PSM can falsely support a peptide that in turn falsely supports a protein. The target-decoy framework below addresses this by estimating FDR at each level — PSM-level, peptide-level, and protein-level FDR are reported separately, and the protein-level FDR is always the most stringent because it counts any protein with even one false peptide as a false discovery.
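The three levels can be sketched as a simple roll-up. This is an illustrative toy, assuming PSMs have already passed an FDR filter and that a peptide-to-protein mapping is available; the data structures and the function name `roll_up` are hypothetical.

```python
from collections import defaultdict

def roll_up(psms, peptide_to_proteins, min_unique_peptides=2):
    """Aggregate PSMs to peptides, then apply the two-peptide rule at the protein level.

    psms: list of (spectrum_id, peptide, score) tuples.
    peptide_to_proteins: dict mapping each peptide to the set of proteins containing it.
    """
    # Peptide level: keep the best-scoring PSM per peptide.
    best_score = {}
    for _spectrum, peptide, score in psms:
        best_score[peptide] = max(score, best_score.get(peptide, float("-inf")))

    # Protein level: count only peptides that map to exactly one protein.
    unique_support = defaultdict(set)
    for peptide in best_score:
        proteins = peptide_to_proteins.get(peptide, ())
        if len(proteins) == 1:
            unique_support[next(iter(proteins))].add(peptide)

    accepted = {prot: sorted(peps) for prot, peps in unique_support.items()
                if len(peps) >= min_unique_peptides}
    return best_score, accepted

psms = [("s1", "PEPA", 3.1), ("s2", "PEPA", 4.0), ("s3", "PEPB", 2.5),
        ("s4", "PEPC", 5.0), ("s5", "SHARED", 4.2)]
mapping = {"PEPA": {"P1"}, "PEPB": {"P1"}, "PEPC": {"P2"}, "SHARED": {"P1", "P2"}}
print(roll_up(psms, mapping))
```

In this example P2 is dropped despite a high-scoring PSM, because only one unique peptide supports it; the shared peptide contributes to neither protein's unique count.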

Target-Decoy FDR

Decoy sequences (reversed or shuffled protein sequences) are appended to the database. A PSM matching a decoy sequence is, by definition, a false positive. The false discovery rate at the PSM level is:

\[\text{FDR} = \frac{\# \text{decoy hits}}{\# \text{target hits}}\]

At a 1% FDR threshold, about 1% of the reported target PSMs are expected to be false. This is the universally accepted standard for proteomics identifications. Without a decoy database, proteomics results are untrustworthy: the search engine will always find the best match for each spectrum, and the reported score is meaningless without knowing how many matches would be expected by chance.
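The decoy/target ratio translates directly into a score cutoff. The sketch below walks down a score-sorted PSM list and returns the most permissive cutoff at which the estimated FDR still meets the target; it assumes a concatenated target-decoy search and ignores score ties and q-value monotonization, which production tools handle.

```python
def fdr_threshold(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr, or None.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # FDR estimate at this depth: decoy hits divided by target hits.
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff

# Toy data: four confident targets, one decoy, one more target.
toy = [(9, False), (8, False), (7, False), (6, False), (5, True), (4, False)]
print(fdr_threshold(toy, max_fdr=0.10))
```

With so few PSMs the toy example needs an unrealistically loose threshold; on real data with tens of thousands of PSMs the same loop is applied at 1%.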

The protein inference problem. A single peptide may match multiple protein isoforms (shared peptides). When a peptide is shared between isoform A and isoform B, the software cannot determine which isoform contributed it. Most search engines report “protein groups” — sets of proteins that cannot be distinguished by the detected peptides. The reported abundance of a protein group is a weighted sum of all isoforms sharing the detected peptides. This ambiguity is structural: it cannot be resolved without longer peptides, different proteases, or orthogonal data (RNA-seq isoform quantification).
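The simplest form of grouping collapses proteins whose detected peptide sets are identical, since no observed evidence separates them. The sketch below implements only that case; it does not perform parsimony or subset resolution as full inference tools do, and the accession names are hypothetical.

```python
from collections import defaultdict

def protein_groups(protein_to_peptides):
    """Group proteins that are supported by exactly the same set of detected peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return sorted(sorted(members) for members in groups.values())

detected = {
    "ISOFORM_A": {"pep1", "pep2"},
    "ISOFORM_B": {"pep2", "pep1"},  # same evidence as ISOFORM_A
    "OTHER": {"pep3"},
}
print(protein_groups(detected))
```

The two isoforms end up in one group because every detected peptide is shared; a single isoform-specific peptide would be enough to split them.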

Quantification

Label-free quantification (LFQ): Peptide intensities are integrated across LC-MS runs (area under the chromatographic peak) and matched between runs by retention time and accurate mass. The “match-between-runs” feature in MaxQuant transfers identifications from one run to another for peptides that were detected but not selected for MS/MS fragmentation, increasing quantification depth by ~30-50%. LFQ assumes consistent LC performance across runs — retention time shifts of >1 minute cause matching failures.
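The idea behind matching between runs can be sketched as a tolerance search in the ($m/z$, retention time) plane. This is an illustration of the principle only, not MaxQuant's algorithm (which also realigns retention times first); the 5 ppm and 1 minute tolerances and the feature tuples are assumptions.

```python
def match_feature(identified, features, ppm_tol=5.0, rt_tol_min=1.0):
    """Find the MS1 feature in another run that matches an identified peptide.

    identified: (mz, rt_minutes) of a peptide identified by MS/MS in run A.
    features: list of (mz, rt_minutes, intensity) MS1 features detected in run B.
    Returns the candidate closest in retention time, or None if nothing matches.
    """
    mz_ref, rt_ref = identified
    candidates = [
        f for f in features
        if abs(f[0] - mz_ref) / mz_ref * 1e6 <= ppm_tol
        and abs(f[1] - rt_ref) <= rt_tol_min
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda f: abs(f[1] - rt_ref))

run_b = [(667.675, 45.3, 1.0e6), (667.674, 47.0, 5.0e5), (700.000, 45.0, 1.0e7)]
print(match_feature((667.674, 45.0), run_b))
```

The second feature has the right mass but elutes 2 minutes away and is rejected, which shows why uncorrected retention time drift breaks the transfer.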

TMT (Tandem Mass Tag): Isobaric tags label peptides from different samples (up to 18-plex with TMTpro). Each tag has the same total mass but a different distribution of heavy isotopes. The same peptide from every sample therefore co-elutes and has the same precursor $m/z$, but upon MS/MS fragmentation, reporter ions at distinct $m/z$ values (nominally 126-135 for the 18-plex set) are released. The reporter ion intensities encode relative abundance across samples.

Ratio compression. TMT suffers from a systematic bias: co-isolated peptide ions (precursors within the isolation window that are fragmented alongside the target peptide) contribute their reporter ions to the MS2 spectrum, diluting the true ratio. A true 10-fold change between conditions may appear as only 2-3 fold. Multi-notch MS3 (SPS-MS3) on the Orbitrap Fusion family reduces ratio compression by co-isolating several of the most abundant peptide-specific MS2 fragment ions and fragmenting them again, so that reporter ions are read out in an MS3 scan largely free of co-isolated interference, but at the cost of reduced sensitivity.
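The size of the effect can be estimated with a simple mixing model. Assume, as a simplification, that the interfering peptides do not change between conditions and that they contribute a fraction $f$ of the reporter signal in the reference channel. The observed ratio is then $R_{\text{obs}} = R_{\text{true}}(1 - f) + f$. The function below encodes that assumed model.

```python
def observed_ratio(true_ratio, interference_fraction):
    """Observed TMT ratio when a fraction of reference-channel signal is 1:1 interference."""
    f = interference_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("interference_fraction must be between 0 and 1")
    return true_ratio * (1.0 - f) + f

# A true 10-fold change under increasing interference.
for f in (0.0, 0.5, 0.8, 0.9):
    print(f"interference {f:.0%}: observed ratio {observed_ratio(10.0, f):.1f}")
```

Under this model a 10-fold change is compressed to between 2- and 3-fold only when interference dominates the reference channel (around 80-90%), which is plausible for a low-abundance peptide in a complex sample.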

SILAC (Stable Isotope Labeling by Amino Acids in Cell Culture). Instead of labeling peptides after digestion (TMT), SILAC incorporates heavy isotopes metabolically: cells are grown in media containing light (Arg-0, Lys-0) or heavy (Arg-10, Lys-6) amino acids. After several doublings, all proteins in the “heavy” culture are fully labeled. Samples are mixed at the cell-culture stage, digested together, and analyzed in a single LC-MS run. Since labeling occurs before any processing, light and heavy peptides experience identical handling, and because the two forms are chemically identical they co-elute. Quantification is read from the MS1 scan rather than from reporter ions in a shared MS2 spectrum, so SILAC does not suffer from TMT-style ratio compression: the mass shift (~6 Da for Lys-6, ~10 Da for Arg-10) appears as a paired peak, and the ratio of peak intensities is the relative abundance. SILAC is the gold standard for quantification accuracy but is limited to cell culture (not tissue) and to the classical 3-plex design (though super-SILAC and SILAC with neutron encoding extend its reach). It remains the reference method against which LFQ and TMT accuracy are evaluated.
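Where the heavy partner appears in the MS1 scan depends on charge state, because the mass shift is divided by $z$. A small sketch, assuming Lys-6 is $^{13}\mathrm{C}_6$ lysine and Arg-10 is $^{13}\mathrm{C}_6{}^{15}\mathrm{N}_4$ arginine and that labeling is complete:

```python
# Mass added per labeled residue, in Da.
HEAVY_SHIFT = {"K": 6.020129, "R": 10.008269}

def silac_pair_spacing(peptide, charge):
    """Spacing in m/z between the light and heavy forms of a peptide."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    total_shift = sum(HEAVY_SHIFT.get(aa, 0.0) for aa in peptide)
    return total_shift / charge

for seq in ("PEPTIDEK", "PEPTIDER"):
    print(seq, "z=2 spacing:", round(silac_pair_spacing(seq, 2), 4))
```

A tryptic peptide normally ends in exactly one K or R, so each peptide yields one predictable light/heavy pair; a peptide with a missed cleavage carries two labels and a correspondingly larger spacing.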

Phosphoproteomics and PTM Analysis

Phosphorylation is the most studied post-translational modification, but it illustrates a general principle that applies to ubiquitination, acetylation, and methylation: detecting the modification is not enough — localizing it to the correct residue is essential.

Phosphorylation site enrichment (TiO₂, IMAC) isolates phosphorylated peptides from the non-phosphorylated bulk by affinity binding. After enrichment, site localization scoring (Ascore, PhosphoRS) determines which serine/threonine/tyrosine residue carries the phosphate. The Ascore computes the probability that the observed fragment ions are best explained by phosphorylation at site A vs. site B. An Ascore > 19 ($p < 0.01$) is considered confident localization.

Why localization matters: a phosphopeptide with ambiguous site assignment cannot be interpreted biologically. Phosphorylation of Ser473 versus Thr308 on AKT has completely different functional consequences (Ser473 is required for full AKT activation; Thr308 phosphorylation alone produces partial activity). Reporting “AKT phosphorylation detected” without site localization is biologically meaningless.

Biological Interpretation

Without target-decoy FDR control (conventionally at a 1% threshold), proteomics identifications are untrustworthy. The peptide-to-protein inference problem (shared peptides across protein isoforms) means many reported proteins are actually “protein groups”: the software cannot distinguish between isoforms, and the reported abundance is a weighted sum across all isoforms sharing the detected peptides.

Phosphoproteomics requires careful site localization scoring. A peptide with multiple serines where the phosphorylation site is ambiguous cannot be assigned to a specific signaling pathway. Many published phosphoproteomics datasets report sites at resolution insufficient to distinguish which kinase was active.

The correlation between mRNA and protein abundance (~40-60%) means that transcript-level differential expression cannot substitute for proteomics. A gene with unchanged mRNA but strongly altered protein abundance may be regulated at the translational or post-translational level. Conversely, a strongly regulated mRNA may have no detectable protein change due to long protein half-life. This is not a limitation of either technology — it reflects the orthogonal biological information each provides.

The dynamic range of the proteome spans ~10 orders of magnitude from the most abundant protein (albumin in plasma) to signaling molecules and transcription factors at the low end. A mass spectrometer can measure only 4-5 orders simultaneously — the instrument detects the most abundant ions while the low-abundance ones are suppressed by the high-abundance background. This is the fundamental detection bias of shotgun proteomics: the instrument always prioritizes abundant peptides, and low-abundance proteins (transcription factors, kinases, cell-surface receptors) are systematically under-sampled. Prefractionation (SCX, high-pH RP) extends coverage by reducing sample complexity before LC-MS analysis.

DDA and DIA produce systematically different missing data patterns. DDA has high missingness (~30% of detectable peptides are not identified in any given run), driven by the stochastic precursor selection and the dynamic range limit — the top-speed precursor selection always picks the most abundant ions, and low-abundance peptides are rarely fragmented. DIA has low missingness but higher quantification noise from the complex multiplexed spectra. The choice between acquisition strategies depends on the biological question: DIA for discovery across many samples (clinical cohorts, biomarker screens) where consistent coverage matters more than absolute depth; DDA with fractionation for deep proteome coverage where quantification consistency across runs is less important; PRM (parallel reaction monitoring) or SRM (selected reaction monitoring) for targeted quantification of specific proteins where sensitivity and reproducibility are paramount.

Current Landscape (Q2 2026)

DIA (data-independent acquisition, e.g., diaPASEF on timsTOF instruments) has become the standard for deep, reproducible proteome coverage, routinely detecting >10,000 proteins per run with <5% missing values across replicates in clinical cohorts.
Single-cell proteomics (nanoPOTS, SCoPE2, Bruker timsTOF SCP) achieves ~1,000 proteins per cell, approaching the coverage needed for cell-type classification by proteotype rather than transcriptotype.
Proteogenomics (proteomics data searched against sample-specific variant databases from RNA-seq or DNA-seq) identifies tumor-specific neoepitopes for cancer immunotherapy, linking somatic mutations to protein-level drug targets.
AlphaFold-predicted structures are being integrated into MS/MS search engines to improve peptide identification in modified and structurally variable regions by predicting fragment ion intensities from 3D structure.
Proximity labeling (BioID, APEX) maps protein-protein interactions and subcellular proteomes in living cells by biotinylating proteins within a defined radius (~10 nm) of a bait protein, enabling proteomic characterization of cellular compartments and transient interactions that traditional affinity purification cannot capture.

Summary and Required Reading

Mass spectrometry measures $m/z$ — ESI ionization produces multiply charged peptides; Orbitrap/TOF/quadrupole analyzers offer different resolution and speed tradeoffs.
DDA and DIA are complementary acquisition modes: DDA is stochastic (high missingness), DIA fragments everything (low missingness, complex deconvolution via spectral libraries). DIA is now standard for clinical proteomics; DDA with fractionation for deep coverage; PRM/SRM for targeted quantification.
Target-decoy FDR is the minimum quality standard — without a decoy database, identifications are untrustworthy. The PSM-to-protein hierarchy requires FDR estimation at the PSM, peptide, and protein levels, with protein-level FDR being the most stringent constraint.
The proteome spans ~10 orders of magnitude in abundance, but an instrument covers only 4-5 — prefractionation extends coverage, and low-abundance proteins (kinases, transcription factors) are systematically under-sampled.
Protein inference is ambiguous — shared peptides mean reported “proteins” are often protein groups. Isoform-level inference requires orthogonal data.
TMT ratio compression systematically underestimates fold changes due to co-isolated peptide contamination — SPS-MS3 reduces but does not eliminate this.
PTM site localization is essential for biological interpretation — detecting phosphorylation (or ubiquitination, acetylation) is not enough; the specific residue determines functional meaning.

Required Reading

Cox & Mann: “MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification” (Nature Biotechnology, 2008).
Elias & Gygi: “Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry” (Nature Methods, 2007).

Johnson’s Rule: mRNA is a blueprint; the protein is the machine. RNA is just noisy intent.