Chapter 10: Variant Calling and Hidden Markov Models

Published

June 5, 2026

Modified

June 19, 2026

Johnson’s First Principle: A Variant is a Probability, Not a Fact

At a position with 25 reference reads and 5 alternate reads, is that “G” a true mutation or 5 sequencing errors? You cannot trust your eyes, and you cannot simply count reads. You must mathematically prove the variant by combining the known error rate of the machine with the evolutionary probability of the genome. A variant call is a posterior probability; the product of the data likelihood (derived from quality scores) and a prior (the population frequency of the variant type).

Core Concepts

Bayesian Genotyping

All modern variant callers use Bayes’ theorem to compute the probability of each possible genotype \(G\) given the observed sequencing data \(D\):

\[P(G \mid D) = \frac{P(D \mid G) \cdot P(G)}{P(D)}\]

The prior \(P(G)\) encodes population genetics expectations. In humans, the probability that a given position is heterozygous is approximately \(10^{-3}\); most positions match the reference. This prior prevents callers from being overly influenced by rare sequencing errors.

The likelihood \(P(D \mid G)\) is derived from the Phred quality scores of the reads covering the position. For a single read with base call \(b\) and quality score \(Q\):

\[P(\text{correct base}) = 1 - 10^{-Q/10}\]

For 5 alternate reads out of 30 total, the likelihood ratio comparing the heterozygous genotype (AG) to the homozygous reference (AA) is:

\[LR = \frac{P(5 \text{ alt}, 25 \text{ ref} \mid \text{het})}{P(5 \text{ alt}, 25 \text{ ref} \mid \text{hom ref})}\]

If the reads are Q30 (0.1% error rate each), the probability of observing 5 alternate reads under the homozygous reference is a binomial: \(\binom{30}{5} (0.001)^5 (0.999)^{25} \approx 8 \times 10^{-12}\). The heterozygous model assigns probability \(\binom{30}{5} (0.5)^5 (0.5)^{25} \approx 4 \times 10^{-4}\). The likelihood ratio is \(4 \times 10^{-4} / 8 \times 10^{-12} \approx 5 \times 10^7\) (overwhelming evidence for the heterozygous call. If the same reads are Q10 (10% error rate), the homozygous reference probability becomes \(\binom{30}{5} (0.1)^5 (0.9)^{25} \approx 0.02\), and the LR drops to \(4 \times 10^{-4} / 0.02 \approx 0.02\)) evidence for the reference call. The quality scores are what convert a raw read count into a statistically meaningful genotype call.

Base quality score recalibration (BQSR). Quality scores from the sequencer are systematically biased; they are calibrated globally, not per-base-context. A C followed by G in a CG dinucleotide context has a different empirical error rate than an A in an AT-rich context, even if both were assigned Q30 by the sequencer. BQSR corrects this by tabulating empirical error rates stratified by: reported quality score, base context (the surrounding 2-3 bases), and read group (flow cell lane). Using known variant sites (dbSNP) to distinguish true variation from sequencing artifacts, BQSR builds a recalibration table and adjusts each base’s quality score so that Q30 empirically means 1/1000 error rate in every context. The critical constraint: BQSR requires a set of known variant sites. For non-human species where dbSNP-equivalent resources do not exist, BQSR should be skipped, applying it without known variants will recalibrate true variation as error and degrade sensitivity.

Why an HMM? Alignment Uncertainty at the Haplotype Level

Bayesian genotyping as described above assumes reads are already correctly aligned. But alignment itself is uncertain, especially around indels, where a read may align in multiple equally plausible ways. A read spanning a 2 bp deletion relative to the reference could be aligned with the deletion in any of several positions depending on the local sequence context.

GATK HaplotypeCaller addresses this through a multi-step workflow that uses a Hidden Markov Model to account for alignment uncertainty:

Active region detection: Find positions where reads disagree with the reference beyond expected error rates. These are regions where variation is plausible.
Local de novo assembly: Extract all reads in the region and assemble candidate haplotypes using a De Bruijn graph. This generates the most likely alternative sequences given the reads, without assuming the reference is correct.
Pair-HMM alignment: For each read-haplotype pair, compute the probability that the read originated from that haplotype. This is an HMM because it models the alignment path as a sequence of hidden states (match, insert, delete), marginalizing over all possible paths rather than choosing a single best alignment.
Bayesian genotyping: For each position and each candidate allele, compute \(P(G \mid D)\) using the pair-HMM likelihoods and the population prior.

The pair-HMM is the computational core. Unlike Smith-Waterman, which finds a single optimal alignment, the pair-HMM sums over all possible alignments:

\[P(\text{read} \mid \text{haplotype}) = \sum_{\text{all alignments}} P(\text{alignment}) \cdot P(\text{read bases} \mid \text{alignment})\]

The HMM has three hidden states at each position: M (match: read base aligns to haplotype base), I (insert: read has an extra base relative to the haplotype), and D (delete: haplotype has a base the read skips). Transitions between these states model the biology of sequencing: M→M is the most probable (continuous alignment), M→I and M→D are less probable (an indel event), and the model penalizes rapid state flipping (I→D or D→I transitions are nearly disallowed). Each state emits a base with a probability derived from the Phred quality score: in state M, the read base should match the haplotype base with high probability; in state I, the read base is “extra” and gets a uniform or low probability; in state D, no read base is emitted. The full likelihood is computed by the forward algorithm, which sums over all paths through this state machine in \(O(\text{read length} \times \text{haplotype length})\) time.

This marginalization accounts for alignment uncertainty. If a read can align to a haplotype in two equally good ways (e.g., a homopolymer where the deletion could be at any position), Smith-Waterman picks one arbitrarily, while the pair-HMM properly averages over both possibilities. The result is a likelihood that reflects true uncertainty rather than a false sense of precision.

VCF Format and Quality Triaging

The VCF (Variant Call Format) encodes each variant call with fields that allow downstream filtering:

QUAL: Phred-scaled probability that the variant is not present (\(-10 \log_{10} P(\text{no variant})\))
FILTER: Pass or failed quality thresholds (low QUAL, low depth, strand bias)
AD (allelic depth): Count of reads supporting reference and each alternate allele
GQ (genotype quality): Phred-scaled confidence in the assigned genotype
PL (phred-scaled likelihood): Normalized likelihoods of each possible genotype

A VCF with FILTER=PASS for all entries but inadequate depth or poor quality scores is a VCF that should not be trusted. The FILTER field only checks predefined thresholds; it does not guarantee biological validity. Each of the numeric fields (AD, GQ, PL) should be inspected per variant before accepting a call.

Germline vs. Somatic Calling

Germline calling (GATK HaplotypeCaller, DeepVariant) expects variant allele frequencies of approximately 50% (heterozygous) or 100% (homozygous). The genotype likelihood model and the prior are optimized for these expectations.

Somatic calling (Mutect2, Strelka2) must detect subclonal mutations at any VAF from 1-50%, requiring: - A matched normal sample to filter out germline polymorphisms - A different prior that accommodates low VAFs - Cross-sample contamination modeling (subclonal mutations in the normal appear as germline variants)

The distinction is critical: a germline caller applied to tumor data will systematically reject subclonal mutations as noise because they violate the 50% VAF expectation. Conversely, a somatic caller applied to germline data may call heterozygous variants as low-VAF mutations.

Biological Interpretation

A variant call at 10% VAF in a tumor sample may be: a true clonal mutation in a heterogeneous tumor, an artifact from FFPE deamination (C>T transitions), a read misalignment around a homopolymer, or a germline polymorphism in a sample with contamination. The distinction requires inspecting read-level evidence; the BAM, not just the VCF.

The GATK “best practices” workflow is optimized for human germline short-read data. Applying it without modification to non-human species, long-read data, or somatic samples introduces systematic biases. Base quality score recalibration (BQSR) requires known variant sites (dbSNP): these do not exist for most non-human species, so BQSR should be skipped or adapted.

A matched normal is not optional for somatic calling. Population databases (gnomAD) cannot substitute for the patient’s own germline; they exclude rare private germline variants that are the most likely to be mistaken for somatic mutations.

Summary and Required Reading

Bayesian genotyping combines prior mutation rates with Phred-derived likelihoods. A variant without sufficient evidence is a statistical uncertainty, not a fact.
Pair-HMM alignment marginalizes over alignment uncertainty; unlike Smith-Waterman, it does not commit to a single optimal alignment path.
Germline callers expect 50/100% VAF: somatic calling requires a matched normal and different likelihood models.
VCF encodes probability, not truth: FILTER, AD, GQ, and PL fields must be inspected per variant before trusting any call.

Required Reading

DePristo et al.: “A framework for variation discovery and genotyping using next-generation DNA sequencing data” (Nature Genetics, 2011).
Poplin et al.: “Scaling accurate genetic variant discovery to tens of thousands of samples” (bioRxiv, 2017).

Johnson’s Rule: The confidence in a variant call is only as strong as the weakest link in the quality chain.