Chapter 7: Sequencing Physics and Data Formats

Johnson’s First Principle: File Formats Encode Mechanical Laws

A FASTQ file from an Illumina sequencer encodes substitution error probabilities derived from reversible terminator fluorescence decay. A FASTQ file from an Oxford Nanopore sequencer encodes indel probabilities derived from neural network basecalling of ionic current. Both use the same ASCII character encoding, but the biological meaning of a Q30 score is fundamentally different — one predicts a substitution, the other predicts an indel. If you treat both as “just text files” and apply the same quality filter, you are applying the wrong error model to your data.

File formats are not arbitrary containers. They are the digital translation of the physical measurement made by the sequencer — and you cannot process a file correctly if you do not understand the physics of the machine that generated it.


Core Concepts

Error Profiles Determine Analytic Validity

Every sequencing platform has a characteristic error signature — the type, rate, and positional distribution of mistakes it makes. Choosing a sequencing platform is choosing an error profile, and that profile determines what analyses are valid:

Technology Measurement Error Mode Read Length Throughput (per run)
Illumina SBS Fluorescence (reversible terminators) Substitutions, 3’ quality decay 150-300 bp 10+ billion reads
PacBio SMRT Real-time fluorescence (ZMW) Random indels (correctable by CCS) 10-100 kb ~1 million CCS reads
Oxford Nanopore Ionic current (motor protein) Indels, homopolymer errors 10-200 kb ~100 million reads
Sanger Capillary fluorescence Quality decay with read length 400-900 bp Single read

Illumina sequencing-by-synthesis remains the dominant technology. DNA fragments are bridge-amplified into clonal clusters on a flow cell. Reversible terminator nucleotides with distinct fluorophores are incorporated one cycle at a time. After each cycle, a fraction of molecules fail to incorporate (phasing) or incorporate two nucleotides (pre-phasing), and the fraction of synchronized molecules decays exponentially with cycle number: \(S(n) = S_0 e^{-\lambda n}\) where \(\lambda\) is the combined phasing + pre-phasing rate per cycle. When \(S(n)\) drops below the detection threshold (~cycles 150-300), the read terminates — this is a fundamental physical limit of the sequencing chemistry, not a software truncation. Raw output is in BCL format; demultiplexing and base calling convert it to FASTQ (a step where index hopping artifacts can arise from misassigned barcodes during multiplexed sequencing). The error signature is dominated by substitutions (particularly at the 3’ end where quality decays) with a very low indel rate. Substitution error rate is ~0.1-1%.

PacBio SMRT sequencing immobilizes a single DNA polymerase at the bottom of a zero-mode waveguide (ZMW). Fluorescently labeled nucleotide incorporation is detected in real time. Continuous long reads (CLR) achieve 10-100 kb with ~15% error, but the errors are random indels rather than substitutions — a critical distinction, because random indels can be corrected by consensus. Circular consensus sequencing (CCS/HiFi) ligates hairpin adapters to both ends of the template, creating a circular molecule that the polymerase reads around multiple times. The CCS algorithm detects pass boundaries and computes a consensus: with ~15% single-pass indel error and 10+ passes, the consensus error drops to ~0.1% (\(0.15^{10} \approx 3\times 10^{-9}\) in theory; systematic biases limit accuracy to ~99.9% in practice). The tradeoff is throughput — each CCS read requires many passes, reducing the number of unique molecules sequenced.

Oxford Nanopore sequencing uses a motor protein to feed a single DNA strand through a protein nanopore embedded in an electrically resistant membrane. As 3-5 nucleotides occupy the pore simultaneously, each k-mer produces a characteristic ionic current level. Because the same base produces different currents depending on its flanking nucleotides (a non-linear, context-dependent mapping), a neural network basecaller (Guppy, Dorado) decodes the current signal into bases. Error rates have improved rapidly (currently ~1-5% for raw reads, Q20+), but the error profile is distinct: predominantly insertions and deletions in homopolymer regions (strings of identical bases like AAAAA vs. AAAAAA). Substitutions are rarer. This has implications for variant calling — indels in homopolymers are the most common artifact.

FASTQ and the Meaning of Quality Scores

The FASTQ format encodes each read as four lines: @header, sequence, +, and quality scores. The Phred quality score transforms error probability into a convenient logarithmic scale:

\[Q = -10 \log_{10}(P_{\text{error}})\]

Q Score Error Probability Base Call Accuracy
Q10 10% 90%
Q20 1% 99%
Q30 0.1% 99.9%
Q40 0.01% 99.99%

The same Q score from different platforms encodes fundamentally different error types. An Illumina Q30 means there is a 0.1% chance that the base is a substitution error. A Nanopore Q30 means there is a 0.1% chance that the base is an indel error relative to the true sequence. When filtering reads or computing variant likelihoods, the error type matters as much as the error rate. A substitution error model applied to Nanopore data will systematically underestimate the probability of indels and overestimate the probability of perfect matches.

Illumina uses Phred+33 encoding (ASCII offset 33). Older Solexa data used Phred+64. BWA and SAMtools detect the encoding automatically in most cases, but format conversion scripts can misidentify the offset — a perennial source of hard-to-diagnose quality filtering bugs.

SAM/BAM/CRAM: Alignment Formats

The SAM (Sequence Alignment/Map) format records each aligned read as a line with 11 required tab-separated fields. The critical fields for filtering:

Field Name What It Encodes
1 QNAME Read name
2 FLAG Bitwise flag — encodes pairing, strand, mapping status
3 RNAME Reference sequence name (chromosome)
4 POS 1-based leftmost mapping position
5 MAPQ Mapping quality: \(-10 \log_{10} P(\text{read is misplaced})\)
6 CIGAR Compact representation of alignment operations (matches, mismatches, indels, soft clips)

The FLAG field is a bitwise integer: each bit encodes a boolean property. FLAG = 83 (binary 1010011) means “paired, mapped in proper pair, mate is reverse strand, and this is read 1.” Understanding FLAG filtering is essential for extracting properly mapped paired-end reads and excluding unpaired, unmapped, or discordant alignments.

The MAPQ score is not a general “confidence” measure — it specifically encodes the probability that the read is placed at the wrong genomic location. A read that maps uniquely to a repetitive region can have high MAPQ but be placed on the wrong repeat copy. A read with MAPQ 0 may be correctly assigned if the true position is ambiguous but the read genuinely comes from that region.

BAM is the binary, compressed version of SAM, using BGZF compression. CRAM achieves 30-60% further compression over BAM by storing only the differences from the reference genome rather than the full aligned sequence. CRAM is the standard for large-scale archival (1000 Genomes, UK Biobank) but requires the reference genome to be available for decompression.

The alignment information in SAM/BAM/CRAM is the product of the search heuristics described in Chapter 8 — the BWT/FM-index, minimizers, and seed-and-extend algorithms that determine where and how each read maps. A BAM file records the output of these algorithms, but interpreting the FLAG and MAPQ fields requires understanding the algorithmic decisions behind them.

Coordinate Systems and Off-by-One Errors

Format Coordinate System Example
BED 0-indexed, half-open chr1 1000 2000 = bases 1000-1999 (length 1000)
GFF/GTF 1-indexed, closed chr1 1000 2000 = bases 1000-2000 (length 1001)
VCF 1-indexed POS = leftmost position of variant
SAM/BAM 1-indexed (POS field) Reference position of leftmost aligned base

Converting between formats requires adding or subtracting 1 from the start coordinate. This is the single most common format-related bug in bioinformatics: a BED-to-GFF conversion that forgets to subtract 1 produces intervals shifted by one base. For a 1 Mb interval this is negligible, but for a transcription factor binding motif (6-12 bp), a 1 bp shift can completely change the motif content of the peak. A typical scenario: a collaborator provides ChIP-seq peaks in BED format; importing them into R with rtracklayer (which defaults to 1-indexed GFF) shifts every peak coordinate by +1, moving the motif entirely out of the peak call for narrow binding events.


Biological Interpretation

The error profile of your sequencer determines the limits of what you can detect. Illumina substitutions are tolerable for germline SNP calling at moderate allele frequencies (20-50% VAF where the signal exceeds the error floor of 0.1%) but catastrophic for detecting rare somatic mutations at 1% VAF — a true variant is indistinguishable from a sequencing error at the same rate. PacBio HiFi indels are systematic (homopolymer) and correctable with moderate coverage (15x+). Nanopore direct RNA-seq can detect base modifications (m6A, pseudouridine) because modified bases produce distinct current patterns during translocation, but requires matched untreated controls to distinguish modifications from basecalling errors.

FASTQ quality scores from different platforms are not directly comparable. An Illumina Q30 and a Nanopore Q30 both mean “1 in 1,000 probability of error,” but the expected structural consequences differ. A read with MAPQ 0 from a DNA aligner (BWA) may map perfectly if aligned with a splice-aware aligner (STAR) — the aligner encodes an assumption about the type of variation expected, and applying the wrong assumption discards valid data.

The practical takeaway: before starting any analysis, document the sequencing platform, its error profile, and the format assumptions of your tools. A pipeline that works for Illumina germline data will silently fail on PacBio HiFi somatic data or Nanopore direct RNA data — not because the code is wrong, but because the error model is wrong.


Current Landscape (Q2 2026)

  • Ultima Genomics and Singular Genomics are challenging Illumina’s dominance with alternative sequencing-by-synthesis chemistries at lower cost per Gbp, but their error profiles are still being characterized by the community.
  • Revio (PacBio) enables HiFi sequencing at scale (thousands of human genomes per year), making long-read sequencing routine for clinical-scale projects.
  • PromethION 2 (Oxford Nanopore) achieves terabyte-scale output per flow cell with improved raw-read accuracy (Q20+), enabling human genome assembly on a single flow cell.
  • POD5 replacing FAST5 for Nanopore data and BAM v2 specification for long reads are simplifying cross-platform analysis standardization as file formats adapt to accommodate new sequencing technologies.

Summary and Required Reading

  1. Platforms differ in error type, not just error rate — Illumina makes substitutions, PacBio makes random indels, Nanopore makes homopolymer errors. The same Q score from different platforms means different things.
  2. Phred scores are logarithmic error probabilities — Q30 = 1/1000 error rate, but the error type (substitution vs. indel) depends on the platform.
  3. File formats encode coordinate systems — BED (0-indexed), GFF/VCF (1-indexed). Off-by-one errors are the most common format-related bug in bioinformatics.
  4. BAM/CRAM are production formats — CRAM stores differences from reference for 30-60% compression over BAM, but requires the reference for decompression.

Required Reading

  • Cock et al.: “The Sanger FASTQ file format for sequences with quality scores” (Nucleic Acids Research, 2010).
  • Li et al.: “The Sequence Alignment/Map format and SAMtools” (Bioinformatics, 2009).

Johnson’s Rule: If you do not know the error profile of your sequencer, you do not know your data.