Independent scientist

Independent scientist
STAMPS 2017 Robert Edgar Independent scientist Next-Gen Reads

FASTQ files Text file with four lines per read
Label Sequence + Q scores Format not fully standardized Different conventions for representing Q scores as letters Software may have different max & min Q scores Typical is Q2 to Q40

Illumina read Index (barcode) MiSeq#141 Spot coords. 1 = R1 2 = R2
1. Label 2. Sequence 3. + 4. Quals

Demultiplexing Assign reads to samples
Using barcode sequence (tag, index) Embedded in PCR primer Incorporated into sequencing construct Illumina: usually separate read Single- or dual-index One or two index reads 454, Ion, PacBio... usually in the read

Illumina demultiplexing
-rw-rw-r-- 1 robert robert 60M Apr 14 07:29 Sample1_S17_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 31M Apr 14 07:28 Sample1_S17_L001_R2_001.fastq -rw-rw-r-- 1 robert robert 43M Apr 14 07:28 Sample2_S18_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 23M Apr 14 07:28 Sample2_S18_L001_R2_001.fastq -rw-rw-r-- 1 robert robert 50M Apr 14 07:29 Sample3_S19_L001_R1_001.fastq -rw-rw-r-- 1 robert robert 26M Apr 14 07:28 Sample3_S19_L001_R2_001.fastq Linux directory listing with read files created by MiSeq FASTQ = read file format R1 = forward read R2 = reverse read R1 + R2 for each sample Barcodes from index reads identified and discarded by Illumina software Remaining sequence is 16S (you should delete primer-binding bases)

Quality (Phred) scores
Integer Q2 .. Q40 Represents P_error, probability base is wrong Q40: P_error = % good Q30: P_error = % good Q20: P_error = % good Q10: P_error = % wrong Q3: P_error = % wrong Q2: P_error = % wrong!!

Quality filtering Discard poor-quality data
Poor quality = high probability of error(s) low Q scores Genomics can mask out low-Q positions e.g. for SNP-calling

Quality filtering Amplicon sequencing different scenario
Need pair-wise comparisons for most analyses pairs of reads, or reads & database to calculate identity or determine if sequences identical Masked / ambiguous positions (Ns) problematic Variable length (e.g. truncated at low Q) also problematic OTU clustering "Harmful" reads >3% errors create spurious OTUs High diversity in harmful reads Many spurious OTUs even if harmful reads small fraction

Truncating at low Q is bad idea
Read quality often falls towards end of read Popular (but bad!) to truncate when Q low QIIME, sickle Do A and B have identical sequences? If Yes, dubious tail gets high abundance If No, good prefix gets low abundance

Global trimming Similar/identical reads should be globally alignable with few/no terminal gaps Comparisons unambiguous Cannot have A identical (or >97% similar) to prefix of B Unpaired reads: truncate to fixed length Trim low-quality tails Important for 454, Ion Torrent, Pac Bio Often not needed for Illumina

Global trimming Full-length amplicons with varying length ok
e.g. overlapping paired reads trim to primers ok no terminal gaps when same / closely related Fwd V4 Rev Fwd V4 Rev Fwd V4 Rev Fwd V4 Rev

Quality filtering methods
þ Minimum Q Ok if Q is large, e.g. Q≥20 (P_error=1%), not small e.g. Q3 Don't truncate at variable lengths Unnecessarily stringent, expected errors is better ý Average Q over sliding window Sickle Math mistake -- averaging logarithms makes no sense Errors dominated by small Qs, lost by averaging

Quality filtering methods
ý QIIME filter Truncate ( ) read if >3 consecutive bases with Q≤3 Q=3 means P_error = 50% Allows reads with many errors! ý PANDAseq method t = geometric mean (!?) of P_correct along read ≥ 0.6 P_error = 0.4 Much too high, allows reads with many errors Better with higher t, but not as good as expected errors See Edgar & Flyvbjerg 2014 for details

þ Expected error filtering
A T C 20 3 10 40 25 2 Avg. Q = 23 (Perror = 0.005), seems pretty good... But, the most probable number of errors = 1 because of the Q3 and Q2 bases.

Expected errors Expected errors = mean 50% errs

Expected errors Expected errors (E) in a read
E = mean over large set with random errors according to Q scores real-valued (because it's an average) always > 0 can be < 1

Expected errors Surprisingly easy to calculate E
Sum the error probabilities E = sum P_error Most probable number of errors E* E* = largest integer ≤ E = floor(E) Proofs in Edgar & Flyvbjerg 2014.

Expected error filter Discard reads with E>1
Keep reads with E*=0 Most probable number of errors = zero Typical performance on MiSeq 2x250 V4 keeps ~75% of the reads 2/3 of filtered reads are correct (zero errors) 1/3 have one or more bad bases

Expected error filter Works well if Q scores are accurate
Illumina Q scores are pretty good 454, I-T, PacBio not so good Q scores predict indel errors, not substitutions filtering not so effective expected error filter still best method Max E=1 suggested default Not a requirement! (note for comparative validation) Larger E for less stringent filtering Smaller E for very stringent filtering

Expected error filtering
Critics: allege too stringent high cost in sensitivity, diversity Reads are not lost! Most filtered reads map to OTUs after clustering Filtering is critically important to suppress spurious OTUs High sensitivity to rare species not possible Contaminants, cross-talk... Limit of resolution abundance > ~0.5% of reads

Expected vs. measured errors

Quality filter performance
QIIME and PANDAseq filters leave tens of thousands of reads with >3% errors, thousands of spurious OTUs

Paired read merging / assembly

Paired read merging Two observations of each base in overlap
Should increase/decrease Q if match/mismatch Use Bayes' Theorem to get posterior P_error Correct equations in Edgar & Flyvbjerg (2015) Previous papers got this wrong! Aligned region has highest quality R1s good quality R2s lower quality Position in read (MiSeq 2x300 after merging by USEARCH)

Paired read assemblers

Do we need full overlap? V4 is ~250nt 2x250 PE reads give full overlap
Better error correction? Accurate OTUs with UPARSE on R2s only! Longer amplicons ok, e.g. V3-V4 (400nt) better resolution

Dereplication Find the unique sequences in the reads
and their abundances Abundance is a very useful signal High-abundance sequences almost certainly correct if (and only if!) globally trimmed Errors increasingly common at lower abundances Combine reads from all samples Strongest abundance signal

Singletons Abundance = 1 Random errors usually singletons
Not usually reproduced by chance Systematic errors may have ab. > 1 Polymerase errors & chimeras (amplified by PCR) Sequencing error usually pretty random

Discard singletons After filtering, many reads with >3% errors
Sequencer error Polymerase copying errors Chimeras Most of these are singletons Discard singletons before clustering Necessary to minimize spurious OTUs Most singletons map to OTUs after clustering, not lost!

Discard singletons Critics: allege high cost in sensitivity, diversity
Effect on sensitivity minimal / meaningless By definition, found once in one sample! Ecologically irrelevant (or not possible to interpret) Sensitivity is < 100% even with singletons Sampling effects, e.g. rare species missed Primer mismatches ("universal" = ~80% - 90%) Some / many rare species missing regardless Estimators like Chao-1 nonsense for 16S

Delete primer-binding sequences
PCR tends to substitute mismatches Delete n bases if primer is length n USEARCH fastx_truncate command

Independent scientist

Similar presentations

Presentation on theme: "Independent scientist"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Independent scientist

Similar presentations

Presentation on theme: "Independent scientist"— Presentation transcript:

Similar presentations

About project

Feedback