Independent scientist

STAMPS 2017 Robert Edgar Independent scientist OTUs

Unidentified Taxonomic Objects
Defining, interpreting and assessing next-generation OTUs

"Standard" OTU
Cluster of reads with 97% identity
Assume that approximately...
OTUs ~ species
Read abundance ~ species abundance
WRONG!

Mock community test
100k reads of 22 strains (bacteria + archaea)

Method              OTUs
UPARSE              22
QIIME (open-ref)    1,646
mothur (OptiClust)  30,984

Try it yourself!
Data from Bokulich et al. Nat. Meth. 2013
After STAMPS -- or in a lab -- try this data with the software package you plan to use.
Runs with free USEARCH.

QIIME alpha diversity analysis (rarefaction)
[Figure: rarefaction plot; the value 22 is marked on the plot]

Almost all QIIME and mothur OTUs are errors.

What are OTUs?
Operational Taxonomic Unit
Sokal & Sneath in 1960s
Numerical taxonomy
Matrix of characters
Make tree by UPGMA (1958)
Not the "right" method any more!
Before sequence data and modern phylogenetic tree algorithms (Neighbor-Joining, Maximum Likelihood etc.)

Bacterial "species"
No sexual reproduction
Every mutation = bifurcation: daughter cells have SNP, siblings don't (ignoring lateral gene transfer)
Single-cell genome differences: 1, 2, many ... back to LUCA
Species concept problematic
Strain = few diffs
A bit fuzzy, but better than "species"

Operational definition
Arbitrary similarity threshold for species
Wayne et al. 1987
Two cultures, similar phenotype
DNA-DNA reassociation >70% = same species

Magic number 97
97% 16S identity ~ 70% DNA-DNA reassociation
Stackebrandt and Goebel (1994)
Modern value 98.5%?

From pair-wise to clustering
FastGroup clustering software for 16S
Seguritan and Rohwer 2001
Used 97% threshold citing Stackebrandt & Goebel
Propagated to DOTUR, mothur, QIIME...

From pair-wise to clustering
Ideal 97% clustering
All pairs in same cluster >97%
All pairs in different clusters <97%
[Figure: OTU_1 and OTU_2; pairs within an OTU > 97%, pairs between OTUs < 97%]

97% rule doesn't work
By the 97% rule...
A+B same OTU
B+C same OTU
A+C different OTU
... oops!
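The A/B/C conflict can be searched for mechanically. The following is a minimal sketch, assuming pairwise identities are already available in a dictionary; the function name `impossible_triplets` and the toy identities are illustrative, and the threshold test uses `>=` as an assumption about how ties at exactly 97% are handled.

```python
from itertools import combinations

def impossible_triplets(names, identity, threshold=0.97):
    """Return triplets in which exactly two of the three pairs meet the threshold.

    Such a triplet cannot be clustered consistently: two pairs say "same OTU"
    while the third pair says "different OTU".
    identity maps frozenset({a, b}) -> fractional identity (assumed input format).
    """
    bad = []
    for a, b, c in combinations(names, 3):
        pairs = [(a, b), (b, c), (a, c)]
        linked = sum(identity[frozenset(p)] >= threshold for p in pairs)
        if linked == 2:
            bad.append((a, b, c))
    return bad

# Toy identities matching the slide: A+B and B+C are similar, A+C is not.
toy_identity = {
    frozenset(("A", "B")): 0.98,
    frozenset(("B", "C")): 0.98,
    frozenset(("A", "C")): 0.96,
}
print(impossible_triplets(["A", "B", "C"], toy_identity))
```

Running the same check over all pairwise identities of a real set of sequences is one way to count triplets like those reported for Lactobacillus.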

Common in practice
Example: Lactobacillus
Lacto species are often >97%
Phenotype can vary within a 97% OTU
V4 sequences for 37 Lacto. species
47 "impossible" triplets

No "right" answer for 97% OTUs
Agglomerative clustering: single-, complete- and average-linkage (UPGMA)
OptiClust (Westcott & Schloss 2017)
Greedy clustering (not designed for OTUs!): CD-HIT, UCLUST
Closed- and open-reference (QIIME)
SWARM
UPARSE

Can 97% clusters define taxa?
OptiClust paper, Westcott & Schloss (2017)
Seem to claim that 97% rule defines taxa
Matthews Correlation Coefficient (MCC) metric measures OTU clustering "accuracy"
OptiClust attempts to maximize MCC

Matthews Correlation Coeff.
TP = number of pairs in the same cluster which have ≥97% identity
FN = number of pairs in different clusters which have ≥97% identity
TN = number of pairs in different clusters which have <97% identity
FP = number of pairs in the same cluster which have <97% identity
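For reference, the four pair counts combine into MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), the standard definition of the coefficient. The sketch below counts the pairs for a given clustering and evaluates that formula; the input formats and function names are assumptions made for illustration, and this is not the OptiClust implementation.

```python
from itertools import combinations
from math import sqrt

def pair_counts(clusters, identity, threshold=0.97):
    """Count TP, FP, TN, FN over all sequence pairs.

    clusters maps sequence name -> cluster label.
    identity maps frozenset({a, b}) -> fractional identity.
    """
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(clusters), 2):
        same_cluster = clusters[a] == clusters[b]
        similar = identity[frozenset((a, b))] >= threshold
        if same_cluster and similar:
            tp += 1
        elif same_cluster:
            fp += 1
        elif similar:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy triplet: A+B and B+C are >=97%, A+C is below 97%.
toy_identity = {
    frozenset(("A", "B")): 0.98,
    frozenset(("B", "C")): 0.98,
    frozenset(("A", "C")): 0.96,
}
split = {"A": 1, "B": 1, "C": 2}  # put A+B together, C alone
counts = pair_counts(split, toy_identity)
print(counts, mcc(*counts))
```

Returning 0.0 for a zero denominator is a common convention, chosen here as an assumption; it arises, for example, when all sequences are lumped into one cluster.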

Lactobacillus OTUs
Five possible sets of clusters
[Figure: cluster sets 1, 2, 3 (lump, split, merge closest) labeled "Reasonable"; sets 4, 5 labeled "Unreasonable"]

MCC doesn't work!
[Figure: cluster sets 1 to 5]

97% OTUs reasonable
If sequence errors are not important...
But, PCR + NGS cause huge diversity of errors!

Errors trump clustering
Unfiltered errors are a huge problem
Filtering errors is much more important than the clustering method

Tolstoy's paradox
Even if most bases are good, most unique sequences are bad, because all good reads are alike, but every bad read is bad in its own way.

Toy Tolstoy
Read length 250 nt
Very high quality: all bases are Q30
Q30 means P_error = 10^-3 = 1/1000
One letter in a thousand is wrong
L = 250, so 1,000 letters = four reads
One in four reads has a bad base

Toy Tolstoy
Make 100 reads of one template (say, E. coli)
1/4 reads is bad
75 correct reads, 25 bad reads
Bad reads are all different
26 uniques, 1 good
25/26 = 96% bad sequences!
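The toy arithmetic can be checked in a few lines. The sketch below assumes the standard Phred relation P_error = 10^(-Q/10), under which a per-base error rate of 1/1000 corresponds to Q30, independent errors at each base, and that every bad read is a distinct sequence; the function names are illustrative.

```python
def fraction_bad_reads(q, length):
    """Probability that a read of the given length has at least one wrong base,
    assuming every base has Phred score q and errors are independent."""
    p_error = 10 ** (-q / 10)
    return 1 - (1 - p_error) ** length

def toy_tolstoy(n_reads, bad_fraction):
    """Return (good reads, bad reads, unique sequences, fraction of uniques that are bad),
    assuming all good reads are identical and every bad read is different."""
    bad = round(n_reads * bad_fraction)
    good = n_reads - bad
    uniques = bad + (1 if good else 0)
    return good, bad, uniques, bad / uniques

print(fraction_bad_reads(30, 250))  # a little over 0.22, rounded to "one in four" in the toy
print(toy_tolstoy(100, 0.25))
```

The exact per-read failure rate is somewhat below one in four because a few reads carry more than one error; the conclusion that nearly all unique sequences are bad is unchanged.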

Tolstoy's paradox in practice
Uniformly high quality, as in the toy example, is unrealistic
Have to deal with lower-quality reads
Vast majority of uniques have errors
Most bad reads are harmless
One bad base, probably clustered into correct OTU
Reads with >3% errors always harmful
Reads with <3% errors sometimes harmful
Harmful = causes spurious OTU

Tail of error distribution
Average base error rate is misleading
[Figure: probability of causing a spurious OTU; from Edgar & Flyvbjerg 2014]

Good reads ... bad OTUs
0.1% bad letters
75% good reads
95% bad sequences
Many bad OTUs!
Spurious OTUs created by reads with >3% errors
[Figure labels: good OTU, good sequence]

Paralogs
Prokaryotes often have multiple SSU operons
Mean is ~four copies
E. coli has seven 16S genes
Sequences can vary

PCR chimeras
Chimeras made from two templates
Look like valid sequences
Often have very few diffs
Difficult to detect
Unfiltered chimeras cause spurious OTUs

Why make 97% clusters?
Merge strains from one species
Who cares about the vague notion of "species"?
Strains may have different phenotypes
Merge paralogs from one strain
Merge bad reads with correct sequence

Mothur OTUs
Definition and objectives not clear
OTUs are clusters, no special sequence
Doesn't distinguish good / bad sequences
Recall Tolstoy's paradox -- almost all sequences are bad!
Large majority of OTUs are spurious
Inadequate filtering of errors

QIIME OTUs
Definition and objectives not clear
Closed-reference and open-reference; more later
OTU is represented by one sequence
Member (non-representative) sequences discarded
Large majority of OTUs are spurious
Inadequate filtering of errors
Closed-reference often splits species over multiple OTUs

SWARM OTUs
Definition and objectives not clear
OTUs "defined" by output of an algorithm
Complicated; connection to biology not explained
OTU = representative sequence; other sequences discarded
Works quite well for smaller read depths

UPARSE OTUs
Goal: find a subset of correct biological sequences such that no pair is >97% identical
OTU is a sequence
Most abundant in its neighborhood (3%); almost always a correct biological sequence
Reads are assigned to the most similar OTU (>97%), not the most abundant, because identity is a better signal of same strain or species
Will describe algorithm later

Denoiser OTUs / ZOTUs / SVs
UNOISE2, DADA2 algorithms
Goal: report all correct biological sequences
ZOTU = "zero-radius OTU" = 100% identity OTU
SV = sequence variant
100% OTU just as useful / valid as 97% OTU
Less sensitive to very low-abundance strains
Combine samples to boost abundance

Lumping / splitting
Assume error-free sequences
All types of OTU split & lump sometimes
ZOTUs split more often
97% OTUs lump more often
ZOTUs have better resolution: separate more distinct phenotypes
Splitting of a strain over two ZOTUs is a benign problem, better than merging two phenotypes into one 97% OTU

Denoise, then cluster?
Denoising = all correct biological sequences
Ideally, removes all error from the data
If it works (it does...), then denoising should be first step in any analysis!
Example: denoise then make 97% OTUs
But why throw away information?
Strains useful: ~same genome & same phenotype
Species not very useful concept

Validating OTUs
Need definition / goal for OTUs to...
Verify / falsify
Classify as correct / incorrect
Measure quality as good / mediocre / bad
Need test data with known composition
Mock communities: mix of known strains
Typical ~20, up to ~100

Westcott & Schloss 2017 mSphere
Use MCC to measure OTU "accuracy"
Assume MCC obvious / uncontroversial, but...
MCC doesn't work
UPARSE better in my opinion...
...so MCC is controversial as a gold standard
Don't test on mock data
Test data very noisy due to poor filtering
31,000 OTUs from 22 strains!?

Bokulich et al. 2013 Nat. Methods

Bokulich et al.
Read quality filtering
Reject reads with runs of bases with Q≤3 (P_error ≥ 0.5)
Make OTUs with UCLUST
Discard OTUs with abundance < c (0.005%)
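The last step, discarding OTUs below a relative abundance threshold c, can be sketched as below. This is an illustration of the thresholding idea only, assuming abundance is measured as a fraction of all reads; the function name and the toy counts are made up, and this is not the QIIME code.

```python
def filter_low_abundance(otu_counts, c=0.00005):
    """Keep OTUs whose share of all reads is at least c (0.005% = 0.00005).

    otu_counts maps OTU name -> read count. Reads belonging to discarded
    OTUs are dropped together with the OTU.
    """
    total = sum(otu_counts.values())
    return {otu: n for otu, n in otu_counts.items() if n / total >= c}

# Toy table with 100,000 reads: the threshold corresponds to 5 reads.
toy_counts = {"OTU_1": 99990, "OTU_2": 8, "OTU_3": 2}
print(filter_low_abundance(toy_counts))
```

Note that the two reads in `OTU_3` are lost along with the OTU, whether or not they came from a genuine low-abundance strain.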

Bokulich et al.
RDP OTU = UCLUST OTUs with the same predicted genus
Richness = number of RDP OTUs, not UCLUST OTUs!
Discard OTU if RDP Classifier does not predict a genus, about 50% of OTUs.
Unstated filtering step not used in practice.

Bokulich et al.
Claim that richness is good because number of RDP OTUs is ~ number of strains
Nonsense -- doesn't work on real data!
Most genera don't have names
Richness by RDP OTUs is vastly under-estimated
Thousands of UCLUST OTUs
Richness by UCLUST OTUs is vastly over-estimated

Bokulich et al.
How to measure richness in practice?
No answer in the paper.
Default QIIME filtering is not Bokulich et al.'s
OTU abundance threshold (c) not used

Bokulich et al. True -- and their own filtering strategy has this exact problem! Low-abundance reads are discarded with low-abundance OTUs (their c parameter).

Mock testing by richness
Number of OTUs ~ number of strains
Risk of over-fitting parameters
Right number for wrong reason
Example: Kozich et al. 2013 (cross-talk, contaminants)

Mock testing by OTU sequence

Category     Description
Perfect      100% identical to biological sequence.
Good         ≥99% identical to biological sequence. Could be chimera and / or noisy read.
Noisy        ≥97% identical to biological sequence. Could be chimera and / or noisy read.
Chimera      "Bad" chimera >3% from biological sequence.
Contaminant  Sequence found in large ref. db. (SILVA).
Other        None of the above. Could be a novel contaminant, or -- much more likely -- have >3% errors.
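The categories can be applied as a simple decision cascade. The sketch below assumes the identity to the closest mock sequence has already been computed, that chimera status and reference-database hits are supplied as flags, and that categories are tested in the order listed; the function name and that precedence are assumptions, not a description of the original analysis scripts.

```python
def classify_otu(identity_to_mock, is_chimera=False, in_reference_db=False):
    """Assign an OTU sequence to one of the mock-test categories.

    identity_to_mock: fractional identity to the closest known biological sequence.
    is_chimera: True if the sequence was identified as a chimera of mock sequences.
    in_reference_db: True if the sequence is found in a large reference database.
    """
    if identity_to_mock == 1.0:
        return "Perfect"
    if identity_to_mock >= 0.99:
        return "Good"
    if identity_to_mock >= 0.97:
        return "Noisy"
    # Below 97%: more than 3% away from every biological sequence.
    if is_chimera:
        return "Chimera"
    if in_reference_db:
        return "Contaminant"
    return "Other"

for args in [(1.0,), (0.995,), (0.98,), (0.90, True), (0.90, False, True), (0.90,)]:
    print(args, classify_otu(*args))
```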

HMP mock results Edgar Nat. Meth. (2013)

Denoiser results, "Extreme" mock
Edgar 2016 (UNOISE2 paper)
5 / 27 ≈ 19% contaminants

Spurious OTUs in mock vs. real
Mock data has low diversity, ~10x to 100x lower than typical real samples
Are bad test results an artifact of low diversity?
I believe not -- two arguments

Argument #1
Each subset of 20 species is like adding another mock sample with lower abundance.
Make, say, 10,000 reads of these species
Similar rates of errors due to PCR & sequencing
Similar number of spurious OTUs from these 20 species regardless of what else is sequenced

Argument #2
Each new read is good or bad
Prob(good), Prob(bad) roughly constant
Prob(creates new bad OTU) roughly constant
Low, but we have millions of reads!
Unlikely to reproduce existing bad OTU
If errors random, bad reads are different
Number of spurious OTUs = Prob(creates new bad OTU) x (number of reads) = constant x (number of reads)
Diversity irrelevant
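The linear relationship in Argument #2 can be illustrated with a small simulation. This is a sketch under the argument's own assumptions: each read independently creates a new spurious OTU with a fixed probability, and no two bad reads reproduce the same spurious OTU. The probability value 0.001 and the function name are arbitrary choices for illustration.

```python
import random

def simulate_spurious_otus(n_reads, p_new_bad_otu, seed=0):
    """Count reads that create a new spurious OTU, one Bernoulli trial per read."""
    rng = random.Random(seed)
    return sum(rng.random() < p_new_bad_otu for _ in range(n_reads))

# The count grows roughly in proportion to read depth; community diversity never enters.
for n_reads in (10_000, 100_000, 1_000_000):
    print(n_reads, simulate_spurious_otus(n_reads, 0.001))
```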

Stable OTUs
Stable: a given sequence will always be assigned to the same OTU.
Unstable: a given sequence may be assigned to different OTUs depending on which other sequences are present.
OTUs made by clustering are usually unstable by this definition.
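Instability is easy to reproduce with a toy greedy clusterer. The sketch below assumes equal-length, pre-aligned sequences and measures identity as the fraction of matching positions; it processes sequences in input order and is a deliberately simplified stand-in, not the CD-HIT, UCLUST or UPARSE algorithm.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    """Assign each sequence to the first existing centroid within the threshold,
    otherwise make it a new centroid. Returns a dict: sequence -> centroid."""
    centroids = []
    assignment = {}
    for s in seqs:
        for c in centroids:
            if identity(s, c) >= threshold:
                assignment[s] = c
                break
        else:
            centroids.append(s)
            assignment[s] = s
    return assignment

X = "A" * 100
Y = "A" * 98 + "CC"    # 98% identical to X and to Z
Z = "A" * 96 + "CCCC"  # 96% identical to X

# The same sequence Y lands in a different OTU depending on what else is present.
print(greedy_cluster([X, Y])[Y] == X)
print(greedy_cluster([Z, Y])[Y] == Z)
```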

Stable OTUs
Stable OTUs enable comparison across different experiments without re-clustering, but only if the same segment is sequenced, e.g. V4
Desirable attribute
But, if most OTUs are errors, then dealing with spurious OTUs is more important!

Stable OTUs

Algorithm                         Stable?
Denoising: UNOISE2, DADA2         Yes
Closed-reference                  Yes
Open-reference                    No
Greedy: CD-HIT, UCLUST, UPARSE    No
Agglomerative: UPGMA etc.         No
SWARM                             No

Closed-reference OTUs
QIIME script pick_closed_reference_otus.py
Greengenes clustered at 97% (GG97), ~100k sequences
Each sequence in GG97 defines one OTU
Algorithm:
Reads searched against GG97
Read assigned to OTU if >97% similar
Otherwise discarded ("fail")
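The assignment rule can be sketched as follows. This toy version assumes equal-length, pre-aligned sequences and a tiny in-memory reference, and it assigns each read to its most similar reference sequence; the QIIME script itself relies on a sequence search tool against the full database, so the code below illustrates the logic only, and all names in it are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its most similar reference OTU if identity > threshold.

    reference maps OTU identifier -> reference sequence.
    Returns (assigned, failed): assigned maps OTU identifier -> list of reads,
    failed lists the reads that matched no reference sequence closely enough.
    """
    assigned, failed = {}, []
    for read in reads:
        best_identity, best_otu = max(
            (identity(read, seq), otu) for otu, seq in reference.items()
        )
        if best_identity > threshold:
            assigned.setdefault(best_otu, []).append(read)
        else:
            failed.append(read)
    return assigned, failed

toy_reference = {"OTU_1": "A" * 100, "OTU_2": "C" * 100}
close_read = "A" * 98 + "GG"       # 98% identical to OTU_1
distant_read = "A" * 90 + "G" * 10  # 90% identical to OTU_1
print(closed_reference([close_read, distant_read], toy_reference))
```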

Claimed closed-ref advantages
Fast
Stable
Compare different hypervariable regions
Assign accurate taxonomy

Closed-ref. mock tests

Dataset               Strains  OTUs
Bokulich et al. 2013  22       955
Kozich et al. 2013    21       5,839
Extreme               26       343

Strains are split over many OTUs when there are sequence errors.

Different V region test, HMP mock
Idealized test using known mock sequences as input.
Models best-case scenario with no sequence errors, e.g. good denoiser.
Table entries are GG97 OTU identifiers assigned to each species.
Edgar 2017 (submitted)

Closed-reference
Idealized test with perfect sequences
Strains often split over several OTUs
Some strains fail
All present in Greengenes
Fail because full-length >97% but segment <97%
Different segments assigned to different OTUs
Cannot compare different hypervariable regions

Closed-reference taxonomy

Dataset               Sensitivity  Errors
Bokulich et al. 2013  95%          61%
Kozich et al. 2013    84%          93%
Extreme               55%          46%

Sensitivity = fraction of all mock genus names which were predicted.
Errors = fraction of all predicted genus names not in mock (could be contaminants / cross-talk).
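Both metrics are simple set ratios over genus names. A minimal sketch follows, assuming the mock and predicted genera are available as collections of names; the function name and the example genera are illustrative and are not taken from the datasets in the table.

```python
def genus_metrics(mock_genera, predicted_genera):
    """Return (sensitivity, errors) as defined on the slide.

    sensitivity: fraction of mock genus names that were predicted.
    errors: fraction of predicted genus names that are not in the mock.
    """
    mock, predicted = set(mock_genera), set(predicted_genera)
    sensitivity = len(mock & predicted) / len(mock)
    errors = len(predicted - mock) / len(predicted) if predicted else 0.0
    return sensitivity, errors

toy_mock = {"Escherichia", "Bacillus", "Staphylococcus", "Listeria"}
toy_predicted = {"Escherichia", "Bacillus", "Staphylococcus", "Pseudomonas", "Shigella"}
print(genus_metrics(toy_mock, toy_predicted))
```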

Open-reference clustering
QIIME recommended method
First, closed-reference
Second, cluster the "fails" with UCLUST
Discard singletons (OTUs with only one sequence)

Open-ref. mock tests

Dataset               Strains  Closed  Open
Bokulich et al. 2013  22       955     4,482
Kozich et al. 2013    21       5,839   10,217
Extreme               26       343     298
