
1 OTUs
STAMPS 2017. Robert Edgar, independent scientist.

2 Unidentified Taxonomic Objects
Defining, interpreting and assessing next-generation OTUs

3 "Standard" OTU: cluster of reads with 97% identity
Assume that, approximately, OTUs ~ species and read abundance ~ species abundance. WRONG!

4 Mock community test: 100k reads of 22 strains (bacteria + archaea)
Method               OTUs
UPARSE               22
QIIME (open-ref)     1,646
mothur (OptiClust)   30,984

5 Try it yourself!
Data from Bokulich et al., Nat. Meth. 2013. After STAMPS -- or in a lab -- try this data with the software package you plan to use. Runs with free USEARCH.

6 QIIME alpha diversity analysis (rarefaction)
[Figure: rarefaction plot; 22 marks the number of strains in the mock community.]

7 Almost all QIIME and mothur OTUs are errors.

8 What are OTUs?
Operational Taxonomic Unit: Sokal & Sneath in the 1960s. Numerical taxonomy: matrix of characters; make tree by UPGMA (1958). Not the "right" method any more! It predates sequence data and modern phylogenetic tree algorithms (Neighbor-Joining, Maximum Likelihood etc.).

9 Bacterial "species"
No sexual reproduction. Every mutation = bifurcation: daughter cells have the SNP, siblings don't (ignoring lateral gene transfer). Single-cell genome differences: 1, 2, many ... LUCA. Species concept is problematic. Strain = few diffs: a bit fuzzy, but better than "species".

10 Operational definition
Arbitrary similarity threshold for species (Wayne et al. 1987): two cultures with similar phenotype and DNA-DNA reassociation >70% = same species.

11 Magic number 97
97% 16S identity ~ 70% DNA-DNA reassociation (Stackebrandt and Goebel 1994). Modern value 98.5%?

12 From pair-wise to clustering
FastGroup: clustering software for 16S (Seguritan and Rohwer 2001). Used the 97% threshold, citing Stackebrandt & Goebel. Propagated to DOTUR, mothur, QIIME...

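13 From pair-wise to clustering
Ideal 97% clustering: all pairs in the same cluster are >97% identical; all pairs in different clusters are <97% identical.
[Figure: OTU_1 and OTU_2, with within-cluster pairs > 97% and between-cluster pairs < 97%.]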

14 97% rule doesn't work
By the 97% rule: A+B same OTU; B+C same OTU; A+C different OTU ... oops!
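The A, B, C situation can be reproduced with a toy example. This is a minimal sketch, not taken from the talk: the three 100 nt sequences are invented for illustration, and `identity` is a hypothetical helper that assumes equal-length, already-aligned sequences.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences, invented for illustration only.
A = "A" * 100
B = "CC" + "A" * 98           # 2 differences from A
C = "CC" + "A" * 96 + "GG"    # 2 differences from B, 4 from A

THRESHOLD = 0.97


def same_otu(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Pair-wise 97% rule: the two sequences belong together if identity >= threshold."""
    return identity(a, b) >= threshold


for name, (x, y) in {"A+B": (A, B), "B+C": (B, C), "A+C": (A, C)}.items():
    print(name, identity(x, y), "same OTU" if same_otu(x, y) else "different OTU")
```

Because pair-wise identity is not transitive, no partition of A, B and C into clusters can satisfy the rule for all three pairs at once.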

15 Common in practice. Example: Lactobacillus
Lacto. species are often >97% identical to each other, and phenotype can vary within a 97% OTU. V4 sequences for 37 Lacto. species contain 47 "impossible" triplets.

16 No "right" answer for 97% OTUs
Agglomerative clustering: single-, complete- and average-linkage (UPGMA). OptiClust (Westcott & Schloss 2017). Greedy clustering (not designed for OTUs!): CD-HIT, UCLUST. Closed- and open-reference (QIIME). SWARM. UPARSE.

17 Can 97% clusters define taxa?
The OptiClust paper (Westcott & Schloss 2017) seems to claim that the 97% rule defines taxa. The Matthews Correlation Coefficient (MCC) metric measures OTU clustering "accuracy"; OptiClust attempts to maximize MCC.

18 Matthews Correlation Coeff.
TP = number of pairs in the same cluster which have ≥97% identity
FN = number of pairs in different clusters which have ≥97% identity
TN = number of pairs in different clusters which have <97% identity
FP = number of pairs in the same cluster which have <97% identity
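The four pair counts above feed the standard MCC formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is illustrative only: the sequences, the cluster labels and the equal-length `identity` helper are assumptions made for this example, not the mothur implementation.

```python
from itertools import combinations
from math import sqrt


def identity(a: str, b: str) -> float:
    """Toy identity for equal-length, pre-aligned sequences (an assumption)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pair_counts(seqs, labels, threshold=0.97):
    """Return (TP, FN, TN, FP) over all sequence pairs, as defined on the slide."""
    tp = fn = tn = fp = 0
    for i, j in combinations(range(len(seqs)), 2):
        similar = identity(seqs[i], seqs[j]) >= threshold
        together = labels[i] == labels[j]
        if together and similar:
            tp += 1
        elif not together and similar:
            fn += 1
        elif not together and not similar:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def mcc(tp, fn, tn, fp):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical data: A-B and B-C are 98% identical, A-C is 96%, D is unrelated.
seqs = ["A" * 100, "CC" + "A" * 98, "CC" + "A" * 96 + "GG", "T" * 100]
lump = [0, 0, 0, 1]    # A, B, C in one cluster
split = [0, 0, 1, 2]   # A and B together, C alone

print("lump ", pair_counts(seqs, lump), mcc(*pair_counts(seqs, lump)))
print("split", pair_counts(seqs, split), mcc(*pair_counts(seqs, split)))
```

Neither clustering of the triplet reaches MCC = 1, since either A+C is a false positive or one of the ≥97% pairs is a false negative.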

19 Lactobacillus OTUs: five possible sets of clusters
Reasonable: 1 (lump), 2 (split), 3 (merge closest). Unreasonable: 4, 5.

20 MCC doesn't work!
[Figure: cluster sets 1 to 5.]

21 97% OTUs reasonable if sequence errors are not important...
But PCR + NGS cause a huge diversity of errors!

22 Errors trump clustering
Unfiltered errors are a huge problem. Filtering errors is much more important than the clustering method.

23 Tolstoy's paradox
Even if most bases are good, most unique sequences are bad, because all good reads are alike, but every bad read is bad in its own way.

24 Toy Tolstoy
Read length 250 nt. Very high quality: every base has Perror = 10^-3 = 1/1000 (Phred Q30; Q40 would be 1/10,000). One letter in a thousand is wrong. L = 250, so 1,000 letters = four reads. On average, about one in four reads has a bad base.

25 Toy Tolstoy
Make 100 reads of one template, say, E. coli. 1/4 of reads are bad: 75 correct reads, 25 bad reads. The bad reads are all different, so there are 26 uniques, of which 1 is good. 25/26 = 96% bad sequences!
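The toy arithmetic can be checked directly. This is a minimal sketch under the toy's own assumptions: a per-base error probability of 1/1000 (which corresponds to Phred Q30), independent errors, and every bad read carrying a different error. The function names are hypothetical.

```python
def phred_to_p(q: float) -> float:
    """Phred quality score to error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def frac_reads_with_error(p_error: float, read_len: int) -> float:
    """Probability that a read has at least one bad base, assuming independent errors."""
    return 1.0 - (1.0 - p_error) ** read_len


def toy_uniques(n_reads: int, n_bad: int):
    """Unique-sequence count if all good reads are identical and all bad reads differ."""
    n_good = n_reads - n_bad
    n_uniques = n_bad + (1 if n_good > 0 else 0)
    frac_bad = n_bad / n_uniques if n_uniques else 0.0
    return n_uniques, frac_bad


print(frac_reads_with_error(1 / 1000, 250))   # roughly one read in four
print(toy_uniques(100, 25))                   # 26 uniques, 25 of them bad
```

The exact probability of at least one error in a 250 nt read is slightly below 1/4, because a few reads carry more than one error.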

26 Tolstoy's paradox in practice
Uniformly high quality is unrealistic: we have to deal with lower-quality reads. The vast majority of uniques have errors. Most bad reads are harmless: one bad base, probably clustered into the correct OTU. Reads with >3% errors are always harmful; reads with <3% errors are sometimes harmful. Harmful = causes a spurious OTU.

27 Tail of error distribution
Average base error rate is misleading.
[Figure: probability of causing a spurious OTU. From Edgar & Flyvbjerg 2014.]

28 Good reads ... bad OTUs
0.1% bad letters, 75% good reads, 95% bad sequences: many bad OTUs! Spurious OTUs are created by reads with >3% errors.
[Figure labels: good OTU, good seq.]

29 Paralogs
Prokaryotes often have multiple SSU operons; the mean is ~four copies. E. coli has seven 16S genes. Sequences can vary.

30 PCR chimeras
Chimeras are made from two templates. They look like valid sequences, often have very few diffs, and are difficult to detect. Unfiltered chimeras cause spurious OTUs.

31 Why make 97% clusters?
Merge strains from one species: but who cares about the vague notion of "species"? Strains may have different phenotypes. Merge paralogs from one strain. Merge bad reads with the correct sequence.

32 Mothur OTUs
Definition and objectives not clear. OTUs are clusters, with no special sequence. Doesn't distinguish good / bad sequences: recall Tolstoy's paradox -- almost all sequences are bad! Large majority of OTUs are spurious. Inadequate filtering of errors.

33 QIIME OTUs
Definition and objectives not clear. Closed-reference and open-reference; more later. An OTU is represented by one sequence; member (non-representative) sequences are discarded. Large majority of OTUs are spurious. Inadequate filtering of errors. Closed-reference often splits species over multiple OTUs.

34 SWARM OTUs
Definition and objectives not clear. OTUs are "defined" by the output of an algorithm; the complicated connection to biology is not explained. OTU = representative sequence; other sequences are discarded. Works quite well for smaller read depths.

35 UPARSE OTUs
Goal: find a subset of correct biological sequences such that no pair is >97% identical. An OTU is a sequence: the most abundant in its neighborhood (3%), almost always a correct biological sequence. Reads are assigned to the most similar OTU (>97%), not the most abundant, because identity is a better signal of same strain or species. Will describe the algorithm later.

36 Denoiser OTUs / ZOTUs / SVs
UNOISE2, DADA2 algorithms. Goal: report all correct biological sequences. ZOTU = "zero-radius OTU" = 100% identity OTU. SV = sequence variant. A 100% OTU is just as useful / valid as a 97% OTU. Less sensitive to very low-abundance strains; combine samples to boost abundance.

37 Lumping / splitting
Assume error-free sequences. All types of OTU split & lump sometimes: ZOTUs split more often, 97% OTUs lump more often. ZOTUs have better resolution and separate more distinct phenotypes. Splitting of a strain over two ZOTUs is a benign problem, better than merging two phenotypes into one 97% OTU.

38 Denoise, then cluster?
Denoising = all correct biological sequences; ideally, it removes all error from the data. If it works (it does...), then denoising should be the first step in any analysis! Example: denoise, then make 97% OTUs. But why throw away information? Strains are useful: ~same genome & same phenotype. Species is not a very useful concept.

39 Validating OTUs
Need a definition / goal for OTUs in order to verify / falsify, classify as correct / incorrect, and measure quality as good / mediocre / bad. Need test data with known composition: mock communities, i.e. a mix of known strains (typically ~20, up to ~100).

40 Westcott & Schloss 2017 mSphere
Use MCC to measure OTU "accuracy". Assume MCC is obvious / uncontroversial, but... MCC doesn't work; UPARSE is better in my opinion... so MCC is controversial as a gold standard. Don't test on mock data. Test data very noisy due to poor filtering: 31,000 OTUs from 22 strains!?

41 Bokulich et al. 2013 Nat. Methods

42 Bokulich et al.: read quality filtering
Reject reads with runs of bases with Q≤3 (Perror > 0.5). Make OTUs with UCLUST. Discard OTUs with abundance < c (0.005%).

43 Bokulich et al.
Richness = number of RDP OTUs (same genus), not UCLUST OTUs! A UCLUST OTU is discarded if the RDP Classifier does not predict a genus: about 50% of OTUs. This is an unstated filtering step, not used in practice.

44 Bokulich et al.
Claim that richness is good because the number of RDP OTUs is ~ number of strains. Nonsense -- doesn't work on real data! Most genera don't have names, so richness by RDP OTUs is vastly under-estimated. There are thousands of UCLUST OTUs, so richness by UCLUST OTUs is vastly over-estimated.

45 Bokulich et al.: how to measure richness in practice?
No answer in the paper. Default QIIME filtering is not Bokulich et al.'s: the OTU abundance threshold (c) is not used.

46 Bokulich et al. True -- and their own filtering strategy has this exact problem! Low-abundance reads are discarded with low-abundance OTUs (their c parameter).

47 Mock testing by richness
Number of OTUs ~ number of strains. Risk of over-fitting parameters. Right number for the wrong reason (example: Kozich et al. 2013): cross-talk, contaminants.

48 Mock testing by OTU sequence
Category      Description
Perfect       100% identical to biological sequence.
Good          ≥99% identical to biological sequence. Could be chimera and / or noisy read.
Noisy         ≥97% identical to biological sequence. Could be chimera and / or noisy read.
Chimera       "Bad" chimera >3% from biological sequence.
Contaminant   Sequence found in large ref. db. (SILVA).
Other         None of the above. Could be a novel contaminant, or -- much more likely -- have >3% errors.
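The categories above amount to a short decision procedure. The sketch below is one possible reading of the table, with assumptions: `identity` is the fractional identity to the closest known mock sequence, and the flags `is_chimera` and `in_reference_db` are hypothetical inputs that would come from a separate chimera check and a search against a large reference database such as SILVA.

```python
def classify_otu(identity: float, is_chimera: bool = False,
                 in_reference_db: bool = False) -> str:
    """Assign a mock-test category to one OTU sequence.

    identity        -- identity (0..1) to the closest known biological sequence
    is_chimera      -- hypothetical flag: sequence is a chimera of mock sequences
    in_reference_db -- hypothetical flag: sequence found in a large reference db
    """
    if identity >= 1.0:
        return "Perfect"
    if identity >= 0.99:
        return "Good"
    if identity >= 0.97:
        return "Noisy"
    # Below 97%: more than 3% away from every known biological sequence.
    if is_chimera:
        return "Chimera"
    if in_reference_db:
        return "Contaminant"
    return "Other"


for args in [(1.0,), (0.995,), (0.98,), (0.90, True), (0.80, False, True), (0.85,)]:
    print(args, classify_otu(*args))
```

Identity is tested first, so a chimera within 3% of a real sequence is counted as Good or Noisy, matching the "could be chimera" notes in the table.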

49 HMP mock results Edgar Nat. Meth. (2013)

50 Denoiser results, "Extreme" mock
Edgar 2016 (UNOISE2 paper). 5 / 27 ≈ 19% contaminants.

51 Spurious OTUs in mock vs. real
Mock data has low diversity, ~10x to 100x lower than typical real samples. Are bad test results an artifact of low diversity? I believe not -- two arguments.

52 Argument #1
Each subset of 20 species is like adding another mock sample with lower abundance. Make, say, 10,000 reads of these species: there are similar rates of errors due to PCR & sequencing, so a similar number of spurious OTUs arises from these 20 species regardless of what else is sequenced.

53 Argument #2
Each new read is good or bad. Prob(good) and Prob(bad) are roughly constant, so Prob(creates new bad OTU) is roughly constant. It is low, but we have millions of reads! A bad read is unlikely to reproduce an existing bad OTU: if errors are random, bad reads are different.
Number of spurious OTUs = Prob(creates new bad OTU) x (number of reads) = constant x (number of reads).
Diversity is irrelevant.

54 Stable OTUs
Stable: a given sequence will always be assigned to the same OTU.
Unstable: a given sequence may be assigned to different OTUs depending on which other sequences are present. OTUs made by clustering are usually unstable by this definition.

55 Stable OTUs
Stable OTUs enable comparison across different experiments without re-clustering, but only if the same segment is sequenced, e.g. V4. A desirable attribute. But if most OTUs are errors, then dealing with spurious OTUs is more important!

56 Stable OTUs
Algorithm                          Stable?
Denoising: UNOISE2, DADA2          Yes
Closed-reference                   Yes
Open-reference                     No
Greedy: CD-HIT, UCLUST, UPARSE     No
Agglomerative: UPGMA etc.          No
SWARM                              No

57 Closed-reference OTUs
QIIME script pick_closed_reference_otus.py. Greengenes clustered at 97% (GG97), ~100k sequences. Each sequence in GG97 defines one OTU.
Algorithm: reads are searched against GG97. A read is assigned to an OTU if >97% similar; otherwise it is discarded ("fail").
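The closed-reference procedure can be sketched in a few lines. This is a toy illustration of the logic only, not the QIIME implementation: the function name, the tiny reference dictionary standing in for GG97, and the equal-length `identity` helper are all assumptions (the real script relies on a sequence search tool to find hits).

```python
def identity(a: str, b: str) -> float:
    """Toy identity for equal-length, pre-aligned sequences (an assumption)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its most similar reference OTU, or mark it as a fail.

    reads     -- list of read sequences
    reference -- dict mapping OTU identifier -> reference sequence
    Returns (assignments, fails): read index -> OTU id, and a list of failed indices.
    """
    assignments, fails = {}, []
    for idx, read in enumerate(reads):
        best_otu, best_identity = None, 0.0
        for otu_id, ref_seq in reference.items():
            ident = identity(read, ref_seq)
            if ident > best_identity:
                best_otu, best_identity = otu_id, ident
        if best_otu is not None and best_identity > threshold:
            assignments[idx] = best_otu
        else:
            fails.append(idx)   # discarded in closed-reference mode
    return assignments, fails


# Hypothetical reference and reads, invented for illustration.
reference = {"otu1": "A" * 100, "otu2": "C" * 100}
reads = ["A" * 98 + "GG", "C" * 100, "G" * 100]
print(closed_reference(reads, reference))
```

In this sketch the set of OTUs is fixed by the reference, so a read's assignment does not depend on which other reads are present.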

58 Claimed closed-ref advantages
Fast. Stable. Compare different hypervariable regions. Assign accurate taxonomy.

59 Closed-ref. mock tests
Dataset                Strains   OTUs
Bokulich et al. 2013   22        955
Kozich et al. 2013     21        5,839
Extreme                26        343
Strains are split over many OTUs when there are sequence errors.

60 Different V region test, HMP mock
Idealized test using known mock sequences as input. Models best-case scenario with no sequence errors, e.g. good denoiser. Table entries are GG97 OTU identifiers assigned to each species. Edgar 2017 (submitted)

61 Closed-reference: idealized test with perfect sequences
Strains are often split over several OTUs. Some strains fail, although all are present in Greengenes: they fail because the full-length sequence is >97% but the segment is <97%. Different segments are assigned to different OTUs, so different hypervariable regions cannot be compared.

62 Closed-reference taxonomy
Dataset                Sensitivity   Errors
Bokulich et al. 2013   95%           61%
Kozich et al. 2013     84%           93%
Extreme                55%           46%
Sensitivity = fraction of all mock genus names which were predicted. Errors = fraction of all predicted genus names not in mock (could be contaminants / cross-talk).

63 Open-reference clustering
QIIME recommended method. First, closed-reference. Second, cluster the "fails" with UCLUST. Discard singletons (OTUs with only one sequence).

64 Open-ref. mock tests
Dataset                Strains   Closed   Open
Bokulich et al. 2013   22        955      4,482
Kozich et al. 2013     21        5,839    10,217
Extreme                26        343      298

