Bacterial chromosome 16S rRNA gene   Primers 16S rRNA gene segments PCR Sequencing Sample with bacteria.

Presentation transcript:

1 [Diagram: sample with bacteria; bacterial chromosome carrying the 16S rRNA gene; primers; PCR; 16S rRNA gene segments; sequencing]

2 [Diagram: sample with bacteria; bacterial chromosome carrying the 16S rRNA gene; primers; PCR; 16S rRNA gene segments; sequencing. Amplified segments: biological sequences, plus chimeric artifacts formed from ≥2 biological sequences during PCR]

3 [Diagram: biological sequences and chimeric artifacts (formed from ≥2 biological sequences during PCR); clustering; biological OTUs and chimeric OTUs]

4 From Haas et al. (2011) Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons, Genome Research.

5
• Usually in region of high sequence similarity
• Similarity usually due to homology
• But not always!
• Non-homologous cross-over
• Chimera formed from single parent
• Looks like deletion or tandem duplication

6
• Proportional to parent abundance
• Proportional to sequence similarity
P(Parent1, Parent2) ∝ abundance(Parent1) × abundance(Parent2) × sequence_similarity(Parent1, Parent2)
(∝ = proportional to; cross-over at matching k-mer)
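The proportionality on this slide can be sketched in a few lines. This is only an illustration: `identity` (fraction of identical columns in pre-aligned sequences) is a stand-in for the slide's sequence_similarity term, the normalization to 1 is added for readability, and the sequences and abundances are made up.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Fraction of identical columns between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pair_weights(amplicons: dict[str, tuple[str, int]]) -> dict[tuple[str, str], float]:
    """Relative weight of each unordered parent pair.

    amplicons maps a name to (aligned sequence, abundance). The raw weight is
    abundance1 * abundance2 * identity; weights are normalized to sum to 1.
    """
    raw = {
        (n1, n2): a1 * a2 * identity(s1, s2)
        for (n1, (s1, a1)), (n2, (s2, a2)) in combinations(amplicons.items(), 2)
    }
    total = sum(raw.values())
    if total == 0:
        # No pair shares any column (or fewer than two amplicons were given).
        return {pair: 0.0 for pair in raw}
    return {pair: w / total for pair, w in raw.items()}


# Toy input: hypothetical names, sequences and abundances.
toy = {
    "X": ("ACGTACGT", 100),
    "Y": ("ACGTACGA", 50),
    "Z": ("TTTTACGT", 10),
}
weights = pair_weights(toy)
for pair, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, f"{w:.3f}")
```

In this toy input the two most abundant and most similar sequences dominate, which is the point the slide makes: abundant, closely related parents are the likeliest to recombine.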

7 [Diagram labels: Bimera (two segments); Trimera (three segments); Trimera (three segs., two parents)]

8 From Lahr and Katz (2009), Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase, Biotechniques. [Figure labels: 2-meras; 3-meras; 4-meras; = (nr segments – 1)]

9
• Homologous cross-over
• Chimeras look like biological sequences
• Often align well to reference sequences
• How to distinguish?
• Next-gen read 100 – 400 nt
• 3% divergence = 3 – 12 diffs
• Small amount of evidence
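The "3 – 12 diffs" figure is simple arithmetic: read length times divergence. A minimal check, with a hypothetical helper name:

```python
def expected_diffs(read_length_nt: int, divergence: float) -> float:
    """Expected number of differing positions at a given fractional divergence."""
    return read_length_nt * divergence


# Read lengths and divergence taken from the slide.
for length in (100, 400):
    print(length, "nt:", round(expected_diffs(length, 0.03)), "diffs")
```

With so few informative positions per read, each individual difference carries a lot of weight in any chimera call.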

10
• Reference database
  ▪ Match segments to known parents
• De novo
  ▪ Find chimeric alignments (A-B-C)
  ▪ Chimera is least abundant
• UCHIME: Edgar et al. (2011), UCHIME improves sensitivity and speed of chimera detection, Bioinformatics.

11
• Haas et al. 2011
• Find 50-mers unique to single genus
• Chimera if 50-mers indicate > 1 genus
• Low sensitivity, genus level only

12
• Ashelford et al. 2005, At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies, Appl Environ Microbiol 71: 7724–7736
• Find closest reference sequence(s)
• Measure divergence in sliding window (300 nt)
• Compare with avg. variability in 16S gene
• Conserved vs. variable regions
• Anomaly (chimera) if variability far from avg.
• Newer algorithms work better

13
• ChimeraSlayer
  ▪ Haas et al. 2011
  ▪ Similar to UCHIME reference database mode
• Perseus
  ▪ Quince et al. 2011, Removing Noise From Pyrosequenced Amplicons, BMC Bioinformatics.
  ▪ Similar to UCHIME de novo mode

14

15 [Diagram: query; split into four chunks; search database; save top hits (A, A, B, B); find & align closest pair (A, B) to the query]
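The chunk-and-search step in this diagram can be sketched as follows. This is not the USEARCH search: shared k-mer counting is used here purely as a stand-in for the database search, and all names, parameters (`k`, `top`) and sequences are hypothetical.

```python
def split_chunks(seq: str, n: int = 4) -> list[str]:
    """Split seq into n contiguous, non-overlapping chunks of near-equal length."""
    bounds = [round(i * len(seq) / n) for i in range(n + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n)]


def kmers(seq: str, k: int) -> set[str]:
    """All distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_parents(query: str, db: dict[str, str],
                      n_chunks: int = 4, top: int = 2, k: int = 8) -> list[str]:
    """Collect the best database hits for each chunk of the query.

    A 'hit' is a database sequence sharing at least one k-mer with the chunk;
    hits are ranked by the number of shared k-mers (a stand-in for a real search).
    """
    db_kmers = {name: kmers(s, k) for name, s in db.items()}
    hits: list[str] = []
    for chunk in split_chunks(query, n_chunks):
        chunk_kmers = kmers(chunk, k)
        ranked = sorted(db, key=lambda name: len(chunk_kmers & db_kmers[name]),
                        reverse=True)
        for name in ranked[:top]:
            if chunk_kmers & db_kmers[name] and name not in hits:
                hits.append(name)
    return hits


# Toy database with made-up sequences; the query is half A, half B.
db = {
    "A": "ACGTTGCAAGGCTTAACCGGATATCGCGTATTAGCCATGG",
    "B": "TTTTGGGGCCCCAAAATGTGTCACACAGAGTCTCGAGACT",
    "C": "G" * 40,
}
query = db["A"][:20] + db["B"][20:]
print(candidate_parents(query, db))
```

Searching chunks rather than the whole query matters because a chimera's best whole-sequence hit may be only one of its parents; per-chunk hits give each parent a chance to surface.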

16
• User provides reference database
  ▪ Should be high-quality sequences
  ▪ Believed to be chimera-free
• Advantages:
  ▪ High confidence in predictions
• Disadvantages:
  ▪ Expect high false-negative rate
  ▪ Ref DB usually doesn't cover all possible parents

17
• Parents amplified more than chimera
• At least one more round
• So parents at least 2x more abundant
• "Abundance skew" >= 2 (user-settable)
• Input is estimated amplicons + abundances
• NOT reads!

18
• Sort amplicons by decreasing abundance
• Start with empty DB
• For each amplicon:
• Search DB for parents with >= 2x abundance
• If chimeric hit:
  ▪ Classify as chimera and discard query
• If not chimeric hit:
  ▪ Add to reference DB
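The loop on this slide can be sketched as below. The chimera test itself is passed in as a function; `toy_is_bimera` is a deliberately naive stand-in (exact prefix-of-one plus suffix-of-another), not the UCHIME scoring, and the amplicons are made up.

```python
from typing import Callable

Amplicon = tuple[str, int]  # (sequence, abundance)


def denovo_filter(amplicons: list[Amplicon],
                  is_chimera_of: Callable[[str, list[str]], bool],
                  skew: float = 2.0) -> tuple[list[Amplicon], list[Amplicon]]:
    """Process amplicons by decreasing abundance, growing a chimera-free DB.

    Only DB entries at least `skew` times as abundant as the query are
    offered as candidate parents. Returns (kept, chimeras).
    """
    kept: list[Amplicon] = []
    chimeras: list[Amplicon] = []
    for seq, abundance in sorted(amplicons, key=lambda a: a[1], reverse=True):
        parents = [s for s, ab in kept if ab >= skew * abundance]
        if is_chimera_of(seq, parents):
            chimeras.append((seq, abundance))   # classify and discard query
        else:
            kept.append((seq, abundance))       # add to reference DB
    return kept, chimeras


def toy_is_bimera(query: str, parents: list[str]) -> bool:
    """Naive stand-in: query is an exact prefix of one parent joined to a suffix of another."""
    for a in parents:
        for b in parents:
            if a == b or len(a) != len(query) or len(b) != len(query):
                continue
            if query in (a, b):
                continue
            if any(query == a[:x] + b[x:] for x in range(1, len(query))):
                return True
    return False


amplicons = [("AAAACCCC", 5), ("AAAAAAAA", 100), ("GGGGGGGG", 3), ("CCCCCCCC", 80)]
kept, chimeras = denovo_filter(amplicons, toy_is_bimera)
print("kept:", kept)
print("chimeras:", chimeras)
```

Because the database only ever contains sequences already judged non-chimeric and more abundant than the query, the order of processing does the work that a curated reference database does in the other mode.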

19 [Venn diagram: Ref DB vs. De novo. Hits found by both; hits found by ref DB only (rare?); hits found by de novo only (common?)]

20
• Two modes check each other
• De novo should have better coverage
• All parents should be present
• Should examine hits found by ref DB but not by de novo
• See UCHIME manual for more discussion.

21
A     81 CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc 160
Q     81 CCTTGGTAGGCCGCTGCCCTGCCAACTA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG 160
B     81 TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG 160
Diffs    A      A     p A   A      A                BBBB     BBB    BBBBB BB   BBa B  B BBB
Votes    Y      Y     A Y   Y      Y                YYYY     YYY    YYYYY YY   YYN Y  Y YYY
Model    AAAAAAAAAAAAAAAAAAAAAAAAAAAA xxxxxxxxxxxxx BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[Callouts: No vote; Abstain vote]

22
• Model = segment of A + segment of B
• Chimeric if model closer to Q than A or B
• Left closer to A and right closer to B
• Closer if Y > N
• Ratio Y/N > 1
[Alignment repeated from slide 21]
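The Votes row of the example alignment can be reproduced with a short sketch. The voting rules below are inferred from the Diffs and Votes rows shown on the slide (Q agrees with the model's parent and not the other: Yes; Q agrees with the other parent: No; Q differs from both: Abstain; all three agree: no vote). The function name and the choice of cross-over column are assumptions.

```python
def count_votes(a: str, q: str, b: str, crossover: int):
    """Tally (Yes, No, Abstain) separately for the left and right model segments.

    The model is A[:crossover] + B[crossover:]; comparison is case-insensitive.
    """
    if not (len(a) == len(q) == len(b)):
        raise ValueError("A, Q and B must be aligned to the same length")
    left = {"Y": 0, "N": 0, "A": 0}
    right = {"Y": 0, "N": 0, "A": 0}
    for i, (ca, cq, cb) in enumerate(zip(a.upper(), q.upper(), b.upper())):
        in_left = i < crossover
        tally = left if in_left else right
        if ca == cq == cb:
            continue                                  # no diff, no vote
        if cq == ca:                                  # Q agrees with A only
            tally["Y" if in_left else "N"] += 1
        elif cq == cb:                                # Q agrees with B only
            tally["N" if in_left else "Y"] += 1
        else:                                         # Q differs from both parents
            tally["A"] += 1
    return (tuple(left.values()), tuple(right.values()))


# Columns 81-160 of the slide's alignment, with the display spaces removed.
A = ("CCTTGGTAGGCCGtTGCCCTGCCAACTA" "GCTAATCAGACGC"
     "gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc")
Q = ("CCTTGGTAGGCCGCTGCCCTGCCAACTA" "GCTAATCAGACGC"
     "ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG")
B = ("TCTTGGTgGGCCGtTaCCCcGCCAACaA" "GCTAATCAGACGC"
     "ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG")

# Any cross-over inside the identical middle block (model 'x' region) gives the same votes.
left_votes, right_votes = count_votes(A, Q, B, crossover=28)
print("left  (Y, N, A):", left_votes)
print("right (Y, N, A):", right_votes)
```

Yes votes far outnumber No votes on both sides of the cross-over here, which is the pattern the slide describes as "left closer to A and right closer to B".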

23
H = Y / [β (N + n) + A]
Observations: Y = Yes votes, N = No votes, A = Abstain votes
Parameters: β = weight of No vote, n = prior number of No votes
Larger score = more likely to be chimera
Default: chimera if H_Left × H_Right ≥ 0.3
Threshold 0.3 adjusts sensitivity vs. false-positive rate.
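A minimal sketch of the score and the default decision rule. The vote counts are read off the Votes row of the example alignment on slide 21; the β and n values are illustrative placeholders, not taken from the slides, so consult the UCHIME manual for the actual defaults.

```python
def h_score(yes: int, no: int, abstain: int, beta: float, n: float) -> float:
    """H = Y / (beta * (N + n) + A) for one segment of the model."""
    return yes / (beta * (no + n) + abstain)


def is_chimera(left: tuple[int, int, int], right: tuple[int, int, int],
               beta: float, n: float, threshold: float = 0.3) -> bool:
    """Apply the slide's rule: chimera if H_left * H_right >= threshold."""
    return h_score(*left, beta, n) * h_score(*right, beta, n) >= threshold


# (Yes, No, Abstain) per segment, counted from the slide-21 Votes row.
left = (5, 0, 1)
right = (21, 1, 0)

# Placeholder parameter values, chosen only for illustration.
BETA, N_PRIOR = 8.0, 1.4

print("H_left  =", h_score(*left, BETA, N_PRIOR))
print("H_right =", h_score(*right, BETA, N_PRIOR))
print("chimera?", is_chimera(left, right, BETA, N_PRIOR))
```

The prior n keeps the denominator positive when there are no No votes, and β > 1 makes each No vote cost more than a Yes vote earns. Multiplying the two segment scores means a strong signal on one side cannot compensate for none on the other.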

24
• Real communities
  ▪ Don't know all the biological sequences
  ▪ Can't distinguish chimeras from real (that's the problem!)
• Mock community
  ▪ Do you really know all the 16S sequences -- no!
  ▪ Communities too "easy"
  ▪ Too few biological sequences, too well separated
• Simulation
  ▪ How realistic is it -- we don't know.
• No definitive validation

25

26 Length 300 simulated bimeras with 0 - 5% mutations.

27 Length 300 simulated multimeras with 1% substitutions.

28 Noisy regions align well enough

29

30

31

32

33
• Open-source version
  ▪ Source code donated to public domain
• USEARCH version
  ▪ Leverages proprietary algorithms
  ▪ 10x or more faster than open-source version

34
• 100x faster than Perseus
• 1,000x faster than ChimeraSlayer
• USEARCH version 10x or more faster again.

35
• Many subtle issues
• Read manual & Supp. Mat. carefully!

36
• Database incomplete
• Missing species
• Missing paralogs
• 16S duplications are common
• Probably high rates of false negatives

37
• Parents should be present
• Probably low false negative rate vs. ref db.
• False positive rate not well known
• Mock community validation may be optimistic
• Error correction required
• Input MUST be amplicons & abundances
• Usually means starting from raw reads
• Cannot use on processed seqs. (e.g. RDP)

38
• Convergent evolution in different clades
• Different rates in different regions
• Biological chimeras
• Bad sequences
• Bad alignments

39 [Figure labels: Bad A; Good A; Errors]

40
• Full-length gene
• Shotgun fragments
• Paired-end reads
• Reference database method
• De novo mode not possible with shotgun

41
• Screen each end separately
• Using standard UCHIME in ref db. mode
• For each end E1, E2
• Find closest parent P1, P2

42 [Diagram: ends E1 and E2 placed on parents P1 and P2; gap d1 estimated using P1, gap d2 estimated using P2] Pad gap with (d1 + d2)/2 Ns: E1 NNNN... E2

43
• UCHIME ref db. on padded pair.
• Ns don't count as diffs.
[Diagram: E1 NNNN... E2]
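The padding step and the "Ns don't count" rule can be sketched as follows. Integer division for (d1 + d2)/2, the assumption that both ends are already on the same strand, and the helper names are all choices made for this illustration; overlapping ends (negative gaps) are not handled.

```python
def pad_pair(e1: str, e2: str, d1: int, d2: int) -> str:
    """Join two read ends with (d1 + d2) // 2 Ns between them.

    d1 and d2 are the gap lengths estimated from parents P1 and P2.
    """
    if d1 < 0 or d2 < 0:
        raise ValueError("gap estimates must be non-negative")
    return e1 + "N" * ((d1 + d2) // 2) + e2


def count_diffs(query: str, reference: str) -> int:
    """Count mismatching columns of two aligned sequences, skipping columns with an N."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        q != r
        for q, r in zip(query.upper(), reference.upper())
        if "N" not in (q, r)
    )


# Made-up ends and gap estimates.
padded = pad_pair("ACGT", "TTGA", d1=10, d2=14)
print(padded)
print(count_diffs("ACNNGT", "ACTTGA"))
```

Padding keeps the two ends at roughly the right distance apart so the pair can be aligned to full-length references as a single query, while the Ns contribute no votes either way.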

44
• Ref db. mode can be run with any set of seqs.
• Later is more efficient (fewer seqs.)
• De novo mode no choice:
  ▪ Requires full set of amplicons & abundances
  ▪ Must follow denoise/error correction step
• De novo first
• Ref db. second
• Two modes can check each other

45
• Based on USEARCH/UCLUST/UCHIME
• Described in afternoon talk on clustering.

46 http://drive5.com/uchime

47

