Bacterial chromosome 16S rRNA gene   Primers 16S rRNA gene segments PCR Sequencing Sample with bacteria.

Sample with bacteria → bacterial chromosome (16S rRNA gene) → PCR with primers → amplified 16S rRNA gene segments (biological sequences + chimeric artifacts formed from ≥2 biological sequences during PCR) → sequencing

Clustering: biological sequences → biological OTUs; chimeric artifacts (formed from ≥2 biological sequences during PCR) → chimeric OTUs

From Haas et al. (2011) Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons, Genome Research.

 Usually in region of high sequence similarity  Similarity usually due to homology  But not always!  Non-homologous cross-over  Chimera formed from single parent  Looks like deletion or tandem duplication

 Proportional to parent abundance  Proportional to sequence similarity P(Parent1, Parent2) ∝ (proportional to) abundance(Parent1) × abundance(Parent2) × sequence_similarity(Parent1, Parent2) (cross-over at matching k-mer)
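
The proportionality above can be turned into relative probabilities by normalizing over all parent pairs. The following is a minimal sketch, not part of the original slides: the `identity` measure (fraction of identical aligned positions), the function names, and the toy sequences and abundances are all assumptions made for illustration.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Fraction of identical positions; assumes a and b are aligned, equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pair_probabilities(parents: dict) -> dict:
    """parents maps name -> (sequence, abundance).

    Weight of each pair = abundance1 * abundance2 * similarity,
    then normalized so the weights sum to 1.
    """
    weights = {}
    for (n1, (s1, a1)), (n2, (s2, a2)) in combinations(parents.items(), 2):
        weights[(n1, n2)] = a1 * a2 * identity(s1, s2)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Hypothetical toy community: two abundant, similar parents and one rare one.
parents = {
    "P1": ("ACGTACGT", 100),
    "P2": ("ACGTACGA", 50),
    "P3": ("TTTTACGT", 1),
}
probs = pair_probabilities(parents)
most_likely = max(probs, key=probs.get)
print(most_likely, round(probs[most_likely], 3))
```

Under this model, chimeras between the two abundant, similar parents dominate, which is why abundant parents are the first place to look.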

Bimera (two segments) Trimera (three segments) Trimera (three segs., two parents)

From Lahr and Katz (2009), Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase, Biotechniques. 2-meras 3-meras 4-meras = (nr segments – 1)

 Homologous cross-over  Chimeras look like biological sequences  Often align well to reference sequences  How to distinguish?  Next-gen read 100 – 400nt  3% divergence = 3 – 12 diffs  Small amount of evidence

 Reference database  Match segments to known parents  De novo  Find chimeric alignments (A-B-C)  Chimera is least abundant  UCHIME Edgar et al. (2011) UCHIME improves sensitivity and speed of chimera detection, Bioinformatics.

 Haas et al.  Find 50-mers unique to a single genus  Chimera if 50-mers indicate > 1 genus  Low sensitivity, genus level only

 Ashelford et al. (2005) At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol 71: 7724–7736.  Find closest reference sequence(s)  Measure divergence in sliding window (300nt)  Compare with avg. variability in 16S gene  Conserved vs. variable regions  Anomaly (chimera) if variability far from avg.  Newer algorithms work better

 ChimeraSlayer  Haas et al. (2011)  Similar to UCHIME reference database mode  Perseus  Quince et al. (2011) Removing Noise From Pyrosequenced Amplicons, BMC Bioinformatics.  Similar to UCHIME de novo mode

Query → split into four chunks → search database with each chunk → save top hits → find & align closest pair of hits (A, B) to the query
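
The chunk-and-search step can be sketched as follows. This is an illustration only: real UCHIME uses a proper database search, whereas the toy scoring here simply counts matching positions at the same offset and so assumes the query and references are pre-aligned and of equal length. The names `split_chunks` and `candidate_parents` are hypothetical.

```python
def split_chunks(query: str, n: int = 4) -> list:
    """Split query into n non-overlapping chunks of near-equal length."""
    size, extra = divmod(len(query), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(query[start:end])
        start = end
    return chunks

def candidate_parents(query: str, refs: dict, n: int = 4) -> set:
    """Return the name of the best-matching reference for each chunk.

    Toy scoring (an assumption): number of identical positions between the
    chunk and the same window of the reference.
    """
    hits, offset = set(), 0
    for chunk in split_chunks(query, n):
        scores = {
            name: sum(c == r for c, r in zip(chunk, seq[offset:offset + len(chunk)]))
            for name, seq in refs.items()
        }
        hits.add(max(scores, key=scores.get))
        offset += len(chunk)
    return hits

# Hypothetical references and a query built from the left of A and right of B.
refs = {"A": "AAAAAAAAAAAA", "B": "CCCCCCCCCCCC", "C": "GGGGGGGGGGGG"}
print(candidate_parents("AAAAAACCCCCC", refs))
```

Searching chunks separately matters because a chimera as a whole may be closest to neither parent, while each chunk is usually closest to the parent it came from.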

 User provides reference database  Should be high-quality sequences  Believed to be chimera-free  Advantages:  High confidence in predictions  Disadvantages:  Expect high false-negative rate  Ref DB usually doesn't cover all possible parents

 Parents amplified more than chimera  At least one more round  So parents at least 2x more abundant  "Abundance skew" >= 2 (user-settable)  Input is estimated amplicons + abundances  NOT reads!

 Sort amplicons by decreasing abundance  Start with empty DB  For each amplicon:  Search DB for parents with >= 2x abundance  If chimeric hit: ▪ Classify as chimera and discard query  If not chimeric hit: ▪ Add to reference DB
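
The de novo loop above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not UCHIME itself: `toy_is_bimera` is a hypothetical stand-in for the real chimera test (it only recognizes an exact prefix of one parent joined to an exact suffix of another, on equal-length sequences), and `denovo_filter` is an illustrative name.

```python
def toy_is_bimera(query: str, parents: list) -> bool:
    """Toy chimera test (assumption): exact prefix of one parent + exact
    suffix of a different parent, for pre-aligned equal-length sequences."""
    for i, a in enumerate(parents):
        for j, b in enumerate(parents):
            if i == j:
                continue
            for cut in range(1, len(query)):
                if a[:cut] == query[:cut] and b[cut:] == query[cut:]:
                    return True
    return False

def denovo_filter(amplicons, is_chimeric=toy_is_bimera, skew=2.0):
    """amplicons: list of (sequence, abundance). Returns (kept, chimeras)."""
    db, chimeras = [], []          # db starts empty
    for seq, abund in sorted(amplicons, key=lambda x: -x[1]):
        # Only DB entries at least `skew` times more abundant may be parents.
        candidates = [s for s, a in db if a >= skew * abund]
        if len(candidates) >= 2 and is_chimeric(seq, candidates):
            chimeras.append(seq)   # classify as chimera and discard
        else:
            db.append((seq, abund))  # non-chimeric: add to reference DB
    return [s for s, _ in db], chimeras

amplicons = [("AAAAAAAA", 100), ("CCCCCCCC", 80),
             ("AAAACCCC", 5), ("GGGGGGGG", 3)]
kept, chimeras = denovo_filter(amplicons)
print(kept, chimeras)
```

Processing in order of decreasing abundance means that, by the time a low-abundance chimera is examined, its more abundant parents should already be in the database.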

Ref DB vs. de novo: hits found by both; hits found by ref DB only (rare?); hits found by de novo only (common?)

 Two modes check each other  De novo should have better coverage  All parents should be present  Should examine hits found by ref DB but not by de novo  See UCHIME manual for more discussion.

A     81 CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc 160
Q     81 CCTTGGTAGGCCGCTGCCCTGCCAACTA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG 160
B     81 TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG 160
Diffs    A      A     p A   A      A                BBBB     BBB    BBBBB BB   BBa B  B BBB
Votes    Y      Y     A Y   Y      Y                YYYY     YYY    YYYYY YY   YYN Y  Y YYY
Model    AAAAAAAAAAAAAAAAAAAAAAAAAAAA xxxxxxxxxxxxx BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Votes: N = No vote, A = Abstain vote

 Model = segment of A + segment of B  Chimeric if model closer to Q than A or B  Left closer to A and right closer to B  Closer if Y > N  Ratio Y/N > 1
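
The vote counting can be sketched as below, using the left and right segments of the example alignment from the slides. The function name `count_votes` is hypothetical, and treating a column where all three sequences differ as an abstain is an assumption of this sketch. Sequences are upper-cased first because the slide uses lower case only to highlight differences.

```python
def count_votes(q: str, model_parent: str, other_parent: str) -> dict:
    """Count Yes / No / Abstain votes over one aligned segment.

    Yes:     Q matches the parent the model uses here, not the other parent.
    No:      Q matches the other parent instead.
    Abstain: parents agree but Q differs (or all three differ; assumption).
    Columns where all three agree cast no vote.
    """
    votes = {"Y": 0, "N": 0, "A": 0}
    for qc, mc, oc in zip(q.upper(), model_parent.upper(), other_parent.upper()):
        if mc == oc:
            if qc != mc:
                votes["A"] += 1
        elif qc == mc:
            votes["Y"] += 1
        elif qc == oc:
            votes["N"] += 1
        else:
            votes["A"] += 1
    return votes

# Left segment (model = A) and right segment (model = B) from the slide.
A_L = "CCTTGGTAGGCCGtTGCCCTGCCAACTA"
Q_L = "CCTTGGTAGGCCGCTGCCCTGCCAACTA"
B_L = "TCTTGGTgGGCCGtTaCCCcGCCAACaA"
A_R = "gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc"
Q_R = "ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG"
B_R = "ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG"

left = count_votes(Q_L, A_L, B_L)    # model uses A on the left
right = count_votes(Q_R, B_R, A_R)   # model uses B on the right
print(left, right)
```

The counts agree with the Votes row of the alignment: five Yes and one Abstain on the left, and a single No among the Yes votes on the right.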

H = Y / [β(N + n) + A]. Observations: Y = Yes votes, N = No votes, A = Abstain votes. Parameters: β = weight of a No vote, n = prior number of No votes. Larger score = more likely to be a chimera. Default: chimera if H_Left × H_Right ≥ 0.3. The 0.3 threshold adjusts sensitivity vs. false-positive rate.
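
A worked example of the score, as a minimal sketch. The slide does not give values for β and n; the values used here (β = 8, n = 1.4) are assumptions based on what I understand to be the UCHIME defaults, so check the manual before relying on them. The vote counts are those of the example alignment (left: Y = 5, N = 0, A = 1; right: Y = 21, N = 1, A = 0).

```python
def segment_score(yes: int, no: int, abstain: int,
                  beta: float = 8.0, prior_no: float = 1.4) -> float:
    """H = Y / (beta * (N + n) + A); beta and n defaults are assumptions."""
    return yes / (beta * (no + prior_no) + abstain)

def is_chimera(left: tuple, right: tuple, threshold: float = 0.3) -> bool:
    """Chimera if H_Left * H_Right >= threshold (0.3 per the slide)."""
    return segment_score(*left) * segment_score(*right) >= threshold

h_left = segment_score(5, 0, 1)     # 5 / (8 * 1.4 + 1)
h_right = segment_score(21, 1, 0)   # 21 / (8 * 2.4)
print(h_left, h_right, h_left * h_right)
print(is_chimera((5, 0, 1), (21, 1, 0)))
```

Because No votes are multiplied by β, a handful of No votes pulls the score down much faster than Abstain votes do, and the prior n keeps the score modest when a segment has very few votes at all.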

 Real communities  Don't know all the biological sequences  Can't distinguish chimeras from real (that's the problem!)  Mock community  Do you really know all the 16S sequences -- no!  Communities too "easy"  Too few biological sequences, too well separated  Simulation  How realistic is it -- we don't know.  No definitive validation

Length 300 simulated bimeras with 0 - 5% mutations.

Length 300 simulated multimeras with 1% substitutions.

Noisy regions align well enough

 Open-source version  Source code donated to public domain  USEARCH version  Leverages proprietary algorithms  10x or more faster than open-source version

 100x faster than Perseus  1,000x faster than ChimeraSlayer  USEARCH version 10x or more faster again.

 Many subtle issues  Read manual & Supp. Mat. carefully!

 Database incomplete  Missing species  Missing paralogs  16S duplications are common  Probably high rates of false negatives

 Parents should be present  Probably low false negative rate vs. ref db.  False positive rate not well known  Mock community validation may be optimistic  Error correction required  Input MUST be amplicons & abundances  Usually means starting from raw reads  Cannot use on processed seqs. (e.g. RDP)

 Convergent evolution in different clades  Different rates in different regions  Biological chimeras  Bad sequences  Bad alignments

Bad A Good A Errors

 Full-length gene  Shotgun fragments  Paired-end reads  Reference database method  De novo mode not possible with shotgun

 Screen each end separately  Using standard UCHIME in ref db. mode  For each end E1,E2  Find closest parent P1,P2

Estimate the gap between E1 and E2 from each parent: gap d1 using P1, gap d2 using P2. Pad the gap with (d1 + d2)/2 Ns: E1 NNNN... E2
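
The padding step can be sketched as follows. The function name `pad_pair` is hypothetical, and two details are assumptions of this sketch: both ends are already in the same orientation, and a fractional average gap is rounded down.

```python
def pad_pair(e1: str, e2: str, d1: int, d2: int) -> str:
    """Join two read ends with (d1 + d2) / 2 Ns between them.

    d1, d2: gap lengths estimated from parents P1 and P2.
    Integer division (rounding down) is an assumption of this sketch.
    """
    gap = (d1 + d2) // 2
    return e1 + "N" * gap + e2

padded = pad_pair("ACGT", "TTGA", 3, 5)
print(padded)
```

The padded sequence can then be screened like a single amplicon, provided the comparison skips N positions.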

 UCHIME ref db. on padded pair.  Ns don't count as diffs. E1 NNNN... E2

 Ref db. mode can be run with any set of seqs.  Later is more efficient (fewer seqs.)  De novo mode no choice:  Requires full set of amplicons & abundances  Must follow denoise/error correction step  De novo first  Ref db. second  Two modes can check each other

 Based on USEARCH/UCLUST/UCHIME  Described in afternoon talk on clustering.