Computational Analysis of the Taxonomic Classification of Short 16S rRNA Sequences Christel Chehoud Mentor: Brian Haas
Overview Human Microbiome Project 16S rRNA Reference and Test Sets Classifiers Accuracy of Classifications Results
Human Microbiome Project (HMP) Microorganism communities Human development Physiology Immunity Disease Nutrition Core Microbiome
16S rRNA 16S Ribosomal RNA Large RNA component of the small subunit of the ribosome Phylogenetic Markers Species Identification 1542 bp
Using 16S for Species Identification Classifier Sequence Predicted Classification
Project Goal New Sequencing Technology Evaluate the accuracy of the classification of the 16S rRNA across different: Classifiers Regions of the sequence Phylogeny
Reference Dataset RDP Core Set Trusted Taxonomies 6,621 sequences Phylum: 27 Class: 43 Order: 97 Family: 258 Genus: 1352
GreenGenes’s Full Collection of Sequences Full Collection used by GreenGenes High phylogenetic diversity 188,073 sequences
Comparison of Taxonomy Predictions by Method Classified GreenGenes Core Set Using: RDP (Naïve Bayesian), kmerRank, BLAST All Match: 135,269 sequences Phylum: 27 Class: 43 Order: 96 Family: 257 Genus: [count lost in source]
None Match (BLAST, kmerRank, RDP): 19,588 sequences
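The "All Match" and "None Match" groupings above can be illustrated with a minimal sketch. It assumes each classifier returns one genus label per sequence and that agreement is judged on that label alone; the function name `agreement` and the toy predictions are hypothetical and are not taken from the slides.

```python
from collections import Counter


def agreement(blast_call, kmerrank_call, rdp_call):
    """Categorize three genus calls as 'all', 'partial', or 'none' match."""
    distinct = len({blast_call, kmerrank_call, rdp_call})
    if distinct == 1:
        return "all"      # all three classifiers agree
    if distinct == 2:
        return "partial"  # exactly two classifiers agree
    return "none"         # every classifier gives a different genus


# Toy predictions (hypothetical): sequence id -> (BLAST, kmerRank, RDP) genus calls
predictions = {
    "seq1": ("Bacillus", "Bacillus", "Bacillus"),
    "seq2": ("Bacillus", "Bacillus", "Listeria"),
    "seq3": ("Vibrio", "Shewanella", "Aeromonas"),
}

tally = Counter(agreement(*calls) for calls in predictions.values())
all_match_ids = [sid for sid, calls in predictions.items()
                 if agreement(*calls) == "all"]
```

Under this reading, only the sequences in `all_match_ids` would carry forward as trusted taxonomies.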
CD-hit: Normalizing Genus Representation 3% difference between genera Reduced from 135,269 to 21,179 sequences Phylum: 27 Class: 43 Order: 96 Family: 235 Genus: 1241 (Li)
Sliding Window: Producing our Localized Regions Van de Peer, 1996 Sliding Window Approach 300 bp window 25 bp overlap Sanger vs. 454-XLR = Full-length vs. localized region
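The sliding-window step can be sketched as follows. This is only an illustration under one reading of the slide: consecutive 300 bp windows share 25 bp, so each window starts 275 bp after the previous one. If the slide instead meant a 25 bp step, set `overlap=275`. The function name `sliding_windows` and the placeholder sequence are hypothetical.

```python
def sliding_windows(seq, window=300, overlap=25):
    """Return (start, subsequence) pairs of fixed-length windows.

    Consecutive windows share `overlap` bases. A trailing fragment
    shorter than `window` is dropped.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    last_start = len(seq) - window
    return [(start, seq[start:start + window])
            for start in range(0, last_start + 1, step)]


# Placeholder sequence with the 1542 bp length quoted for 16S rRNA
full_length = ("ACGT" * 386)[:1542]
regions = sliding_windows(full_length)
starts = [start for start, _ in regions]
```

Each full-length reference sequence then yields a set of localized regions that can be classified independently and compared against the full-length call.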
Overall Accuracy of the Three Different Classifiers Average: BLASTN 0.843, kmerRank 0.830, RDP 0.831 Standard Deviation: BLASTN 0.031, kmerRank 0.030, RDP 0.017
Genus Prediction Accuracy (per Phylum) Average: BLASTN 0.843, kmerRank 0.830, RDP 0.831 Standard Deviation: BLASTN 0.107, kmerRank 0.153, RDP 0.142
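A per-phylum accuracy summary of this kind could be computed roughly as below. The sketch assumes accuracy is the fraction of sequences whose predicted genus equals the trusted genus, grouped by phylum, and that the spread is the sample standard deviation across phyla; the slides do not state either definition. The record layout and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def genus_accuracy_by_phylum(records):
    """Map each phylum to the fraction of correct genus predictions.

    `records` is an iterable of (phylum, true_genus, predicted_genus).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for phylum, true_genus, predicted_genus in records:
        total[phylum] += 1
        if true_genus == predicted_genus:
            correct[phylum] += 1
    return {phylum: correct[phylum] / total[phylum] for phylum in total}


# Toy records (hypothetical) for a single classifier
records = [
    ("Firmicutes", "Bacillus", "Bacillus"),
    ("Firmicutes", "Listeria", "Bacillus"),
    ("Proteobacteria", "Vibrio", "Vibrio"),
    ("Proteobacteria", "Escherichia", "Escherichia"),
]

accuracy = genus_accuracy_by_phylum(records)
average = mean(accuracy.values())
spread = stdev(accuracy.values())  # sample standard deviation across phyla
```

Running the same summary once per classifier gives one average and one standard deviation per method, which is the shape of the numbers reported on the slide.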
Finding the 16S Region Providing the Most Reliable Prediction Accuracy
Clustering Phyla and Methods by Prediction Accuracy
Best method is phylum-dependent. Variation in accuracy is impacted by depth of species coverage.
Summary: Central region of 16S is the most accurate, on average. Of the methods examined, BLAST is most accurate across all 16S regions and all phyla, on average. RDP-bayes is least variable across short sequence regions. Best short sequence classification method is phylum-dependent.
Acknowledgements Genome Sequencing and Analysis Program Brian Haas Dirk Gevers Michael Feldgarden Doyle Ward Chad Nusbaum Bruce Birren Administration Shawna Young Lucia Vielma Maura Silverstein