18-21 August 2009 The Biosphere
Secondary structure of small subunit ribosomal RNA
  Figure labels: 5' end, 3' end. Image adapted from R. Gutell.
Unaligned rRNA sequences in a multiple alignment editor
Aligned rRNA sequences in editor
Secondary structure of small subunit ribosomal RNA
  Figure labels: 5' end, 3' end. Image adapted from R. Gutell.
The 530 Loop of E. coli
  Figure labels: stem with canonical Watson-Crick base pairing; bulge; non-canonical G-U base pair; loop
The 530 loop of E. coli & T. jannaschii
The 530 loop structure of six species
Six taxa showing aligned 530 loop region of the 16S rRNA
Similarity matrices comparing the 530 loop sequences and the full rRNA sequences of the six listed taxa
  A. Similarity matrix for 530 loop
  B. Similarity matrix for complete 16S rRNA
The Biosphere
  Figure labels: E. coli, AqxPyrop, T. jannaschii, P. freudenreichii, M. vannielii, S. solfa
References
  Acknowledgement of rRNA secondary structure image: Cannone J.J., Subramanian S., Schnare M.N., Collett J.R., D'Souza L.M., Du Y., Feng B., Lin N., Madabusi L.V., Müller K.M., Pande N., Shang Z., Yu N., and Gutell R.R. (2002). The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BioMed Central Bioinformatics, 3:2. [Correction: BioMed Central Bioinformatics, 3:15.]
  Smith T.F., Gutell R., Lee J., and Hartman H. The origin and evolution of the ribosome. Biology Direct, 3:16.
  Woese C.R. Bacterial evolution. Microbiol Rev (2):
  Zuckerkandl E., Pauling L. Molecules as documents of evolutionary history. J Theor Biol. 8(2):
  Cole J., Wang Q., Cardenas E., Fish J., Chai B., Farris R., Kulam-Syed-Mohideen A., McGarrell D., Marsh T., Garrity G., and Tiedje J. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research. In press.
Sequence Alignment Accuracy, Time, Memory
Multiple Sequence Alignment
  Pairwise dynamic programming
    – Smith-Waterman, Needleman-Wunsch
    – Can be transformed into probabilistic framework
  Multidimensional dynamic programming
    – Not practical
  Progressive alignment
    – MUSCLE, ClustalW
    – Both are progressive iterative
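As an illustration of pairwise dynamic programming, here is a minimal Needleman-Wunsch (global alignment) sketch. The function name and the scoring values (match, mismatch, and a linear gap penalty) are assumptions chosen for illustration; real aligners use substitution matrices and affine gap costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has (n+1) x (m+1) cells for two sequences. Extending the same recurrence to k sequences needs a k-dimensional table, which is why multidimensional dynamic programming is listed as not practical.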
BLAST
  Heuristic search strategy
  Locate high-scoring short matches
    – 3 aa or 5 to 11 bases
  Extend short matches
  Determine significance using extreme value distribution statistics
BLAST (cont.)
  E value
    – Database dependent
  Bits
    – Database independent
  % Similarity (identity)
    – For aligned segments
    – NOT overall % identity
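The difference between the database-dependent E value and the database-independent bit score can be sketched with the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'). The function names below are hypothetical, and lambda and K are scoring-system parameters that the caller must supply; BLAST additionally applies effective-length corrections that this sketch omits.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S to bits: S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score.

    E = m * n * 2**(-S'), where m is the query length and n the total
    database length, so E grows with database size while S' does not.
    """
    return query_len * db_len * 2.0 ** (-bits)
```

For a fixed bit score, searching a database 1000 times larger gives an E value 1000 times larger, which is why E values from different databases are not directly comparable.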
Model Based Alignment
  Profile Hidden Markov Models
    – Protein and nucleic acid
    – Models primary sequence
  Stochastic Context-Free Grammars
    – Incorporates RNA secondary structure
Profile HMM
Hidden Markov Model
2D Structure Conserved from Domain to Family
  Diagrams from the Gutell Lab Comparative RNA Web Site
SCFG rRNA Model
SCFG Limitations
  Model primary and secondary structure
    – Can’t model pseudoknots or higher-order interactions
  Time complexity O(ML^3)
    – Solved by Nawrocki et al.
  Space complexity O(ML^2)
    – Est. 16 GB memory for rRNA
    – Solved by Eddy
  Partial sequences
    – Disrupt internal alignment
    – Solved by Nawrocki et al.
Aligner References
  MUSCLE
  BLAST
  HMMER
  INFERNAL
Distance Calculation
  Phylogenetic methods only score base substitution, not insertion or deletion.
  Score comparable positions
    – Mask out unaligned regions, insertions
    – Ignore positions with deletion
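Scoring only comparable positions can be sketched as an uncorrected distance over two aligned sequences. The function name, the optional boolean `mask`, and the gap characters are assumptions made for this example.

```python
def masked_distance(seq_a, seq_b, mask=None, gap_chars="-."):
    """Fraction of substitutions over comparable alignment columns.

    Columns excluded by `mask` (False entries) and columns where either
    sequence has a gap are skipped, so insertions and deletions do not
    contribute to the distance.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for i, (x, y) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if mask is not None and not mask[i]:
            continue  # masked-out (unaligned or insert) column
        if x in gap_chars or y in gap_chars:
            continue  # deletion in one of the two sequences
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared
```

For example, `masked_distance("ACG-UA", "ACGAUC")` skips the gapped fourth column and counts one substitution in five comparable columns.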
Other Common Distances
  Hamming distance
    – No gaps or insertions
    – Original BLAST
  Edit distance
    – Penalize for gaps
    – RDP Probe Match
  Matching word percentage (q-gram)
    – Does not require alignment
    – RDP Sequence Match
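The three distances above can be compared in a few lines. This is a sketch, not the RDP tools' code: the function names are hypothetical, and the q-gram score's word length (q=7) and its normalization by the smaller word set are assumptions chosen for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions; sequences must have equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def qgram_match_fraction(a, b, q=7):
    """Shared unique words of length q, divided by the smaller word set."""
    words_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    words_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not words_a or not words_b:
        raise ValueError("sequences must be at least q long")
    return len(words_a & words_b) / min(len(words_a), len(words_b))
```

Only the q-gram score works on unaligned input of different lengths without any dynamic programming, which is what makes it fast enough for nearest-neighbor search.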
Clustering Accuracy, Time, Memory
Unsupervised Classification (Clustering)
  Hierarchical Agglomerative
    – Single Linkage (Nearest Neighbor)
    – Average Linkage (UPGMA)
    – Complete Linkage (Furthest Neighbor)
  Partitional Clustering
    – K-Means
    – Not often used in this field
  Self Organizing Maps
    – Using word frequency
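The three hierarchical agglomerative linkages differ only in how the distance between two clusters is derived from the pairwise distances of their members. The naive sketch below makes that explicit; the function name and the list-of-lists distance matrix are assumptions, and the repeated full scan is far slower than production implementations.

```python
def agglomerative(dist, cutoff, linkage="complete"):
    """Merge clusters until the closest pair is farther apart than `cutoff`.

    dist    -- symmetric matrix of pairwise distances (list of lists)
    linkage -- "single" (min), "complete" (max) or "average" (mean, UPGMA)
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(c1, c2):
        return combine([dist[i][j] for i in c1 for j in c2])

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break  # no remaining pair is within the cutoff
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With three sequences in a chain (0 to 1 and 1 to 2 at distance 0.02, 0 to 2 at 0.04) and a cutoff of 0.03, single linkage joins all three, while complete linkage keeps sequence 2 apart because it is more than 0.03 from sequence 0.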
Hierarchical Clustering
  Figure labels: ≤0.03; Complete Linkage; Single Linkage
FastGroupII
Supervised Classification
  K-Nearest Neighbors
    – SeqMatch, MEGAN, easyTaxon
    – Last Common Ancestor
  Bayesian
    – RDP Classifier
  Kernel methods
    – Support Vector Machines
RDP-II Screenshots
  Fast search algorithm; limit searches to sequences spanning specific regions; change depth and edit distance
  Place sequences into bacterial taxonomy; works well with partial or full-length sequences; bootstrap confidence estimate; prior alignment not required
  Finds nearest neighbor; more accurate than BLAST; uses “q-gram” matching method
RDP Pyrosequencing Pipeline: Tools for high-throughput analysis
Thirty-One Years of rRNA Sequencing
Twenty-Eight Years Later Proc. Natl. Acad. Sci., USA Vol. 103, No. 32, pp , August
Multiplexed Amplicon Pyrosequencing
RDP Pyrosequencing Pipeline
Initial Processing Steps
  Sort by barcode (key)
  Quality filter
    – Forward & (optional) reverse primers
    – Ambiguities
    – Length
  Trim key & primer sequences
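The initial processing steps can be sketched as a small function pair. This is an illustration under stated assumptions, not the pipeline's implementation: the function names, the exact-match primer test, and the default thresholds (`min_len`, `max_ambig`) are hypothetical, and the optional reverse-primer check is left out.

```python
from collections import defaultdict


def process_read(read, barcodes, fwd_primer, min_len=50, max_ambig=0):
    """Return (sample, trimmed_sequence), or None if the read is rejected.

    barcodes maps each barcode (key) sequence to a sample name.
    """
    read = read.upper()
    # Sort by barcode: find the key the read starts with.
    for key, sample in barcodes.items():
        if read.startswith(key):
            break
    else:
        return None  # no known barcode
    rest = read[len(key):]
    # Quality filter: forward primer must follow the key (exact match here).
    if not rest.startswith(fwd_primer):
        return None
    # Trim key and primer sequences.
    trimmed = rest[len(fwd_primer):]
    # Quality filter: ambiguity codes and minimum length.
    ambiguous = sum(base not in "ACGT" for base in trimmed)
    if ambiguous > max_ambig or len(trimmed) < min_len:
        return None
    return sample, trimmed


def sort_by_barcode(reads, barcodes, fwd_primer, **filters):
    """Bin the reads that pass the filters by sample name."""
    bins = defaultdict(list)
    for read in reads:
        result = process_read(read, barcodes, fwd_primer, **filters)
        if result is not None:
            sample, trimmed = result
            bins[sample].append(trimmed)
    return dict(bins)
```

A production filter would normally tolerate a small number of primer mismatches rather than require an exact match.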
Two Analysis Tracks
  Taxonomy Independent
    – Global Alignment
    – Cluster Based OTU Assignment
    – Standard Ecological Metrics
    – Many 3rd Party Data Formats
  Taxonomy Dependent
    – RDP Classifier
    – Sequence Match
    – Many 3rd Party Data Formats
Model Based Alignment
  Infernal Aligner
    – (Nawrocki and Eddy. 2007, PLoS Comput Biol)
  Fast - 500/min
  Probabilistic Model
    – Model describes shared features
  Incorporates 2D Structure
    – Cannone et al. 2002, BioMed Central Bioinformatics
Complete Linkage Clustering (Operational Taxonomic Units)
  Distance based method
  Guaranteed intra-cluster distance (≤0.03 in the figure)
  N^2 algorithm
  Current online limit 150,000 unique reads
  Memory-efficient version in testing
RDP Naive Bayesian Classifier
  Fast /min
  Places sequences into bacterial taxonomy
  Works well on partial or full-length sequences
  Does not require alignment
  Easily re-trained to match new taxonomies
  Bootstrap confidence estimates
  Online GUI - SOAP service - Open source
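A toy version of an alignment-free, word-based naive Bayesian classifier with bootstrap confidence is sketched below. It loosely follows the approach described by Wang et al. (2007), but the class name, the smoothing formulas as written here, the default word size, and the bootstrap subset size (one eighth of the query's words) are simplifying assumptions, not the RDP Classifier's code.

```python
import math
import random
from collections import Counter


def words(seq, k):
    """Set of unique overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class NaiveBayesSketch:
    def __init__(self, training, k=8):
        """training: list of (genus, sequence) pairs."""
        self.k = k
        self.n_total = len(training)
        self.word_seq_count = Counter()  # training sequences containing each word
        self.genus_size = Counter()      # training sequences per genus
        self.genus_words = {}            # per-genus word counts
        for genus, seq in training:
            ws = words(seq, k)
            self.word_seq_count.update(ws)
            self.genus_size[genus] += 1
            self.genus_words.setdefault(genus, Counter()).update(ws)

    def _log_score(self, genus, ws):
        counts, size = self.genus_words[genus], self.genus_size[genus]
        total = 0.0
        for w in ws:
            # Smoothed overall word prior, then smoothed genus-specific probability.
            prior = (self.word_seq_count[w] + 0.5) / (self.n_total + 1)
            total += math.log((counts[w] + prior) / (size + 1))
        return total

    def best_genus(self, ws):
        return max(self.genus_size, key=lambda g: self._log_score(g, ws))

    def classify(self, seq, trials=100, rng=None):
        """Return (genus, bootstrap confidence in [0, 1])."""
        rng = rng or random.Random(0)
        ws = sorted(words(seq, self.k))
        if not ws:
            raise ValueError("query shorter than word size")
        assigned = self.best_genus(ws)
        subset = max(1, len(ws) // 8)
        # Bootstrap: re-classify random word subsets drawn with replacement.
        hits = sum(self.best_genus(rng.choices(ws, k=subset)) == assigned
                   for _ in range(trials))
        return assigned, hits / trials
```

Because the score depends only on which words a query contains, partial reads need no alignment, and re-training for a new taxonomy only means rebuilding the word counts.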
Classifier Accuracy on 200 bp Regions
  From Wang et al., AEM, 2007
RDP Classifier Bootstrap Performance (Genus Level - Short Reads)
  Table columns: regions V3, V6, V4, each at bootstrap cutoffs 0%, 50%, 80%
  Table rows: Human Gut (% classified, % matching); Soil (% classified, % matching)