
18-21 August 2009 The Biosphere

Secondary structure of small subunit ribosomal RNA (figure labels: 5' end, 3' end). Image adapted from R. Gutell

Unaligned rRNA sequences in a multiple alignment editor

Aligned rRNA sequences in editor

Secondary structure of small subunit ribosomal RNA (figure labels: 5' end, 3' end). Image adapted from R. Gutell

The 530 Loop of E. coli (figure labels: stem with canonical Watson-Crick base pairing; bulge; non-canonical G-U base pair; loop)

The 530 loop of E. coli & T. jannaschii

The 530 loop structure of six species

Six taxa showing aligned 530 loop region of the 16S rRNA

Similarity matrices comparing the 530 loop sequences and the full rRNA sequences of the six listed taxa
A. Similarity matrix for 530 loop
B. Similarity matrix for complete 16S rRNA

The Biosphere (figure labels: E. coli, AqxPyrop, T. jannaschii, P. freudenreichii, M. vannielii, S.solfa)

References

Acknowledgement of rRNA secondary structure image:
Cannone J.J., Subramanian S., Schnare M.N., Collett J.R., D'Souza L.M., Du Y., Feng B., Lin N., Madabusi L.V., Müller K.M., Pande N., Shang Z., Yu N., and Gutell R.R. (2002). The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BioMed Central Bioinformatics, 3:2. [Correction: BioMed Central Bioinformatics, 3:15.]

Smith T.F., Gutell R., Lee J., and Hartman H. The origin and evolution of the ribosome. Biology Direct, 3:16.
Woese C.R. Bacterial evolution. Microbiol Rev (2):
Zuckerkandl E., Pauling L. Molecules as documents of evolutionary history. J Theor Biol. 8(2):
Cole J., Wang Q., Cardenas E., Fish J., Chai B., Farris R., Kulam-Syed-Mohideen A., McGarrell D., Marsh T., Garrity G., and Tiedje J. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research. In press.

Sequence Alignment: Accuracy, Time, Memory

Multiple Sequence Alignment
Pairwise dynamic programming
– Smith-Waterman, Needleman-Wunsch
– Can be transformed into probabilistic framework
Multidimensional dynamic programming
– Not practical
Progressive alignment
– Muscle, ClustalW
– Both are progressive iterative
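
To make the pairwise dynamic programming bullet concrete, here is a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match, mismatch, linear gap penalty) are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    The scoring parameters are illustrative defaults, not from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGU", "AGU")` aligns the sequences as `ACGU` over `A-GU`. The table has (len(a)+1) x (len(b)+1) cells, which is why extending the same idea to many sequences at once (multidimensional dynamic programming) quickly becomes impractical.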

BLAST
Heuristic search strategy
Locate high-scoring short matches
– 3 aa or 5 to 11 bases
Extend short matches
Determine significance using extreme value distribution statistics
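
The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches as seeds and an ungapped extension that stops once the score falls too far below its best value. The function names, scoring values, and drop-off threshold are all assumptions made for the example.

```python
def find_seeds(query, subject, w=5):
    """Return (query_pos, subject_pos) pairs where a w-letter word matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [
        (qi, sj)
        for sj in range(len(subject) - w + 1)
        for qi in index.get(subject[sj:sj + w], [])
    ]


def extend_seed(query, subject, qi, sj, w=5, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of one seed; returns (best_score, query_start, query_end)."""
    best = score = w * match
    best_right = best_left = 0
    # Extend to the right of the seed.
    k = 0
    while qi + w + k < len(query) and sj + w + k < len(subject):
        score += match if query[qi + w + k] == subject[sj + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    k = 0
    while qi - 1 - k >= 0 and sj - 1 - k >= 0:
        score += match if query[qi - 1 - k] == subject[sj - 1 - k] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break
    return best, qi - best_left, qi + w + best_right
```

With `query = "GGACGTACGTT"` and `subject = "CCACGTACGAA"`, the shared stretch `ACGTACG` yields three overlapping 5-letter seeds, and extending the first one recovers the full 7-letter match.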

BLAST (cont.)
E value
– Database dependent
Bits
– Database independent
% Similarity (identity)
– For aligned segments
– NOT overall % identity
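
The contrast between bit scores and E values follows from the standard Karlin-Altschul relations: the bit score is S' = (lambda * S - ln K) / ln 2, and the E value is E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below applies these formulas; the lambda and K values in the usage note are placeholders chosen for illustration, since the real values depend on the scoring system.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score; lam and k come from the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits` in this search space."""
    return query_len * db_len * 2.0 ** (-bits)
```

For instance, with placeholder values `lam=0.3` and `k=0.1`, `bit_score(100, 0.3, 0.1)` is the same no matter which database is searched, while `e_value(...)` doubles when the database length doubles. That is the sense in which bits are database independent and E values are not.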

Model Based Alignment
Profile Hidden Markov Models
– Protein and nucleic acid
– Models primary sequence
Stochastic Context-Free Grammars
– Incorporates RNA secondary structure

Profile HMM

Hidden Markov Model

Hidden Markov Model

Hidden Markov Model

2D Structure Conserved from Domain to Family Diagrams from the Gutell Lab Comparative RNA Web Site (

SCFG rRNA Model

SCFG Limitations
Model primary and secondary structure
– Can’t model pseudoknots or higher-order interactions
Time complexity O(ML^3)
– Solved by Nawrocki et al.
Space complexity O(ML^2)
– Est. 16 GB memory for rRNA
– Solved by Eddy
Partial sequences
– Disrupt internal alignment
– Solved by Nawrocki et al.

Aligner References: MUSCLE, BLAST, HMMER, INFERNAL

Distance Calculation
Phylogenetic methods only score base substitution, not insertion or deletion.
Score comparable positions
– Mask out unaligned regions, insertions
– Ignore positions with deletion
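
As a minimal sketch of scoring only comparable positions, the function below computes an uncorrected fraction of mismatches between two aligned sequences. It skips columns excluded by an optional mask and any column where either sequence has a gap. The function name, the mask representation (a list of booleans), and the gap characters are assumptions made for the example.

```python
def masked_distance(seq_a, seq_b, mask=None, gap_chars="-."):
    """Fraction of mismatches over comparable columns of two aligned sequences.

    mask[i] is True for columns to keep; gap columns are always skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    if mask is None:
        mask = [True] * len(seq_a)
    compared = mismatches = 0
    for a, b, keep in zip(seq_a, seq_b, mask):
        if not keep:
            continue  # masked-out column (unaligned region or insertion)
        if a in gap_chars or b in gap_chars:
            continue  # a deletion in either sequence is not scored
        compared += 1
        if a.upper() != b.upper():
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared
```

For `masked_distance("ACGU-ACGU", "ACGAUAC-U")`, two columns contain a gap and are ignored, leaving seven comparable positions with one substitution.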

Other Common Distances
Hamming distance
– No gaps or insertions
– Original BLAST
Edit distance
– Penalize for gaps
– RDP Probe Match
Matching word percentage (q-gram)
– Does not require alignment
– RDP Sequence Match
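
The three measures can be compared side by side with a small sketch. These are textbook versions written for illustration; they are not the code used by BLAST or the RDP tools, and the word size `q` and the normalization in `qgram_similarity` are assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion from a
                            curr[j - 1] + 1,          # insertion into a
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


def qgram_similarity(a, b, q=3):
    """Shared unique q-letter words, relative to the smaller word set."""
    words_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    words_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))
```

Note that `qgram_similarity` never aligns the two sequences; it only compares their sets of words, which is why this kind of measure works without a prior alignment.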

Clustering: Accuracy, Time, Memory

Unsupervised Classification (Clustering)
Hierarchical Agglomerative
– Single Linkage (Nearest Neighbor)
– Average Linkage (UPGMA)
– Complete Linkage (Furthest Neighbor)
Partitional Clustering
– K-Means
– Not often used in this field
Self Organizing Maps
– Using word frequency

Hierarchical Clustering (figure labels: ≤0.03; Complete Linkage; Single Linkage)
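
The difference between the two linkage rules at a fixed cutoff can be shown with a naive agglomerative sketch. It is written for clarity rather than speed and is not the RDP implementation; the function name and the choice to pass the linkage rule as `min` or `max` are assumptions made for the example.

```python
def agglomerate(dist, cutoff, linkage=max):
    """Merge clusters until the closest pair is farther apart than cutoff.

    dist is a symmetric matrix (list of lists). linkage=max gives complete
    linkage (furthest neighbor); linkage=min gives single linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > cutoff:
            break  # no remaining pair of clusters is within the cutoff
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)
```

Take three sequences in a chain, where neighbors are 0.02 apart but the two ends are 0.04 apart. At a 0.03 cutoff, single linkage joins all three, while complete linkage keeps the third sequence separate because it would push the within-cluster distance above the cutoff.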

FastGroupII

Supervised Classification
K-Nearest Neighbors
– SeqMatch, Megan, easyTaxon
– Last Common Ancestor
Bayesian
– RDP Classifier
Kernel methods
– Support Vector Machines

RDP-II Screenshots
Fast search algorithm; limit searches to sequences spanning specific regions; change depth and edit distance
Place sequences into bacterial taxonomy; works well with partial or full-length sequences; bootstrap confidence estimate; prior alignment not required
Finds nearest neighbor; more accurate than BLAST; uses “q-gram” matching method

RDP Pyrosequencing Pipeline: tools for high-throughput analysis

Thirty-One Years of rRNA Sequencing

Twenty-Eight Years Later Proc. Natl. Acad. Sci., USA Vol. 103, No. 32, pp , August

Multiplexed Amplicon Pyrosequencing

RDP Pyrosequencing Pipeline

Initial Processing Steps
Sort by barcode (key)
Quality filter
– Forward & (optional) reverse primers
– Ambiguities
– Length
Trim key & primer sequences
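
The steps above can be sketched for a single read as follows. This is a simplified illustration, not the RDP pipeline code: it assumes exact barcode and forward-primer matches, checks no reverse primer, and uses hypothetical names and thresholds (`process_read`, `min_length`, `max_ambiguous`).

```python
def process_read(read, barcodes, forward_primer, min_length=50, max_ambiguous=0):
    """Return (sample, trimmed_sequence), or None if the read fails a filter.

    barcodes maps a key sequence to a sample name. Exact matching and the
    default thresholds are simplifying assumptions.
    """
    # Sort by barcode: find which sample key the read starts with.
    for key, sample in barcodes.items():
        if read.startswith(key):
            break
    else:
        return None  # no known barcode
    rest = read[len(key):]
    # Quality filter: the forward primer must follow the key.
    if not rest.startswith(forward_primer):
        return None
    # Trim key and primer, then filter on ambiguities and length.
    insert = rest[len(forward_primer):]
    if insert.count("N") > max_ambiguous:
        return None
    if len(insert) < min_length:
        return None
    return sample, insert
```

A read that passes comes back with its sample name and with the key and primer removed, so downstream alignment or classification sees only the amplified region.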

Two Analysis Tracks
Taxonomy Independent
– Global Alignment
– Cluster Based OTU Assignment
– Standard Ecological Metrics
– Many 3rd Party Data Formats
Taxonomy Dependent
– RDP Classifier
– Sequence Match
– Many 3rd Party Data Formats

Model Based Alignment
Infernal Aligner
– (Nawrocki and Eddy. 2007, PLoS Comput Biol)
Fast - 500/min
Probabilistic Model
– Model describes shared features
Incorporates 2D Structure
– Cannone et al. 2002, BioMed Central Bioinformatics

Complete Linkage Clustering (Operational Taxonomic Units)
Distance based method
Guaranteed intra-cluster distance
N^2 algorithm
Current online limit 150,000 unique reads
Memory-efficient version in testing
Figure label: ≤0.03

RDP Naive Bayesian Classifier
Fast /min
Places sequences into bacterial taxonomy
Works well on partial or full-length sequences
Does not require alignment
Easily re-trained to match new taxonomies
Bootstrap confidence estimates
Online GUI - SOAP service - Open source
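
To illustrate how a word-based naive Bayesian classifier can work without alignment and still report a bootstrap confidence, here is a toy sketch. It is not the RDP Classifier's code. The word size `k=4`, the smoothing formula, the number of trials, and the one-eighth subsampling fraction are all assumptions chosen for this example, and the training data in the usage note is invented.

```python
import math
import random
from collections import defaultdict


def words(seq, k):
    """Set of overlapping k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(labelled, k=4):
    """labelled is a list of (genus, sequence) pairs; returns a model dict."""
    word_count = defaultdict(int)                       # sequences containing word
    sizes = defaultdict(int)                            # sequences per genus
    genus_words = defaultdict(lambda: defaultdict(int))
    for genus, seq in labelled:
        sizes[genus] += 1
        for w in words(seq, k):
            word_count[w] += 1
            genus_words[genus][w] += 1
    return {"k": k, "n": len(labelled), "word_count": word_count,
            "sizes": sizes, "genus_words": genus_words}


def log_score(model, genus, query_words):
    """Sum of log word probabilities for one genus (assumed smoothing scheme)."""
    total = 0.0
    for w in query_words:
        prior = (model["word_count"].get(w, 0) + 0.5) / (model["n"] + 1)
        seen = model["genus_words"][genus].get(w, 0)
        total += math.log((seen + prior) / (model["sizes"][genus] + 1))
    return total


def classify(model, seq, trials=100, seed=0):
    """Return (best_genus, fraction of bootstrap trials agreeing with it)."""
    all_words = sorted(words(seq, model["k"]))
    if not all_words:
        raise ValueError("query is shorter than the word size")
    genera = sorted(model["sizes"])
    best = max(genera, key=lambda g: log_score(model, g, all_words))
    rng = random.Random(seed)
    subset_size = max(1, len(all_words) // 8)  # assumed subsampling fraction
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(all_words) for _ in range(subset_size)]
        if max(genera, key=lambda g: log_score(model, g, sample)) == best:
            hits += 1
    return best, hits / trials
```

With a model trained on `[("Alpha", "ACGTACGTACGTACGTACGT"), ("Beta", "TTGGCCAATTGGCCAATTGG")]`, the query `"ACGTACGTACGT"` shares every word with the Alpha sequence and none with Beta, so every bootstrap trial agrees. Retraining for a new taxonomy only means calling `train` again with relabelled sequences.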

Classifier Accuracy on 200 bp Regions (from Wang et al., AEM, 2007)

RDP Classifier Bootstrap Performance (Genus Level - Short Reads)
Table columns: regions V3, V6 and V4, each at bootstrap cutoffs of 0%, 50% and 80%
Table rows: Human Gut (% classified, % matching); Soil (% classified, % matching)