B. Knudsen and J. Hein Department of Genetics and Ecology

RNA Secondary Structure Prediction Using Stochastic Context-Free Grammar and Evolutionary History
B. Knudsen and J. Hein
Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus, Denmark
Presented by Jing Cui, Nov. 22, 2002

Outline of the Lecture
- Introduction
- Algorithms
  - The grammar
  - Probabilities of columns
  - Probabilities of an alignment
  - The full model
- Implementation
  - The database
  - Frequencies
  - Mutation rates
  - Grammar parameters
- Results
  - The test sequences
  - Using related sequences
  - Neglecting phylogeny
  - Weight of results
  - Comparison with other methods
- Conclusion
  - The limitations
  - The improvements

Introduction
- Single-sequence methods, e.g. Zuker (1989)
  - use prior information on RNA structures, through energy functions
  - not ideal when estimating structures of sequences with known homologs
- Multiple-sequence methods
  - Covariance methods (Eddy and Durbin, 1994)
  - Profile stochastic context-free grammars (SCFGs; Sakakibara et al., 1994)
  - Characteristics: they do not explicitly take phylogeny into account, and they do not use a prior probability distribution of structures
  - Maximum weighted matching methods (Cary and Stormo, 1995; Tabaska et al., 1998) share the above characteristics

The method used in this paper
- Uses prior knowledge about RNA structure to make a maximum a posteriori (MAP) estimate of the secondary structure.
- Is performed on an alignment of sequences assumed to have identical secondary structure, i.e. the alignment is assumed to be a structural alignment.
- Takes the phylogenetic tree of the sequences into account, including branch lengths, using a model of mutation processes in RNA. The tree can be estimated by a maximum likelihood (ML) method.
- Origin: Goldman et al. (1996), predicting protein secondary structure using HMMs that include phylogenetic information.
- Difference: secondary structure interactions in RNA are not local, as they are in proteins, so SCFGs are used here instead of HMMs.
- Limitation: SCFGs are unable to model crossing interactions, so pseudoknots cannot be predicted.

Algorithm
- Input: an alignment of RNA sequences
- Output: a single common structure for the sequences
- The model
  - The SCFG
  - The evolutionary model

The grammar
- A set of symbols, some terminal and some non-terminal
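The production rules themselves are not preserved in this transcript. The grammar from the Knudsen and Hein paper is usually quoted as S → LS | L, F → dFd | LS, L → s | dFd, where s is an unpaired position and the two d symbols are the two sides of a base pair. The sketch below samples dot-bracket structures from that grammar. The rule probabilities and all function names are assumptions made for illustration, not the trained values from the paper.

```python
import random

# Rule probabilities are hypothetical, chosen only so that sampling terminates quickly.
# "(" and ")" stand for the left and right d of a pair, "." stands for s.
RULES = {
    "S": [(0.87, ["L", "S"]), (0.13, ["L"])],
    "F": [(0.79, ["(", "F", ")"]), (0.21, ["L", "S"])],
    "L": [(0.90, ["."]), (0.10, ["(", "F", ")"])],
}


def sample_structure(rng, start="S"):
    """Expand the start symbol left to right, choosing each rule at random."""
    out = []
    stack = [start]
    while stack:
        symbol = stack.pop()
        if symbol not in RULES:
            out.append(symbol)  # terminal: emit it
            continue
        r = rng.random()
        acc = 0.0
        for prob, rhs in RULES[symbol]:
            acc += prob
            if r < acc:
                break
        # Push the right-hand side reversed so the leftmost symbol is expanded first.
        stack.extend(reversed(rhs))
    return "".join(out)


def is_balanced(structure):
    """Check that every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    print(sample_structure(random.Random(1)))
```

Because every pair is produced as a matched "(" ... ")" around an F, the grammar can only generate nested structures, which is why crossing interactions such as pseudoknots are out of reach.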

Probabilities of columns
- Given the tree, a column of non-pairing bases is assumed to be independent of the other columns.
- Two pairing columns are assumed to be independent of all other columns.
- P = (pA, pU, pG, pC): the distribution of bases in loop regions of RNA sequences
- The rate matrix: reversibility of mutations
- For base pairs: a 16 by 16 rate matrix
- Given a tree, including branch lengths, the column probabilities are calculated using post-order traversal as described by Felsenstein (1981)
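The post-order calculation for a single unpaired column can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes the Jukes-Cantor model (uniform base frequencies, one rate) for the paper's estimated reversible rate matrix, it ignores gaps, and the tree encoding and all names are assumptions. A paired column would use the same recursion over 16 states.

```python
import math
from itertools import product

BASES = "AUGC"


def jc_transition(t):
    """Jukes-Cantor transition probabilities for branch length t (a stand-in model)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return {(a, b): (same if a == b else diff) for a in BASES for b in BASES}


def partial_likelihoods(node, column):
    """Post-order pass: P(observed bases below node | base at node), for each base.

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs. `column` maps sequence name -> observed base.
    """
    if isinstance(node, str):
        observed = column[node]
        return {b: 1.0 if b == observed else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_like = partial_likelihoods(child, column)
        trans = jc_transition(t)
        for b in BASES:
            result[b] *= sum(trans[(b, c)] * child_like[c] for c in BASES)
    return result


def column_probability(tree, column, freqs):
    """Probability of one alignment column given the tree and root base frequencies."""
    root_like = partial_likelihoods(tree, column)
    return sum(freqs[b] * root_like[b] for b in BASES)


# Hypothetical three-sequence tree: ((seq1:0.1, seq2:0.2):0.05, seq3:0.3)
example_tree = (((("seq1", 0.1), ("seq2", 0.2)), 0.05), ("seq3", 0.3))
uniform = {b: 0.25 for b in BASES}

if __name__ == "__main__":
    col = {"seq1": "A", "seq2": "A", "seq3": "G"}
    print(column_probability(example_tree, col, uniform))
```

Each node is visited once per column, so the cost is linear in the number of sequences for a fixed alphabet size.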

Probability of an alignment
- The input data: D = (C1, C2, …, Cl)
- The model: M
- The tree: T
- Secondary structure: σ
- s: a single base
- d: the left column of a pair
- dc: the right column of the pair
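The equation from this slide is not preserved in the transcript. Under the independence assumptions stated for columns, the probability of the alignment given a structure would factor into one term per unpaired column and one term per pair of columns. The following is a reconstruction from the symbols defined above, with d^c written for the slide's dc:

```latex
P(D \mid \sigma, T, M)
  = \prod_{s \in \sigma} P(C_s \mid T, M)
    \prod_{(d,\, d^{c}) \in \sigma} P(C_d, C_{d^{c}} \mid T, M)
```

Here the first product runs over the single-base positions of σ and the second over its base pairs.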

The core model
- The SCFG
- The evolutionary model
- The grammar is equivalent to a grammar that generates columns of an alignment instead of just secondary structure, meaning that, for a two-sequence alignment, the production rule covers the following rules:

The full model
- If no phylogenetic tree is given: the ML estimate of the tree, given the model, is used.
- MAP (maximum a posteriori) estimation of the most likely secondary structure by Bayes' theorem, where P(σ|T,M) is the prior distribution of structures given by the SCFG.
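The formulas from this slide are not preserved in the transcript. A reconstruction consistent with the description, using the slide's symbols D, T, M and σ (the superscripts ML and MAP are notation assumed here), would be:

```latex
\begin{align}
T^{ML} &= \arg\max_{T} \, P(D \mid T, M) \\
\sigma^{MAP} &= \arg\max_{\sigma} \, P(\sigma \mid D, T, M) \\
  &= \arg\max_{\sigma} \, \frac{P(D \mid \sigma, T, M)\, P(\sigma \mid T, M)}{P(D \mid T, M)} \\
  &= \arg\max_{\sigma} \, P(D \mid \sigma, T, M)\, P(\sigma \mid T, M)
\end{align}
```

The denominator does not depend on σ, so it can be dropped from the maximization. When the tree is estimated rather than given, T is replaced by the ML estimate in the last three lines.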

Implementation: The database
- The database used for estimating this model should represent RNA structure in general.
- The database should be composed of various types of RNA.
  - tRNA database by Sprinzl et al. (1998)
  - ribosomal RNAs (LSU rRNAs) by De Rijk et al. (1998)

Frequencies
- The single-base frequencies were estimated from counts of the bases in the single-base positions of the sequences.
- Base-pair frequencies were estimated by counting base pairs.
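The counting step can be sketched as below, assuming each database entry is a sequence with its known structure in dot-bracket notation. The input format, the function names and the toy entry are assumptions for illustration. Whether the paper symmetrizes or smooths the pair counts is not stated on the slide, so nothing of that kind is done here.

```python
from collections import Counter


def pair_table(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def estimate_frequencies(examples):
    """Relative frequencies of unpaired bases and of base pairs.

    `examples` is a list of (sequence, dot_bracket_structure) tuples.
    """
    single, paired = Counter(), Counter()
    for seq, structure in examples:
        for i, j in pair_table(structure):
            paired[seq[i] + seq[j]] += 1
        for base, ch in zip(seq, structure):
            if ch == ".":
                single[base] += 1
    n_single = sum(single.values())
    n_paired = sum(paired.values())
    single_freq = {b: c / n_single for b, c in single.items()}
    pair_freq = {p: c / n_paired for p, c in paired.items()}
    return single_freq, pair_freq


if __name__ == "__main__":
    # Toy hairpin, not taken from the tRNA or rRNA databases.
    print(estimate_frequencies([("GGGAAAUCCC", "(((....)))")]))
```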

Mutation rates
For a given pair of sequences, p:
- tp: the time between the sequences
- Np: the number of columns in the two-sequence alignment
- Ps: the probability of a base being in a single-base position

Grammar parameters
- Estimated by the inside-outside algorithm (an expectation maximization procedure) on the training set of secondary structures (Baker, 1979; Lari and Young, 1990)
- This is just like the forward-backward algorithm for HMMs!
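To make the comparison with forward-backward concrete, here is a minimal sketch of the inside pass only, which plays the role of the forward pass. It assumes the grammar S → LS | L, F → dFd | LS, L → s | dFd as usually quoted for this paper, runs on a single sequence rather than on alignment columns, and uses hypothetical rule and emission probabilities. The outside pass and the re-estimation step are omitted.

```python
import math

BASES = "AUGC"

# Hypothetical parameters for illustration only.
RULE_P = {"S->LS": 0.87, "S->L": 0.13, "F->dFd": 0.79,
          "F->LS": 0.21, "L->s": 0.90, "L->dFd": 0.10}
E_SINGLE = {b: 0.25 for b in BASES}
E_PAIR = {a + b: 1.0 / 16.0 for a in BASES for b in BASES}


def inside(seq, p=RULE_P, e_single=E_SINGLE, e_pair=E_PAIR):
    """Total probability of seq summed over all structures the grammar allows.

    X[i][j] is the probability that non-terminal X generates seq[i..j] inclusive.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    S = [[0.0] * n for _ in range(n)]
    L = [[0.0] * n for _ in range(n)]
    F = [[0.0] * n for _ in range(n)]
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            # Pair seq[i] with seq[j] around a non-empty F in between.
            inner = F[i + 1][j - 1] if span >= 3 else 0.0
            pair = e_pair[seq[i] + seq[j]] * inner
            # Split the span into an L followed by an S.
            split = sum(L[i][k] * S[k + 1][j] for k in range(i, j))
            unpaired = p["L->s"] * e_single[seq[i]] if span == 1 else 0.0
            L[i][j] = unpaired + p["L->dFd"] * pair
            F[i][j] = p["F->dFd"] * pair + p["F->LS"] * split
            S[i][j] = p["S->LS"] * split + p["S->L"] * L[i][j]
    return S[0][n - 1]


if __name__ == "__main__":
    print(inside("GGGAAAUCCC"))
```

The two loops over spans and start positions, together with the inner sum over split points, give cubic running time in the sequence length.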

Results: The test sequences
- 4 bacterial RNase P RNA sequences
- Alignment: 385 columns
- Pairwise sequence identities: 65-92%

Pseudoknot
- 68-76 and 368-361; 18-12 and 370-364
- At least 22 positions wrongly predicted in each sequence

Using related sequences

Weight of results
- Using the inside and outside variables, calculate the probability that each position is correctly predicted.
- This shows how certain the predictions are, assuming that the model is correct.

Comparison with other methods
- The energy minimization method has more parameters and gives better results.
- COVE (Eddy and Durbin, 1994) gives lower accuracy.
- This shows the significance of the method described here in situations where only a few sequences are known.

Conclusion: Limitations
- Inability to predict pseudoknots.
- Loop and stem lengths are assumed to be geometrically distributed.
  - This follows from the nature of the specific SCFG used here.
- A good alignment is needed, which is hard to obtain.
- The dynamic programming algorithms are relatively slow: they have a time complexity of O(N^3) with respect to the length of the alignment.

Conclusion: Possible improvements
- Profile SCFGs and covariance models predict secondary structure at the same time as making alignments
- Modeling base stacking
- The evolutionary model: reduce the number of parameters for the rate matrix

Thank you! Have a nice weekend.