Improving the prediction of RNA secondary structure by detecting and assessing conserved stems Xiaoyong Fang, et al.



Outline
Background
Related work
Methods
Results
Discussion

Background
Non-coding RNAs have attracted increasing interest since a wide variety of functions associated with them were discovered.
The function of an RNA molecule is principally determined by its (secondary) structure.
Unfortunately, the physical methods currently available for structure determination are time-consuming and expensive.

Background (cont.)
The prediction of RNA secondary structure can be improved by incorporating comparative analysis of homologous sequences.
However, most existing comparative methods are vulnerable to alignment errors and therefore have low accuracy in practice.
Methods that perform sequence alignment and structure prediction simultaneously are too computationally expensive to be practical.

Related work
Zuker, et al.: Mfold, 1981, 2003
Knudsen, B. and J. Hein: Pfold, 1999, 2003
Hofacker, I., M. Fekete, and P. Stadler: RNAalifold, 2002
Ruan, J., et al.: 2004
Knight, R., et al.: 2004
Weinberg, Z. and W.L. Ruzzo: 2004

Related work (cont.)
Sankoff, D.: 1985
Hofacker, I., S. Bernhart, and P. Stadler: 2004
Mathews, D. and D. Turner: 2002
Mathews, D.: 2005
... etc.

Method
1) We detect possible stems in a single RNA sequence using a so-called position matrix, which exposes positions that may be paired.
2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices.
3) We assess the conserved stems using the Signal-to-Noise.
4) We compute the optimized secondary structure by combining the so-called reliable conserved stems with the predictions of the RNAalifold program.

Detecting possible stems in a single RNA sequence
Given an RNA sequence Seq of length N, we build one N×N position matrix (denoted M_Seq) as follows:
1) The reverse complement of Seq, denoted Seq', is computed from the original sequence.
2) We build one N×N matrix whose first row is Seq'. The i-th row (0 ≤ i ≤ N-1) contains the sequence obtained from Seq' by a circular left shift of i positions.

Detecting possible stems in a single RNA sequence (cont. 1)
3) The position matrix M_Seq is computed by comparing Seq with the matrix from step 2) row by row. Element (i, j) of M_Seq (0 ≤ i, j ≤ N-1) is set to 1, 0 or -1 by comparing the j-th character of Seq with element (i, j) of the matrix from step 2).
When two characters b and b' are compared, the following rules apply:
If b equals b' and neither of them is '_', then 1 is assigned.
If b or b' is '_', or both of them are '_', then -1 is assigned.
If b does not equal b' and neither of them is '_', then 0 is assigned.
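Steps 1) to 3) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the alphabet A, C, G, U plus the gap symbol '_', Watson-Crick complements only (no G-U wobble), and that a gap is its own "complement". The function names are made up for this sketch.

```python
# Assumption: Watson-Crick complements only; '_' (gap) maps to itself.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G", "_": "_"}


def reverse_complement(seq):
    """Step 1: reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def compare(b, b2):
    """Comparison rules from the slide: -1 for gaps, 1 for equal, 0 otherwise."""
    if b == "_" or b2 == "_":
        return -1
    return 1 if b == b2 else 0


def position_matrix(seq):
    """Steps 2 and 3: compare seq with every circular left shift of Seq'."""
    n = len(seq)
    rc = reverse_complement(seq)
    matrix = []
    for i in range(n):
        shifted = rc[i:] + rc[:i]  # circular left shift by i positions
        matrix.append([compare(seq[j], shifted[j]) for j in range(n)])
    return matrix


if __name__ == "__main__":
    for row in position_matrix("GGGAAACCC"):
        print(row)
```

For the toy hairpin GGGAAACCC, row 0 of the matrix is [1, 1, 1, 0, 0, 0, 1, 1, 1]: the runs of 1 mark the two strands of the GGG/CCC stem.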

Detecting possible stems in a single RNA sequence (cont. 2)
We can detect all possible stems in an RNA sequence by scanning the position matrix row by row. The key is to find all zones of consecutive 1s in the matrix. There is a one-to-one mapping between the stems in the sequence and the zones of consecutive 1s in the position matrix.
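The row-by-row scan for zones of consecutive 1s can be sketched as below. The minimum run length `min_len` is a hypothetical parameter (the slides do not state a minimum stem length), and treating both 0 and -1 as run terminators is an assumption of this sketch.

```python
def find_runs(row, min_len=2):
    """Return inclusive (start, end) column ranges of consecutive 1s in a row.

    min_len is a hypothetical cutoff; shorter runs are ignored.
    """
    runs = []
    start = None
    for j, value in enumerate(row):
        if value == 1:
            if start is None:
                start = j  # a new run begins here
        elif start is not None:
            if j - start >= min_len:
                runs.append((start, j - 1))
            start = None
    # Close a run that reaches the end of the row.
    if start is not None and len(row) - start >= min_len:
        runs.append((start, len(row) - 1))
    return runs


def find_zones(matrix, min_len=2):
    """Scan the matrix row by row and return (row, start, end) for every run."""
    zones = []
    for i, row in enumerate(matrix):
        for start, end in find_runs(row, min_len):
            zones.append((i, start, end))
    return zones
```

Each returned triple identifies the shift (row) and the column range of one candidate stem strand.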

Example

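Detecting conserved stems across multiple sequences
Multiplication of the position matrices is element-wise: M[i, j] = M1[i, j] × M2[i, j], with the following rules:
0×0 = 0; 0×1 = 0; 0×(-1) = 0;
1×1 = 1; 1×(-1) = 1;
(-1)×(-1) = -1.
Note that 1×(-1) = 1 and (-1)×(-1) = -1, so × here is a rule table rather than ordinary arithmetic multiplication: a gap does not break a match, and only two gaps yield a gap.
M = [example matrix not preserved]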
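The multiplication rules can be written as a small lookup, sketched below. The slide lists each pair of operands in one order only, so this sketch assumes the operation is commutative; the function names are illustrative.

```python
def combine(a, b):
    """Element rule from the slide (assumed commutative).

    0 with anything -> 0; 1 with 1 or -1 -> 1; -1 with -1 -> -1.
    """
    if a == 0 or b == 0:
        return 0
    if a == 1 or b == 1:
        return 1
    return -1


def multiply(m1, m2):
    """Element-wise product of two position matrices of equal shape."""
    return [[combine(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
```

Because a gap (-1) combined with a match (1) stays a match, a run of 1s in one sequence can survive a gapped column in another.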

Detecting conserved stems across multiple sequences (cont. 1)
We detect conserved stems shared by all sequences in an alignment as follows:
1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above.
2) We select n stems from the n sequences, one stem per sequence, where n is the number of sequences in the alignment.
3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the multiplication rules given above.

Detecting conserved stems across multiple sequences (cont. 2)
4) We detect conserved stems by finding zones of consecutive 1s in the matrix M produced by step 3). There is again a one-to-one mapping between the conserved stems and the zones of consecutive 1s in M.
5) We repeat steps 2) to 4) until all the stems detected in step 1) have been selected.
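The selection loop in steps 2) to 5) can be sketched as an enumeration over one stem per sequence. This is one possible reading: the slides do not say whether every combination is tried, so the exhaustive `itertools.product` below is an assumption, as is the requirement that all stem matrices share the same shape. The `combine` and `multiply` helpers repeat the multiplication rules so the block is self-contained.

```python
from functools import reduce
from itertools import product


def combine(a, b):
    """Element rule from the slides (assumed commutative)."""
    if a == 0 or b == 0:
        return 0
    if a == 1 or b == 1:
        return 1
    return -1


def multiply(m1, m2):
    """Element-wise product of two position matrices of equal shape."""
    return [[combine(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def conserved_matrices(stem_matrices_per_sequence):
    """Yield the product matrix M for every choice of one stem per sequence.

    stem_matrices_per_sequence[k] is the list of position matrices built for
    the stems of sequence k (all matrices are assumed to have the same shape).
    """
    for selection in product(*stem_matrices_per_sequence):
        yield reduce(multiply, selection)
```

Each yielded matrix would then be scanned for zones of consecutive 1s to read off the conserved stems. The number of combinations grows multiplicatively with the number of stems per sequence, so a real implementation would likely need pruning.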

Detecting conserved stems across multiple sequences (cont. 3)
Note: Multiplication of the position matrices could be applied directly to the RNA alignment to detect conserved stems. Instead, we first extract all sequences from the alignment and then detect the conserved stems they share. The benefit is that conserved stems that would otherwise be missed because of alignment errors can also be detected.

Example Figure 2

Assessing the conserved stems using the Signal-to-Noise
The main steps are:
1) We record the number of column pairs (see Figure 3) in the conserved stem as the Signal.
2) We generate a randomized alignment from the original alignment of the conserved stem.
3) We detect possible conserved stems in the new alignment using the approach described above.

Assessing the conserved stems using the Signal-to-Noise (cont.)
4) We record the number of column pairs in the conserved stem newly detected in step 3) as the Noise. The Signal-to-Noise is then computed as Signal / Noise.
Note that we set Noise to 1 if there are no conserved stems in the randomized alignment, and we set Noise to the maximum if more than one conserved stem is found.
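A small sketch of the Signal-to-Noise computation follows. Two points are assumptions rather than statements from the slides: the randomized alignment is produced here by shuffling whole columns (the slides do not say how the alignment is randomized), and "the maximum" is read as the largest column-pair count among the stems found in the randomized alignment. All names are illustrative.

```python
import random


def shuffle_columns(alignment, rng):
    """Return a copy of the alignment with its columns in random order.

    Assumption: column shuffling stands in for the unspecified randomization.
    alignment is a list of equal-length strings; rng is a random.Random.
    """
    columns = list(zip(*alignment))
    rng.shuffle(columns)
    return ["".join(row) for row in zip(*columns)]


def signal_to_noise(signal, noise_sizes):
    """Compute Signal / Noise.

    signal: number of column pairs in the conserved stem.
    noise_sizes: column-pair counts of stems found in the randomized
    alignment. Noise is 1 if none were found, else the largest count
    (an interpretation of "the maximum" on the slide).
    """
    noise = max(noise_sizes) if noise_sizes else 1
    return signal / noise


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(shuffle_columns(["GGGAAACCC", "GGCAAAGCC"], rng))
    print(signal_to_noise(6, [2, 3]))
```

Setting Noise to 1 when the randomized alignment contains no stem avoids division by zero and makes the ratio equal to the Signal itself.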

Example Figure 3

Computing the optimized secondary structure
The main steps are:
1) We compute a candidate secondary structure using RNAalifold, a widely used program for predicting RNA secondary structure from alignments.
2) We extract all sequences from the RNA alignment and detect all possible conserved stems shared by them.
3) We compute the Signal-to-Noise for each conserved stem.
4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise.

Computing the optimized secondary structure (cont. 1)
Here, reliable conserved stems refer to the following two kinds of conserved stems:
Conserved stems with a very high Signal-to-Noise. Our method treats them as belonging to a real RNA secondary structure.
Conserved stems with a very low Signal-to-Noise. Our method treats them as not belonging to any real RNA secondary structure.
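The two kinds of reliable stems amount to a three-way classification, sketched below. The slides give no numeric cutoffs for "very high" and "very low", so `high` and `low` are hypothetical thresholds that the caller must supply, and the labels are made up for this sketch.

```python
def classify_stem(sn, high, low):
    """Classify a conserved stem by its Signal-to-Noise value.

    high and low are hypothetical cutoffs (not given in the slides):
    sn >= high -> "accept"    (first kind: treated as real)
    sn <= low  -> "reject"    (second kind: treated as not real)
    otherwise  -> "undecided" (not a reliable stem either way)
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    if sn >= high:
        return "accept"
    if sn <= low:
        return "reject"
    return "undecided"
```

Stems labelled "undecided" are left to the RNAalifold prediction.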

Computing the optimized secondary structure (cont. 2)
5) We add to the candidate secondary structure the base pairs that belong to the first kind of reliable conserved stems but are missing from the RNAalifold prediction.
6) We remove from the candidate secondary structure the base pairs that belong to both the second kind of reliable conserved stems and the RNAalifold prediction.
7) We compute the final secondary structure by merging the results of steps 5) and 6).
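Steps 5) to 7) reduce to set operations on base pairs, as in the sketch below. Representing a base pair as an (i, j) tuple and a stem as a list of such tuples is an assumption of this sketch, as are the names; it also assumes incompatible stems were already removed in step 4), so no pair appears in both kinds of stems.

```python
def refine_structure(candidate_pairs, high_sn_stems, low_sn_stems):
    """Merge reliable conserved stems into a candidate structure.

    candidate_pairs: base pairs (i, j) predicted by RNAalifold.
    high_sn_stems: stems (lists of pairs) with very high Signal-to-Noise.
    low_sn_stems: stems (lists of pairs) with very low Signal-to-Noise.
    """
    pairs = set(candidate_pairs)
    for stem in high_sn_stems:
        pairs |= set(stem)  # step 5: add pairs missing from the candidate
    for stem in low_sn_stems:
        pairs -= set(stem)  # step 6: drop pairs judged unreliable
    return pairs  # step 7: the merged result


if __name__ == "__main__":
    candidate = {(0, 8), (1, 7)}
    final = refine_structure(candidate, [[(2, 6)]], [[(0, 8)]])
    print(sorted(final))
```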

Results
We construct one data set of three-sequence alignments, one data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 8.0.
We compute the sensitivity as the number of true positives divided by the sum of true positives and false negatives, and the specificity as the number of true positives divided by the sum of true positives and false positives. (Defined this way, specificity is the quantity also known as positive predictive value.)
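The two measures can be computed from sets of base pairs as sketched below. Counting true and false positives per base pair, and returning 0.0 when a denominator is zero, are assumptions of this sketch.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) as defined on the slide.

    sensitivity = TP / (TP + FN)
    specificity = TP / (TP + FP)
    predicted and reference are collections of base pairs (i, j).
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # pairs in both
    fn = len(reference - predicted)  # true pairs that were missed
    fp = len(predicted - reference)  # predicted pairs that are wrong
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return se, sp
```

For example, with 2 true positives, 2 false negatives and 1 false positive, the sensitivity is 2/4 and the specificity is 2/3.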

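Results (cont. 1)
Table 1 - Sensitivity and specificity on the data set of three-sequence alignments.
Id, percentage identity; Se, sensitivity; Sp, specificity.
Columns: Id (%) | Se (%) | Sp (%) | Se.RNAalifold (%) | Sp.RNAalifold (%)
[Table values not preserved; the rows are identity ranges followed by a Total row.]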

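Results (cont. 2)
Table 2 - Sensitivity and specificity on the data set of four-sequence alignments.
Id, percentage identity; Se, sensitivity; Sp, specificity.
Columns: Id (%) | Se (%) | Sp (%) | Se.RNAalifold (%) | Sp.RNAalifold (%)
[Table values not preserved; the rows are identity ranges followed by a Total row.]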

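Results (cont. 3)
Table 3 - Sensitivity and specificity on the data set of five-sequence alignments.
Id, percentage identity; Se, sensitivity; Sp, specificity.
Columns: Id (%) | Se (%) | Sp (%) | Se.RNAalifold (%) | Sp.RNAalifold (%)
[Table values not preserved; the rows are identity ranges followed by a Total row.]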

Discussion
In this paper, we present a stem-based method to improve the prediction of RNA secondary structure. The central idea is to detect and assess conserved stems shared by all sequences in the input RNA alignment, and then to compute the optimized secondary structure by combining the reliable conserved stems with the predictions of RNAalifold.

Discussion (cont. 1)
Our method improves on RNAalifold in two main ways:
1) It adds conserved stems that RNAalifold may have missed, which improves the sensitivity.
2) It removes stems that RNAalifold may have predicted by mistake, which improves the specificity.
We tested our method on data sets of RNA alignments taken from the Rfam database. The experimental results suggest that our method predicts RNA secondary structure with much better performance than RNAalifold.

Discussion (cont. 2)
In future work, our method can be improved by:
1) Computing the Signal-to-Noise in more effective ways.
2) Determining the reliable conserved stems with more intelligent approaches.
3) Devising new parallel algorithms to speed up the method.
4) Running further benchmarks to evaluate the performance of the method.

Thank you very much! Please contact for any