RNA Secondary Structure

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

(Figure: elements of RNA secondary structure – hairpin loops, stems, bulge loops, interior loops, multi-branched loops.)

Context-Free Grammars and RNAs

  S  → a W1 u
  W1 → c W2 g
  W2 → g W3 c
  W3 → g L c
  L  → agucg

(Figure: the stem-loop generated by this grammar – stem ACGG paired with UGCC, loop AGUCG.)

What if the stem loop can have other letters in place of the ones shown?

The Nussinov Algorithm and CFGs

Define the following grammar, with scores:

  S → a S u : 3  |  u S a : 3
    | g S c : 2  |  c S g : 2
    | g S u : 1  |  u S g : 1
    | S S : 0  |  a S : 0  |  c S : 0  |  g S : 0  |  u S : 0  |  ε : 0

Note: ε is the empty string.

Then, the Nussinov algorithm finds the optimal parse of a string with this grammar.

The Nussinov Algorithm

Initialization:
  F(i, i-1) = 0  for i = 2 to N
  F(i, i)   = 0  for i = 1 to N          [S → a | c | g | u]

Iteration:
  For l = 2 to N:
    For i = 1 to N – l
      j = i + l – 1
      F(i, j) = max of:
        F(i+1, j-1) + s(x_i, x_j)                     [S → a S u | …]
        max_{i ≤ k < j} F(i, k) + F(k+1, j)           [S → S S]

Termination:
  Best structure is given by F(1, N)
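
A minimal Python sketch of this recurrence, using the pairing scores from the scored grammar above (a–u: 3, g–c: 2, g–u: 1). The function name, the 0-based indexing, and the zero score for non-pairing bases are illustrative assumptions, not part of the slides.

# Pairing scores taken from the scored grammar above
PAIR_SCORE = {
    ("a", "u"): 3, ("u", "a"): 3,
    ("g", "c"): 2, ("c", "g"): 2,
    ("g", "u"): 1, ("u", "g"): 1,
}

def nussinov_score(x):
    """Return F(1, N): the score of the best structure of x under the recurrence above."""
    n = len(x)
    F = [[0] * n for _ in range(n)]      # F[i][j], 0-based, inclusive
    for l in range(2, n + 1):            # subsequence length
        for i in range(n - l + 1):
            j = i + l - 1
            # Case 1: pair x_i with x_j (S -> a S u, ...); non-pairing bases score 0
            inner = F[i + 1][j - 1] if i + 1 <= j - 1 else 0
            best = inner + PAIR_SCORE.get((x[i], x[j]), 0)
            # Case 2: bifurcation (S -> S S)
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F[0][n - 1] if n else 0

print(nussinov_score("aagacuucggaucuggcgacaccc"))   # first sequence from the title slide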

Stochastic Context-Free Grammars

In analogy to HMMs, we can assign probabilities to transitions:

Given grammar
  X_1 → s_11 | … | s_1n
  …
  X_m → s_m1 | … | s_mn

we can assign a probability to each rule, such that
  P(X_i → s_i1) + … + P(X_i → s_in) = 1
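
As a toy illustration of this constraint, the snippet below builds a made-up rule table for a single nonterminal S and checks that each left-hand side's production probabilities sum to 1. The rule set and the numbers are invented purely for illustration.

from collections import defaultdict

rule_prob = {
    ("S", "a S u"): 0.20, ("S", "u S a"): 0.20,
    ("S", "g S c"): 0.15, ("S", "c S g"): 0.15,
    ("S", "S S"):   0.10,
    ("S", "a S"):   0.05, ("S", "c S"): 0.05,
    ("S", "g S"):   0.05, ("S", "u S"): 0.05,
}

totals = defaultdict(float)
for (lhs, rhs), p in rule_prob.items():
    totals[lhs] += p
# Each nonterminal's production probabilities must sum to 1
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())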

Computational Problems

– Calculate an optimal alignment of a sequence and a SCFG (DECODING)
– Calculate Prob[ sequence | grammar ] (EVALUATION)
– Given a set of sequences, estimate the parameters of a SCFG (LEARNING)

Evaluation

Recall HMMs:
  Forward:   f_l(i) = P(x_1…x_i, π_i = l)
  Backward:  b_k(i) = P(x_{i+1}…x_N | π_i = k)
  Then, P(x) = Σ_k f_k(N) a_k0 = Σ_l a_0l e_l(x_1) b_l(1)

Analogue in SCFGs:
  Inside:   a(i, j, V) = P(x_i…x_j is generated by nonterminal V)
  Outside:  b(i, j, V) = P(x, excluding x_i…x_j, is generated by S and the excluded part is rooted at V)

Normal Forms for CFGs

Chomsky Normal Form:
  X → Y Z
  X → a

All productions are either to 2 nonterminals, or to 1 terminal.

Theorem (technical): Every CFG has an equivalent one in Chomsky Normal Form (that is, the grammar in normal form produces exactly the same set of strings).

Example of converting a CFG to C.N.F.

Original grammar:
  S → ABC
  A → Aa | a
  B → Bb | b
  C → CAc | c

Converting:
  S  → AS'
  S' → BC
  A  → AA | a
  B  → BB | b
  C  → DC' | c
  C' → c
  D  → CA

(Figure: parse trees of the same string under the original grammar and under the converted grammar.)

Another example

Original grammar:
  S → ABC
  A → C | aA
  B → bB | b
  C → cCd | c

Converting:
  S   → AS'
  S'  → BC
  A   → C'C'' | c | A'A
  A'  → a
  B   → B'B | b
  B'  → b
  C   → C'C'' | c
  C'  → c
  C'' → CD
  D   → d

The Inside Algorithm

To compute a(i, j, V) = P(x_i…x_j is produced by V):

  a(i, j, V) = Σ_X Σ_Y Σ_k a(i, k, X) a(k+1, j, Y) P(V → XY)

(Figure: V spans x_i…x_j and splits at position k into X over x_i…x_k and Y over x_{k+1}…x_j.)

Algorithm: Inside

Initialization:
  For i = 1 to N, V a nonterminal:
    a(i, i, V) = P(V → x_i)

Iteration:
  For i = 1 to N-1
    For j = i+1 to N
      For V a nonterminal:
        a(i, j, V) = Σ_X Σ_Y Σ_{i ≤ k < j} a(i, k, X) a(k+1, j, Y) P(V → XY)

Termination:
  P(x | θ) = a(1, N, S)
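
A minimal Python sketch of the Inside algorithm for a SCFG in Chomsky Normal Form. The dictionary-based grammar representation (emit_prob for rules V → a, rule_prob for rules V → X Y) is an assumption chosen for readability, not a specific library API.

def inside(x, nonterminals, emit_prob, rule_prob):
    """
    x            : observed string x[0..N-1] (0-based)
    nonterminals : list of nonterminal symbols
    emit_prob    : emit_prob[(V, a)]    = P(V -> a)
    rule_prob    : rule_prob[(V, X, Y)] = P(V -> X Y)
    Returns a dict a with a[(i, j, V)] = P(x_i..x_j is generated by V).
    """
    N = len(x)
    a = {}
    # Initialization: a(i, i, V) = P(V -> x_i)
    for i in range(N):
        for V in nonterminals:
            a[(i, i, V)] = emit_prob.get((V, x[i]), 0.0)
    # Iteration, by increasing subsequence length
    for length in range(2, N + 1):
        for i in range(N - length + 1):
            j = i + length - 1
            for V in nonterminals:
                total = 0.0
                for (W, X, Y), p in rule_prob.items():
                    if W != V:
                        continue
                    for k in range(i, j):
                        total += a[(i, k, X)] * a[(k + 1, j, Y)] * p
                a[(i, j, V)] = total
    return a

# Termination: P(x | theta) = a(1, N, S), i.e. a[(0, len(x) - 1, "S")] here.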

The Outside Algorithm

  b(i, j, V) = P(x_1…x_{i-1}, x_{j+1}…x_N, where the "gap" x_i…x_j is rooted at V)

Given that V is the right-hand-side nonterminal of a production,

  b(i, j, V) = Σ_X Σ_Y Σ_{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)

(Figure: Y spans x_k…x_j, with X over x_k…x_{i-1} and V over the gap x_i…x_j.)

Algorithm: Outside

Initialization:
  b(1, N, S) = 1
  For any other V, b(1, N, V) = 0

Iteration:
  For i = 1 to N-1
    For j = N down to i
      For V a nonterminal:
        b(i, j, V) = Σ_X Σ_Y Σ_{k < i} a(k, i-1, X) b(k, j, Y) P(Y → XV)
                   + Σ_X Σ_Y Σ_{k > j} a(j+1, k, X) b(i, k, Y) P(Y → VX)

Termination:
  For any i, P(x | θ) = Σ_X b(i, i, X) P(X → x_i)
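
A matching sketch of the Outside algorithm, reusing the inside table a returned by the sketch above and the same assumed dictionary representation of the grammar.

def outside(x, nonterminals, start, rule_prob, a):
    """Returns b with b[(i, j, V)] = P(x_1..x_{i-1}, x_{j+1}..x_N, gap rooted at V)."""
    N = len(x)
    b = {(i, j, V): 0.0 for i in range(N) for j in range(i, N) for V in nonterminals}
    b[(0, N - 1, start)] = 1.0                       # Initialization
    # Iterate from the longest proper spans down to single positions
    for length in range(N - 1, 0, -1):
        for i in range(N - length + 1):
            j = i + length - 1
            for V in nonterminals:
                total = 0.0
                for (Y, L, R), p in rule_prob.items():
                    if R == V:                       # Y -> L V : V is the right child
                        for k in range(0, i):
                            total += a[(k, i - 1, L)] * b[(k, j, Y)] * p
                    if L == V:                       # Y -> V R : V is the left child
                        for k in range(j + 1, N):
                            total += a[(j + 1, k, R)] * b[(i, k, Y)] * p
                b[(i, j, V)] = total
    return b

# Termination check: for any i, P(x | theta) = sum over X of b[(i, i, X)] * P(X -> x_i).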

Learning for SCFGs

We can now estimate:

  c(V) = expected number of times V is used in the parse of x_1…x_N

  c(V) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i ≤ j ≤ N} a(i, j, V) b(i, j, V)

  c(V → XY) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i < j ≤ N} Σ_{i ≤ k < j} b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → XY)

Learning for SCFGs

Then, we can re-estimate the parameters with EM, by:

  P_new(V → XY) = c(V → XY) / c(V)

  P_new(V → a) = c(V → a) / c(V)
               = Σ_{i: x_i = a} b(i, i, V) P(V → a)  /  Σ_{1 ≤ i ≤ N} Σ_{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
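
A sketch of one EM re-estimation step on a single training string, plugging the expected counts above into the update rules. It reuses the inside table a, the outside table b, and the value px = P(x | θ) from the earlier sketches; all names are illustrative assumptions.

def em_step(x, nonterminals, rule_prob, emit_prob, a, b, px):
    """Return re-estimated (rule_prob, emit_prob) from one training string x."""
    N = len(x)
    # c(V) = (1 / P(x|theta)) * sum over i <= j of a(i, j, V) b(i, j, V)
    cV = {V: sum(a[(i, j, V)] * b[(i, j, V)]
                 for i in range(N) for j in range(i, N)) / px
          for V in nonterminals}
    new_rule, new_emit = {}, {}
    for (V, X, Y), p in rule_prob.items():
        # c(V -> XY) = (1/px) * sum over i < j, i <= k < j of b(i,j,V) a(i,k,X) a(k+1,j,Y) P(V -> XY)
        c_rule = sum(b[(i, j, V)] * a[(i, k, X)] * a[(k + 1, j, Y)]
                     for i in range(N) for j in range(i + 1, N)
                     for k in range(i, j)) * p / px
        new_rule[(V, X, Y)] = c_rule / cV[V] if cV[V] > 0 else p
    for (V, ch), p in emit_prob.items():
        # c(V -> a) = (1/px) * sum over i with x_i = a of b(i, i, V) P(V -> a)
        c_emit = sum(b[(i, i, V)] for i in range(N) if x[i] == ch) * p / px
        new_emit[(V, ch)] = c_emit / cV[V] if cV[V] > 0 else p
    return new_rule, new_emit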

Decoding: the CYK algorithm

Given x = x_1…x_N and a SCFG G, find the most likely parse of x (the most likely alignment of G to x).

Dynamic programming variable:
  γ(i, j, V): likelihood of the most likely parse of x_i…x_j, rooted at nonterminal V

Then, γ(1, N, S): likelihood of the most likely parse of x by the grammar.

The CYK algorithm (Cocke-Younger-Kasami)

Initialization:
  For i = 1 to N, any nonterminal V:
    γ(i, i, V) = log P(V → x_i)

Iteration:
  For i = 1 to N-1
    For j = i+1 to N
      For any nonterminal V:
        γ(i, j, V) = max_X max_Y max_{i ≤ k < j} γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY)

Termination:
  log P(x | θ, π*) = γ(1, N, S)

where π* is the optimal parse tree (if traced back appropriately from above).
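
A Python sketch of CYK for a CNF SCFG in log space, mirroring the inside sketch but with max in place of sum and a traceback pointer; the grammar representation is the same assumed one.

import math

def cyk(x, nonterminals, start, emit_prob, rule_prob):
    """Return (log-likelihood of the most likely parse, traceback pointers)."""
    N = len(x)
    NEG = float("-inf")
    g = {}      # g[(i, j, V)] = log-likelihood of the best parse of x_i..x_j rooted at V
    ptr = {}    # ptr[(i, j, V)] = (k, X, Y) giving the best split and children
    for i in range(N):
        for V in nonterminals:
            p = emit_prob.get((V, x[i]), 0.0)
            g[(i, i, V)] = math.log(p) if p > 0 else NEG
    for length in range(2, N + 1):
        for i in range(N - length + 1):
            j = i + length - 1
            for V in nonterminals:
                best, arg = NEG, None
                for (W, X, Y), p in rule_prob.items():
                    if W != V or p == 0:
                        continue
                    for k in range(i, j):
                        s = g[(i, k, X)] + g[(k + 1, j, Y)] + math.log(p)
                        if s > best:
                            best, arg = s, (k, X, Y)
                g[(i, j, V)] = best
                ptr[(i, j, V)] = arg
    return g[(0, N - 1, start)], ptr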

Summary: SCFG and HMM algorithms

  GOAL                 HMM algorithm       SCFG algorithm
  Optimal parse        Viterbi             CYK
  Estimation           Forward             Inside
                       Backward            Outside
  Learning             EM: Fw/Bck          EM: Ins/Outs
  Memory complexity    O(N K)              O(N^2 K)
  Time complexity      O(N K^2)            O(N^3 K^3)

where K = number of states in the HMM = number of nonterminals in the SCFG.

A SCFG for predicting RNA structure

  S → a S | c S | g S | u S | ε
    | S a | S c | S g | S u
    | a S u | c S g | g S u | u S g | g S c | u S a
    | S S

Adjust the probability parameters to reflect the relative strength/weakness of bonds, etc.

Note: this algorithm does not model loop size!

CYK for RNA folding

Can do faster than O(N^3 K^3):

Initialization:
  γ(i, i-1) = -∞
  γ(i, i) = log P(x_i S)

Iteration:
  For i = 1 to N-1
    For j = i+1 to N
      γ(i, j) = max of:
        γ(i+1, j-1) + log P(x_i S x_j)
        γ(i+1, j)   + log P(x_i S)
        γ(i, j-1)   + log P(S x_j)
        max_{i < k < j} γ(i, k) + γ(k+1, j) + log P(S S)
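
A minimal Python sketch of this single-nonterminal recursion in log space. The dictionaries pair_prob, left_prob, right_prob and ss_prob stand for the grammar parameters P(x_i S x_j), P(x_i S), P(S x_j) and P(S S); they are assumed names for illustration, not from the slides.

from math import log, inf

def rna_cyk(x, pair_prob, left_prob, right_prob, ss_prob):
    """gamma[i][j] = log-likelihood of the best parse of x[i..j] (0-based, inclusive)."""
    n = len(x)
    NEG = -inf
    gamma = [[NEG] * n for _ in range(n)]
    for i in range(n):
        gamma[i][i] = log(left_prob[x[i]])          # gamma(i, i) = log P(x_i S)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            inner = gamma[i + 1][j - 1] if i + 1 <= j - 1 else NEG   # gamma(i, i-1) = -inf
            best = NEG
            if (x[i], x[j]) in pair_prob:                            # x_i pairs with x_j
                best = inner + log(pair_prob[(x[i], x[j])])
            best = max(best, gamma[i + 1][j] + log(left_prob[x[i]]))   # x_i unpaired
            best = max(best, gamma[i][j - 1] + log(right_prob[x[j]]))  # x_j unpaired
            for k in range(i + 1, j):                                # bifurcation S -> S S
                best = max(best, gamma[i][k] + gamma[k + 1][j] + log(ss_prob))
            gamma[i][j] = best
    return gamma[0][n - 1]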

The Zuker algorithm – main ideas

Models the energy of an RNA fold:
1. Instead of single base pairs, scores pairs of adjacent base pairs (more accurate)
2. Separate score for bulges
3. Separate score for loops of different size and composition
4. Separate score for interactions between a stem and the beginning of a loop

Can also do all of that with a SCFG, and train it on real data.

Methods for inferring RNA fold

Experimental:
 – Crystallography
 – NMR

Computational:
 – Fold prediction (Nussinov, Zuker, SCFGs)
 – Multiple alignment

Multiple alignment and RNA folding

Given K homologous aligned RNA sequences:

  Human  aagacuucggaucuggcgacaccc
  Mouse  uacacuucggaugacaccaaagug
  Worm   aggucuucggcacgggcaccauuc
  Fly    ccaacuucggauuuugcuaccaua
  Orc    aagccuucggagcgggcguaacuc

If the i-th and j-th positions are always base-paired and covary, then they are likely to be paired.

Mutual information

  M_ij = Σ_{a,b ∈ {a,c,g,u}} f_ab(i, j) log2 [ f_ab(i, j) / (f_a(i) f_b(j)) ]

where f_ab(i, j) is the fraction of aligned sequences in which positions i, j hold the pair a, b.

Given a multiple alignment, we can infer the structure that maximizes the sum of mutual information, by DP.

In practice:
1. Get a multiple alignment
2. Find covarying bases – deduce structure
3. Improve the multiple alignment (by hand)
4. Go to 2

A manual EM process!
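
A small Python sketch of the mutual-information calculation above, for one column pair of the alignment on the previous slide; the function name and the 0-based column indices are illustrative assumptions.

from math import log2
from collections import Counter

def mutual_information(alignment, i, j):
    """M_ij for columns i and j (0-based) of an aligned set of RNA sequences."""
    n = len(alignment)
    pair_freq = Counter((s[i], s[j]) for s in alignment)   # counts for f_ab(i, j)
    fi = Counter(s[i] for s in alignment)                  # counts for f_a(i)
    fj = Counter(s[j] for s in alignment)                  # counts for f_b(j)
    m = 0.0
    for (a, b), count in pair_freq.items():
        fab = count / n
        m += fab * log2(fab / ((fi[a] / n) * (fj[b] / n)))
    return m

aln = ["aagacuucggaucuggcgacaccc",
       "uacacuucggaugacaccaaagug",
       "aggucuucggcacgggcaccauuc",
       "ccaacuucggauuuugcuaccaua",
       "aagccuucggagcgggcguaacuc"]
print(mutual_information(aln, 0, 23))   # covariation between the first and last columns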

Current state, future work

– The Zuker folding algorithm can predict good folded structures.
– To detect RNAs in a genome, one can ask whether a given sequence folds well – not very reliable.
– For tRNAs (small, typically ~60 nt, with a well-conserved structure), a covariance model of tRNA (like a SCFG) detects them well.
– It is difficult to make efficient probabilistic models of larger RNAs.
– It is not known how to efficiently do folding and multiple alignment simultaneously.