RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar PMSB2006, June 18, Tuusula, Finland Yuki Kato, Hiroyuki.


1 RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar PMSB2006, June 18, Tuusula, Finland Yuki Kato, Hiroyuki Seki and Tadao Kasami Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)

3 Table of Contents
Background: Grammatical approach to RNA structure modeling
Model: Stochastic multiple context-free grammar
Algorithms: Parsing and parameter estimation
Experimental results: RNA pseudoknot prediction
Summary

4 RNA Secondary Structure: Stem-Loop
Complementary base pairs (A-U, G-C) form a stem that closes a loop.
Connecting the base pairs with arcs gives a nested pattern.
[Figure: stem-loop drawing and arc diagram of an example sequence]

5 Modeling RNA Secondary Structure by Context-Free Grammar (CFG)
RNA secondary structure can be modeled by the parse structure of a CFG, so structure prediction corresponds to parsing.
[Figure: example CFG rules, a secondary structure of the sequence uucaucagaa, and its derivation tree]

6 RNA Secondary Structure: Pseudoknot
CFGs cannot represent pseudoknots.
Connecting the base pairs of a pseudoknot with arcs gives a crossed pattern.
[Figure: pseudoknot drawing and arc diagram of an example sequence]
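
The nested versus crossed distinction can be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes structures are written as balanced bracket strings using `()` and `[]` (the notation Rfam-style annotations commonly use), and the helper names `base_pairs` and `has_crossing` are hypothetical.

```python
def base_pairs(structure):
    """Return sorted (i, j) index pairs from a balanced bracket string.

    Assumption: '(' pairs with ')' and '[' pairs with ']'; every other
    character marks an unpaired position.
    """
    closers = {")": "(", "]": "["}
    stacks = {"(": [], "[": []}
    pairs = []
    for pos, ch in enumerate(structure):
        if ch in stacks:
            stacks[ch].append(pos)
        elif ch in closers:
            pairs.append((stacks[closers[ch]].pop(), pos))
    return sorted(pairs)


def has_crossing(pairs):
    """True if two arcs (i, j) and (k, l) interleave as i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


STEM_LOOP = "((...))"          # nested arcs only
PSEUDOKNOT = "[[..((..]]..))"  # the [ ] arcs cross the ( ) arcs
```

Here `has_crossing(base_pairs(STEM_LOOP))` is false and `has_crossing(base_pairs(PSEUDOKNOT))` is true; the second case is the kind of structure a CFG cannot represent.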

7 Early Studies
 | Brown & Wilson / Cai et al. | Rivas & Eddy / Uemura et al. | Matsui et al.
Generative power | Based on CFG | > CFG | > CFG
Metric for optimization | Stochastic grammar | Free energy | Stochastic grammar
Time complexity | O(n^3) / O(n^6) | O(n^6) / O(n^5) | O(n^5)
n: sequence length

8 Early Studies (cont.)
Grammars for fully describing RNA pseudoknots:
SL-TAG and ESL-TAG [Uemura et al., 1999]
RPG [Rivas and Eddy, 2000]
These grammars have been identified as subclasses of multiple context-free grammars [Kato et al., 2005].

9 Motivation
Multiple context-free grammar (MCFG):
Natural extension of CFG
Easy to compare generative power and design algorithms
Generative power to represent pseudoknots
Polynomial time parsing algorithm
We have shown a candidate subclass of the minimum grammars of MCFGs for representing pseudoknots [Kato et al., 2005].

10 What’s New in the Present Work
Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)
Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs
Experiments on RNA pseudoknot prediction

11 Early Studies and Present Work
 | Brown & Wilson / Cai et al. | Rivas & Eddy / Uemura et al. | Matsui et al. | Our model
Generative power | Based on CFG | > CFG | > CFG | > CFG
Metric for optimization | Stochastic grammar | Free energy | Stochastic grammar | Stochastic grammar
Time complexity | O(n^3) / O(n^6) | O(n^6) / O(n^5) | O(n^5) | O(n^5)

13 Relation between SMCFG and Major Probabilistic Models
Grammar (generative power, weak to strong) | Probabilistic extension | Example
FA | HMM | Gene finding
CFG | SCFG | Stem-loop
MCFG | SMCFG | Pseudoknot

14 From HMM to SCFG
 | HMM | SCFG
Right-hand side of a rule | Terminal and nonterminal (“Emit a in A and transit to B”) | Sequence of nonterminals and terminals

15 Stochastic Multiple Context-Free Grammar (SMCFG)
G = (N, T, F, P, S)
N: finite set of nonterminals
T: finite set of terminals
F: finite set of functions
P: finite set of rules with probabilities
S ∈ N: start symbol
 | SCFG | SMCFG
Nonterminal | generates sequence | generates tuple of sequences
Rule | Sequence of nonterminals and terminals | Rule over nonterminals A_0, …, A_k and a function f defined over sequences

16 Functions of SMCFG
Example:

17 Rules of SMCFG
Each rule carries the probability that the rule is applied.
The probabilities of the rules with the same left-hand side should sum to one.
Example:
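
The normalization constraint on rule probabilities is easy to verify in code. This is a minimal sketch under stated assumptions: the rule set, its nonterminal names and its probabilities are hypothetical placeholders rather than the rules shown on the slide, and `lhs_totals` and `is_normalized` are illustrative helper names.

```python
import math
from collections import defaultdict

# Hypothetical rules as (left-hand side, right-hand side label, probability).
# None of these values come from the original slides.
RULES = [
    ("A", "f1[B]", 0.5),
    ("A", "f2[A]", 0.3),
    ("A", "f3", 0.2),
    ("B", "f4[A]", 1.0),
]


def lhs_totals(rules):
    """Sum the rule probabilities per left-hand-side nonterminal."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    return dict(totals)


def is_normalized(rules, tol=1e-9):
    """True if, for every left-hand side, the probabilities sum to one."""
    return all(math.isclose(total, 1.0, abs_tol=tol)
               for total in lhs_totals(rules).values())
```

Dropping any single rule from `RULES` breaks the constraint for its left-hand side, which `is_normalized` reports as false.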

18 Derivation Trees in SMCFG
[Figure: a derivation tree rooted at A with function f, built from subtrees rooted at A_1, …, A_k that have probabilities p_1, …, p_k]

19 Modeling Pseudoknot by SMCFG
UP_{2L}^{a}[(x_1, x_2)] = (x_1, a x_2)
UP_{2R}^{u}[(x_1, x_2)] = (x_1, x_2 u)
[Figure: derivation tree over nonterminals A and B for the tuples (ag, cu), (ag, acu) and (ag, acuu), annotated with probabilities 0.7, 0.35 and 0.28]
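
The two functions on this slide act on pairs of strings, which a short sketch can make concrete. The function definitions follow the slide. The per-rule probabilities 0.7, 0.5 and 0.8 are an assumption, chosen only because their running products reproduce the 0.7, 0.35 and 0.28 shown in the figure. The helper names `up2l`, `up2r` and `derive` are hypothetical.

```python
import math


def up2l(base):
    """UP_2L^base: prepend `base` to the second component of the pair."""
    def apply(pair):
        x1, x2 = pair
        return (x1, base + x2)
    return apply


def up2r(base):
    """UP_2R^base: append `base` to the second component of the pair."""
    def apply(pair):
        x1, x2 = pair
        return (x1, x2 + base)
    return apply


def derive(start, steps):
    """Apply (function, rule probability) steps bottom-up.

    Returns a list of (tuple, cumulative probability), multiplying in one
    rule probability per step.
    """
    current, prob = start
    trace = [(current, prob)]
    for func, rule_prob in steps:
        current = func(current)
        prob *= rule_prob
        trace.append((current, prob))
    return trace


# ASSUMPTION: the rule probabilities 0.7, 0.5 and 0.8 are illustrative.
TRACE = derive((("ag", "cu"), 0.7), [(up2l("a"), 0.5), (up2r("u"), 0.8)])
```

`TRACE` walks through (ag, cu), (ag, acu) and (ag, acuu). The first component is never touched, which is what lets the two components later interleave with other material to form crossing base pairs.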

20 SMCFG for RNA Pseudoknot Modeling
W_1, …, W_m: nonterminals (W_1 is the start symbol)
For each rule, two real values are specified: a transition probability p_1 (0 < p_1 ≤ 1) and an emission probability p_2 (0 < p_2 ≤ 1).
The probability of each rule is defined in terms of p_1 and p_2.

21 SMCFG G_s

23 Algorithms for SMCFG
CYK algorithm: calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).
Inside algorithm: calculates the probability of a sequence given an SMCFG.
Inside-outside algorithm: estimates optimal probability parameters for an SMCFG given a set of example sequences.

24 CYK Algorithm
Input: an input sequence
The following are calculated by dynamic programming:
log maximum probability that W_v generates the given subsequences
log maximum probability that W_y generates the given subsequences

25 CYK Algorithm (cont.)
Output: log maximum probability that W_1 generates the input sequence, maximized over derivation trees under the entire set of probability parameters, together with the most likely derivation tree

26 Algorithm [CYK]
Initialization:
for i←1 to n+1, j←i to n+1, v←1 to m do
if then else // ε: empty sequence
Iteration:
for i←n downto 1, j←i−1 to n, k←n+1 downto j+1, l←k−1 to n, v←1 to m
// Some examples are shown.

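27 Algorithm [CYK] (cont.)
[Figure: iteration case in which W_v is derived from W_y and W_z; positions i, h, h+1, j, k and l mark the subsequences x_1, x_21 and x_22 on the input of length n]

28 Algorithm [CYK] (cont.)
[Figure: iteration case in which W_v is derived from W_y by adding the bases a_i and a_l; positions i+1, j, k and l−1 mark the subsequences x_1 and x_2 on the input of length n]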

29 Complexity of CYK Algorithm
m: # of nonterminals (m = a + b)
n: sequence length
Time complexity: O(amn^4 + bn^5)
Space complexity: O(mn^4)
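
The O(mn^4) space bound follows from the four position indices of the iteration loop. The sketch below only enumerates those indices in the order given in the CYK pseudocode (i from n down to 1, j from i−1 to n, k from n+1 down to j+1, l from k−1 to n); it does not implement the recurrences, which the transcript does not preserve. The function name `cyk_index_order` is hypothetical.

```python
from math import comb


def cyk_index_order(n):
    """Yield (i, j, k, l) in the loop order of the iteration step.

    j = i - 1 and l = k - 1 denote empty subsequences, so each quadruple
    describes two possibly empty, non-overlapping spans of a length-n input.
    """
    for i in range(n, 0, -1):
        for j in range(i - 1, n + 1):
            for k in range(n + 1, j, -1):
                for l in range(k - 1, n + 1):
                    yield (i, j, k, l)


def table_cells(n):
    """Number of (i, j, k, l) cells per nonterminal."""
    return sum(1 for _ in cyk_index_order(n))
```

Counting the quadruples gives C(n+4, 4) − 1 cells per nonterminal, which grows as n^4/24; for n = 10 that is 1000 cells. Multiplying by the m nonterminals gives the O(mn^4) table size, and a rule that additionally scans a split point h over the sequence adds the extra factor of n in the O(bn^5) term.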

31 Experimental Method
[Figure: workflow. Construction of a model: sample sequences with structure annotation (e.g. CUACUG…UUC) from an RNA family database are parsed to build an SMCFG. Secondary structure prediction: the CYK algorithm is applied to a test sequence (e.g. CUAGUC…UUA) with that SMCFG.]

32 Data Sets for Experiments
Three viral RNA families including pseudoknots from Rfam ver. 7.0
Family | Range of length | # of annotated sequences | # of test sequences
Corona_pk_3 | 62-64 | 14 | 10
HDV_ribozyme | 87-91 | 15 | 10
Tombus_3_IV | 89-92 | 18 | 12

33 Corona_pk_3 in Rfam ver. 7.0
Coronavirus 3' UTR pseudoknot
Sequence length: 62-64
[Figure: consensus structure]

34 HDV_ribozyme in Rfam ver. 7.0
Hepatitis delta virus ribozyme
Sequence length: 87-91
[Figure: consensus structure]

35 Tombus_3_IV in Rfam ver. 7.0
Tombusvirus 3' UTR region IV
Sequence length: 89-92
[Figure: consensus structure]

36 Evaluation for Prediction Results
precision = (# of correct base pairs predicted by the algorithm) / (# of predicted base pairs)
recall = (# of correct base pairs predicted by the algorithm) / (# of base pairs specified by the annotation)
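
Both measures reduce to set operations on base pairs. The following is a minimal sketch, assuming each structure is given as a collection of (i, j) index pairs; the function name `precision_recall` and the toy pair sets are hypothetical and do not reproduce any sequence from the experiments.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for two collections of (i, j) base pairs.

    A predicted pair counts as correct only if the same (i, j) pair appears
    in the annotation.
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    return correct / len(predicted), correct / len(annotated)


# Toy data: 18 annotated pairs, 18 predicted pairs, 17 of them shared.
ANNOTATED = {(i, 100 - i) for i in range(18)}
PREDICTED = (ANNOTATED - {(17, 83)}) | {(17, 82)}

PRECISION, RECALL = precision_recall(PREDICTED, ANNOTATED)
```

With 17 correct pairs out of 18 predicted and 18 annotated, both values are 17/18, or about 94.4%. Precision and recall coincide whenever the prediction and the annotation contain the same number of base pairs.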

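37 Experimental Results
Prediction accuracy
RNA family | Precision [%] (Average / Min / Max) | Recall [%] (Average / Min / Max)
Corona_pk_3 | 99.4 / 94.4 / 100.0 | 99.4 / 94.4 / 100.0
HDV_ribozyme | 100.0 / 100.0 / 100.0 | 100.0 / 100.0 / 100.0
Tombus_3_IV | 100.0 / 100.0 / 100.0 | 100.0 / 100.0 / 100.0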

38 Experimental Results (cont.)
Running time
RNA family | CPU time* [sec] (Average / Min / Max)
Corona_pk_3 | 27.8 / 26.0 / 30.4
HDV_ribozyme | 252.1 / 219.0 / 278.4
Tombus_3_IV | 244.8 / 215.2 / 257.5
*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB RAM

39 Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]
[Figure: workflow. A derivation tree representing the known structure of a sample sequence with structure annotation (e.g. CUACUG…UUC) from an RNA family database is aligned with a test sequence (e.g. CUAGUC…UUA) by the PSTAG algorithm to predict the secondary structure.]
[MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures,” Bioinformatics, 2005.

40 Comparison with PSTAG
Model | Average precision [%] (Corona / HDV / Tombus) | Average recall [%] (Corona / HDV / Tombus)
SMCFG | 99.4 / 100.0 / 100.0 | 99.4 / 100.0 / 100.0
PSTAG | 95.5 / 95.6 / 97.4 | 94.6 / 94.1 / 97.4

41 Summary
A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.
Polynomial time parsing and parameter estimation algorithms have been designed.
Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.

42 Inclusion Relation between Classes of Languages
[Figure: inclusion diagram of the language classes CFL, SL-TAL, ESL-TAL, TAL, RPL, (2, 1)-MCFL, (2, 2)-MCFL and MCFL; regions marked * are non-empty]

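43 Derivation in SMCFG
(d: positive integer)
A terminating rule applied with probability p yields a derivation with probability p.
A rule applied to derivations with probabilities p_1, …, p_k yields a derivation whose probability combines the rule probability with p_1, …, p_k.

44 Example of Derivation
Rules and Function:
Derivation: the first step has probability 0.7.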

45 Algorithm [CYK] (cont.)
[Figure: iteration case in which W_v is derived from W_y; positions i, h, h+1 and j mark the subsequences on the input of length n]

46 Inside Algorithm
Input: an input sequence
The following are calculated in a similar way to the CYK algorithm:
summed probabilities that W_v generates the given subsequences
summed probabilities that W_y generates the given subsequences
Output: summed probabilities that W_1 generates the input sequence
Complexity is the same as that of CYK.

47 Outside Algorithm
Input: an input sequence
The following are calculated using inside variables:
summed probabilities that W_1 generates the input sequence excluding the part generated by W_v
summed probabilities that W_1 generates the input sequence excluding the part generated by W_y
Complexity is the same as that of CYK.

48 Inside-Outside Algorithm
For each training sequence w^(r) (r = 1, …, N), the inside variables α^(r) and outside variables β^(r) are calculated.
Example: re-estimation of the transition probability t_v(y)
Expected count that W_v is used for w^(r) (r = 1, …, N):

49 Inside-Outside Algorithm (cont.)
For a given W_y, expected count that the rule is applied:
Re-estimated value of t_v(y):
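
The re-estimation step has the usual expectation-maximization shape: an expected rule count divided by an expected nonterminal count. The sketch below shows only that final normalization. It assumes the expected counts have already been obtained from the inside and outside variables (their formulas are not preserved in this transcript), and it assumes the expected count of W_v being used equals the sum of the expected counts of its transitions. The name `reestimate_transitions` and the example counts are hypothetical.

```python
def reestimate_transitions(expected_counts):
    """Normalize expected transition counts into probabilities t_v(y).

    expected_counts[v][y] is the expected number of times a rule taking
    W_v to W_y is applied, summed over all training sequences.
    Nonterminals with a zero total are skipped to avoid dividing by zero.
    """
    new_t = {}
    for v, row in expected_counts.items():
        total = sum(row.values())
        if total == 0:
            continue
        new_t[v] = {y: count / total for y, count in row.items()}
    return new_t


# Hypothetical expected counts for two nonterminals.
COUNTS = {1: {2: 3.0, 3: 1.0}, 2: {2: 0.0, 3: 0.0}}
NEW_T = reestimate_transitions(COUNTS)
```

For the counts above, `NEW_T` is `{1: {2: 0.75, 3: 0.25}}`: the re-estimated probabilities for W_1 sum to one, and W_2, which was never used, keeps its previous parameters.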

50 Experimental Results (cont.)
Prediction with 94.4% precision and recall: Corona_pk_3 (EMBL accession #: X51325.1)
Trusted structure in Rfam:
CUAGUCUUAUACACAAUGGUAAGCCAGUGGUAGUAAAGGUAUAAGAAAUUUGCUACUAUGUUA
[[[[[[[[ ((( ((((((( ]]]]]]]] ))))))) )))
Prediction by SMCFG:
CUAGUCUUAUACACAAUGGUAAGCCAGUGGUAGUAAAGGUAUAAGAAAUUUGCUACUAUGUUA
[[[[[[[[ (((((((((( ]]]]]]]] ))))))))))
