Published by Esther Harrell. Modified over 8 years ago.
1 RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar
PMSB2006, June 18, Tuusula, Finland
Yuki Kato, Hiroyuki Seki and Tadao Kasami
Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)
3 Table of Contents
- Background: Grammatical approach to RNA structure modeling
- Model: Stochastic multiple context-free grammar
- Algorithms: Parsing and parameter estimation
- Experimental results: RNA pseudoknot prediction
- Summary
4 RNA Secondary Structure: Stem-Loop
[Figure: a stem-loop with its stem and loop labeled. The stem is formed by complementary base pairs (A-U, G-C). Connecting the base pairs with arcs over the 5'-to-3' sequence gives a nested arc pattern.]
5 Modeling RNA Secondary Structure by Context-Free Grammar (CFG)
RNA secondary structure can be modeled by the parse structure of a CFG, so structure prediction corresponds to parsing.
Example of CFG rules: [...]
[Figure: the secondary structure of the sequence uucaucagaa next to its derivation tree, whose internal nodes are labeled S.]
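To make the link between derivation and structure concrete, here is a minimal Python sketch. It assumes a hypothetical grammar with rules S → x S y for each complementary pair (x, y) and S → L for an unpaired loop; the rules shown on the original slide are not preserved in this transcript, so the grammar and the function name `stem_loop_brackets` are illustrative only. The function peels complementary bases from both ends of a single stem-loop and prints the result in dot-bracket notation.

```python
# Watson-Crick pairs only; an assumption made for this sketch.
COMPLEMENT = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

def stem_loop_brackets(seq: str) -> str:
    """Derive seq with S -> x S y (paired) and S -> L (loop), outside-in."""
    i, j = 0, len(seq) - 1
    pairs = 0
    # Apply S -> x S y while the outermost bases are complementary.
    while i < j and (seq[i], seq[j]) in COMPLEMENT:
        pairs += 1
        i += 1
        j -= 1
    # Whatever remains is generated by the loop rule S -> L.
    loop_len = len(seq) - 2 * pairs
    return "(" * pairs + "." * loop_len + ")" * pairs

print(stem_loop_brackets("uucaucagaa"))  # (((....)))
```

This greedy peeling only handles one stem-loop; a general CFG parser would consider all rule applications by dynamic programming.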
6 RNA Secondary Structure: Pseudoknot
CFGs cannot represent pseudoknots.
[Figure: a pseudoknot. Connecting its base pairs with arcs over the 5'-to-3' sequence gives a crossed arc pattern.]
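The nested-versus-crossed distinction can be checked mechanically. The sketch below is not from the slides: it assumes base pairs are given as 1-based position tuples (i, j) with i < j, and the pair lists are hypothetical. Two arcs (i, j) and (k, l) cross when i < k < j < l.

```python
from itertools import combinations

def has_crossing(pairs):
    """Return True if any two base pairs (i, j), (k, l) satisfy i < k < j < l."""
    # Sorting guarantees the first pair of each combination starts no later.
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        if i < k < j < l:
            return True
    return False

nested = [(1, 10), (2, 9), (3, 8)]            # stem-loop: arcs are nested
crossed = [(1, 6), (2, 5), (3, 10), (4, 9)]   # hypothetical pseudoknot

print(has_crossing(nested), has_crossing(crossed))
```

A structure for which `has_crossing` is true is a pseudoknot in the sense used here, which is what a CFG cannot express.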
7 Early Studies
Property | Brown & Wilson / Cai et al. | Rivas & Eddy / Uemura et al. | Matsui et al.
Generative power | Based on CFG | > CFG | > CFG
Metric for optimization | Stochastic grammar | Free energy | Stochastic grammar
Time complexity | $O(n^3)$ / $O(n^6)$ | $O(n^6)$ / $O(n^5)$ | $O(n^5)$
$n$: sequence length
8 Early Studies (cont.)
Grammars for fully describing RNA pseudoknots:
- SL-TAG and ESL-TAG [Uemura et al., 1999]
- RPG [Rivas and Eddy, 2000]
These grammars have been identified as subclasses of multiple context-free grammars [Kato et al., 2005].
9 Motivation
Multiple context-free grammar (MCFG):
- Natural extension of CFG
- Easy to compare generative power and design algorithms
- Generative power to represent pseudoknots
- Polynomial time parsing algorithm
We have identified a subclass of MCFGs that is a candidate for the minimal grammar class able to represent pseudoknots [Kato et al., 2005].
10 What's New in the Present Work
- Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)
- Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs
- Experiments on RNA pseudoknot prediction
11 Early Studies and Present Work
Property | Brown & Wilson / Cai et al. | Rivas & Eddy / Uemura et al. | Matsui et al. | Our model
Generative power | Based on CFG | > CFG | > CFG | > CFG
Metric for optimization | Stochastic grammar | Free energy | Stochastic grammar | Stochastic grammar
Time complexity | $O(n^3)$ / $O(n^6)$ | $O(n^6)$ / $O(n^5)$ | $O(n^5)$ | $O(n^5)$
13 Relation between SMCFG and Major Probabilistic Models
[Figure: generative power increases from FA (weak) through CFG to MCFG (strong). Their probabilistic extensions are HMM, SCFG and SMCFG, respectively. Example applications: gene finding for HMM, stem-loops for SCFG, pseudoknots for SMCFG.]
14 From HMM to SCFG
Model | Rule | Right-hand side
HMM | $A \to aB$ ("Emit $a$ in $A$ and transit to $B$") | Terminal and nonterminal
SCFG | $A \to \alpha$ | Sequence of nonterminals and terminals
15 Stochastic Multiple Context-Free Grammar (SMCFG)
$G = (N, T, F, P, S)$
$N$: finite set of nonterminals, $T$: finite set of terminals, $F$: finite set of functions, $P$: finite set of rules with probabilities, $S \in N$: start symbol
Item | SCFG | SMCFG
Nonterminal | generates a sequence | generates a tuple of sequences
Rule | $A \to \alpha$, where $\alpha$ is a sequence of nonterminals and terminals | $A_0 \to f[A_1, \dots, A_k]$, where $A_0, \dots, A_k$ are nonterminals and $f$ is a function defined over sequences
16 Functions of SMCFG Example:
17 Rules of SMCFG
Rule: $A_0 \xrightarrow{p} f[A_1, \dots, A_k]$, where $p$ is the probability that the rule is applied.
The probabilities of the rules with the same left-hand side must sum to one.
Example: [...]
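The normalization condition on rule probabilities is easy to check in code. The sketch below is only illustrative: the rule set, the right-hand-side labels and the function name `is_normalized` are hypothetical and do not come from the slides.

```python
from collections import defaultdict
import math

# Hypothetical rules: (left-hand side, right-hand side label, probability).
rules = [
    ("S", "J[A]", 1.0),
    ("A", "UP2L_a[A]", 0.5),
    ("A", "UP2R_u[A]", 0.3),
    ("A", "BASE", 0.2),
]

def is_normalized(rules, tol=1e-9):
    """Check that probabilities sharing a left-hand side sum to one."""
    totals = defaultdict(float)
    for lhs, _rhs, p in rules:
        totals[lhs] += p
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())

print(is_normalized(rules))
```

A tolerance is used because sums of floating-point probabilities rarely equal one exactly.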
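18 Derivation Trees in SMCFG
[Figure: derivation trees rooted at $A_1, \dots, A_k$ with probabilities $p_1, \dots, p_k$ are combined under a new root $A$ by applying a rule with function $f$; the probability of the resulting tree is the rule probability multiplied by $p_1 \cdots p_k$.]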
19 Modeling Pseudoknot by SMCFG
$UP^{2L}_a[(x_1, x_2)] = (x_1, a x_2)$
$UP^{2R}_u[(x_1, x_2)] = (x_1, x_2 u)$
[Figure: derivation tree with nodes labeled A and B. The tuple (ag, cu) has probability 0.7, (ag, acu) has probability 0.35, and (ag, acuu) has probability 0.28.]
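The two functions on this slide can be replayed in a few lines of Python. This is a sketch under assumptions: the helper names `up2l` and `up2r` are made up here, and the rule probabilities 0.5 and 0.8 are not stated in the transcript; they are chosen so that the running products reproduce the 0.35 and 0.28 shown in the figure, starting from (ag, cu) at 0.7.

```python
def up2l(a):
    """UP^{2L}_a: prepend terminal a to the second component."""
    return lambda x: (x[0], a + x[1])

def up2r(u):
    """UP^{2R}_u: append terminal u to the second component."""
    return lambda x: (x[0], x[1] + u)

# Tuple at the bottom of the slide's tree, with its probability.
value, prob = ("ag", "cu"), 0.7

# Assumed rule probabilities (0.5, 0.8); see the note above.
for func, rule_prob in [(up2l("a"), 0.5), (up2r("u"), 0.8)]:
    value = func(value)
    prob *= rule_prob  # tree probability = rule probability * subtree probability
    print(value, round(prob, 2))
```

Each step multiplies the probability of the subtree by the probability of the rule applied on top of it, which is why the values shrink as the tuple grows.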
20 SMCFG for RNA Pseudoknot Modeling
$W_1, \dots, W_m$: nonterminals
Note: $W_1$ is the start symbol.
For each rule, two real values are specified: a transition probability $p_1$ ($0 < p_1 \le 1$) and an emission probability $p_2$ ($0 < p_2 \le 1$).
The probability of each rule is defined as $p = p_1 \cdot p_2$.
21 SMCFG $G_s$
23 Algorithms for SMCFG
- The CYK algorithm calculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).
- The inside algorithm calculates the probability of a sequence given an SMCFG.
- The inside-outside algorithm estimates optimal probability parameters for an SMCFG given a set of example sequences.
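The difference between what CYK and the inside algorithm report can be shown on a toy case. The sketch below assumes a sequence with three alternative derivation trees whose probabilities are hypothetical; the real algorithms never enumerate trees explicitly but obtain the same quantities by dynamic programming.

```python
import math

# Hypothetical probabilities of three derivation trees of one sequence.
tree_probs = [0.02, 0.005, 0.001]

# CYK: score of the single most likely tree, on a log scale.
cyk_score = max(math.log(p) for p in tree_probs)

# Inside: probability of the sequence, summed over all trees.
inside_prob = sum(tree_probs)

print(cyk_score, inside_prob)
```

Replacing the maximum by a sum is the only change needed to go from the first quantity to the second.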
24 CYK Algorithm
Input: [...]
The following are calculated by dynamic programming:
- [...]: log maximum probability that $W_v$ generates [...]
- [...]: log maximum probability that $W_y$ generates [...]
25 CYK Algorithm (cont.)
Output: log maximum probability that $W_1$ generates [...], i.e. [...]
- [...]: the most likely derivation tree
- [...]: entire set of probability parameters
26 Algorithm [CYK]
Initialization:
for i←1 to n+1, j←i to n+1, v←1 to m do
  if [...] // ε: empty sequence
  then [...]
  else [...]
Iteration:
for i←n downto 1, j←i−1 to n, k←n+1 downto j+1, l←k−1 to n, v←1 to m
  // Some examples are shown.
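27 Algorithm [CYK] (cont.)
if [...]
[Figure: case in which $W_v$ has children $W_y$ and $W_z$; positions $i$, $h$, $h+1$, $j$, $k$, $l$ are marked on the sequence from 1 to $n$, with components labeled $x_1$, $x_{21}$, $x_{22}$.]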
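28 Algorithm [CYK] (cont.)
if [...]
[Figure: case in which $W_v$ has the single child $W_y$; positions $i$, $i+1$, $j$, $k$, $l-1$, $l$ are marked on the sequence from 1 to $n$, with components labeled $x_1$, $x_2$ and terminals $a_i$, $a_l$.]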
29 Complexity of CYK Algorithm
$m$: number of nonterminals ($m = a + b$)
$n$: sequence length
Time complexity: $O(amn^4 + bn^5)$
Space complexity: $O(mn^4)$
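To see where the $n^4$ factor comes from, the sketch below counts the index tuples (i, j, k, l) visited by the iteration loop of the CYK algorithm for one nonterminal. It assumes the loop bounds read j←i−1 and l←k−1 (the minus signs appear to be lost in the transcript); the function name is made up for this illustration.

```python
def count_index_tuples(n: int) -> int:
    """Count (i, j, k, l) tuples visited by the iteration loop for length n."""
    count = 0
    for i in range(n, 0, -1):                  # i <- n downto 1
        for j in range(i - 1, n + 1):          # j <- i-1 to n
            for k in range(n + 1, j, -1):      # k <- n+1 downto j+1
                for l in range(k - 1, n + 1):  # l <- k-1 to n
                    count += 1
    return count

for n in (10, 20, 40):
    print(n, count_index_tuples(n))
```

The count grows on the order of $n^4$, which matches a table of $O(mn^4)$ entries once the loop over the $m$ nonterminals is included.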
31 Experimental Method
[Figure: workflow. Sample sequences with structure annotation are taken from an RNA family database and used to construct an SMCFG model. A test sequence is then parsed with the CYK algorithm to predict its secondary structure.]
32 Data Sets for Experiments
Three viral RNA families including pseudoknots from Rfam ver. 7.0
Family | Range of length | # of annotated sequences | # of test sequences
Corona_pk_3 | 62-64 | 14 | 10
HDV_ribozyme | 87-91 | 15 | 10
Tombus_3_IV | 89-92 | 18 | 12
33 Corona_pk_3 in Rfam ver. 7.0
Coronavirus 3' UTR pseudoknot
Sequence length: 62-64
[Figure: consensus structure]
34 HDV_ribozyme in Rfam ver. 7.0
Hepatitis delta virus ribozyme
Sequence length: 87-91
[Figure: consensus structure]
35 Tombus_3_IV in Rfam ver. 7.0
Tombusvirus 3' UTR region IV
Sequence length: 89-92
[Figure: consensus structure]
36 Evaluation for Prediction Results
$\text{precision} = \dfrac{\text{\# of correct base pairs predicted by the algorithm}}{\text{\# of predicted base pairs}}$
$\text{recall} = \dfrac{\text{\# of correct base pairs predicted by the algorithm}}{\text{\# of base pairs specified by the annotation}}$
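These two measures translate directly into code. The sketch below assumes base pairs are represented as (i, j) position tuples; the function name and the example pair sets are hypothetical and are not taken from the experiments.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for two collections of base pairs."""
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)  # pairs present in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(annotated) if annotated else 0.0
    return precision, recall

annotated = {(1, 10), (2, 9), (3, 8)}   # hypothetical trusted structure
predicted = {(1, 10), (2, 9), (4, 7)}   # hypothetical prediction

print(precision_recall(predicted, annotated))
```

When the prediction and the annotation contain the same number of pairs, as in this example, precision and recall coincide.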
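37 Experimental Results
Prediction accuracy
RNA family | Precision [%] Average | Precision [%] Min | Precision [%] Max | Recall [%] Average | Recall [%] Min | Recall [%] Max
Corona_pk_3 | 99.4 | 94.4 | 100.0 | 99.4 | 94.4 | 100.0
HDV_ribozyme | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
Tombus_3_IV | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0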
38 Experimental Results (cont.)
Running time
RNA family | CPU time* [sec] Average | Min | Max
Corona_pk_3 | 27.8 | 26.0 | 30.4
HDV_ribozyme | 252.1 | 219.0 | 278.4
Tombus_3_IV | 244.8 | 215.2 | 257.5
*: Implementation in ANSI C on a machine with an Intel Pentium D CPU at 2.80 GHz and 2.00 GB of RAM
39 Pair Stochastic Tree Adjoining Grammar (PSTAG) [MSS05]
[Figure: workflow. Sample sequences with structure annotation from an RNA family database give a derivation tree representing a known structure. The PSTAG algorithm aligns a test sequence against it to predict the secondary structure.]
[MSS05] Matsui et al., "Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures," Bioinformatics, 2005.
40 Comparison with PSTAG
Model | Average precision [%] Corona | HDV | Tombus | Average recall [%] Corona | HDV | Tombus
SMCFG | 99.4 | 100.0 | 100.0 | 99.4 | 100.0 | 100.0
PSTAG | 95.5 | 95.6 | 97.4 | 94.6 | 94.1 | 97.4
41 Summary
- A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.
- Polynomial time parsing and parameter estimation algorithms have been designed.
- Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.
42 Inclusion Relation between Class of Languages
[Figure: inclusion diagram over the language classes CFL, SL-TAL, ESL-TAL, TAL, RPL, (2, 1)-MCFL, (2, 2)-MCFL, (2, 2)-MCFL (degree 5) and MCFL. Regions marked * are non-empty.]
43 Derivation in SMCFG
($d$: positive integer)
[...] with probability $p$
[...] with probabilities $p_1, \dots, p_k$
[...] with probability [...]
44 Example of Derivation
Rules and Function: [...]
Derivation: [...] with probability 0.7, [...] with probability [...]
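45 Algorithm [CYK] (cont.)
if [...]
Note: [...]
[Figure: case in which $W_v$ has the child $W_y$; positions $i$, $h$, $h+1$, $j$ are marked on the sequence from 1 to $n$.]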
46 Inside Algorithm
Input: [...]
The following are calculated in a similar way to the CYK algorithm:
- [...]: summed probabilities that $W_v$ generates [...]
- [...]: summed probabilities that $W_y$ generates [...]
Output: summed probabilities that $W_1$ generates [...], i.e. [...]
Complexity is the same as that of CYK.
47 Outside Algorithm
Input: [...]
The following are calculated using the inside variables:
- [...]: summed probabilities that $W_1$ generates the input sequence excluding [...] generated by $W_v$
- [...]: summed probabilities that $W_1$ generates the input sequence excluding [...] generated by $W_y$
Complexity is the same as that of CYK.
48 Inside-Outside Algorithm
For each training sequence $w^{(r)}$ ($r = 1, \dots, N$), the inside and outside variables (each superscripted with $(r)$) are calculated.
Example: re-estimation of transition probability $t_v(y)$ for [...]
Expected count that $W_v$ is used for $w^{(r)}$ ($r = 1, \dots, N$): [...]
49 Inside-Outside Algorithm (cont.)
For a given $W_y$, expected count that [...] is applied: [...]
Re-estimated value of $t_v(y)$: [...]
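The re-estimation formula itself is not preserved in this transcript. In the usual expectation-maximization scheme, a transition probability is re-estimated as the expected number of times the rule from $W_v$ to $W_y$ is applied divided by the expected number of times $W_v$ is used, and the sketch below assumes that form. The function name and the expected counts are hypothetical; in practice the counts would come from the inside and outside variables of all training sequences.

```python
def reestimate_transitions(expected_counts):
    """expected_counts[v][y]: expected number of times W_v transits to W_y,
    summed over all training sequences. Returns normalized t_v(y)."""
    new_t = {}
    for v, row in expected_counts.items():
        total = sum(row.values())  # expected number of times W_v is used
        if total > 0:
            new_t[v] = {y: c / total for y, c in row.items()}
        else:
            new_t[v] = dict(row)   # W_v never used: leave the row unchanged
    return new_t

counts = {"W1": {"W2": 3.0, "W3": 1.0}}  # hypothetical expected counts
print(reestimate_transitions(counts))
```

Normalizing per nonterminal keeps the transition probabilities out of each $W_v$ summing to one after every iteration.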
50 Experimental Results (cont.)
A prediction with 94.4% precision and recall: Corona_pk_3 (EMBL accession #: X51325.1)
Trusted structure in Rfam:
CUAGUCUUAUACACAAUGGUAAGCCAGUGGUAGUAAAGGUAUAAGAAAUUUGCUACUAUGUUA
[[[[[[[[ ((( ((((((( ]]]]]]]] ))))))) )))
Prediction by SMCFG:
CUAGUCUUAUACACAAUGGUAAGCCAGUGGUAGUAAAGGUAUAAGAAAUUUGCUACUAUGUUA
[[[[[[[[ (((((((((( ]]]]]]]] ))))))))))
[Legend: highlighted positions mark true positives.]
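As a consistency check on the 94.4% figure, counting the opening brackets in the two structures above gives 18 base pairs in the trusted structure (8 + 3 + 7) and 18 in the prediction (8 + 10). Assuming 17 of the predicted pairs coincide with annotated ones (an inference, since the true-positive highlighting is lost in this transcript), both measures come out as reported:

```latex
\text{precision} = \text{recall} = \frac{17}{18} \approx 0.944 = 94.4\%
```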