1
Project No. 4: Information Discovery Using Stochastic Context-Free Grammars (SCFG)
Wei Du, Ranjan Santra
May 16, 2001
2
Why SCFG?
More powerful:
– models the evolutionary processes of mutation, insertion, and deletion
– models interactions between base pairs
[Figure: base-paired structure drawings for the sequences GACGCAAGUC and UCGGAAACGA]
3
Problems That Can Be Solved with SCFGs
Scoring problem
– compute P(x | G)
– inside algorithm
Alignment problem
– find the most likely generating path
– Cocke-Younger-Kasami (CYK) algorithm
Training problem
– parameter re-estimation
– inside-outside algorithm
4
Normal Form for SCFG
SCFGs can have any number of symbols on the right-hand side of a production:
S → ABC (0.9)   A → a (0.5)   B → b (0.8)   C → c (0.6)
Chomsky Normal Form
– requires all production rules to be of the form W → XY or W → a
S → AD (0.9)   D → BC (1.0)   A → a (0.5)   B → b (0.8)   C → c (0.6)
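The rewriting of S → ABC into S → AD, D → BC can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `binarize` and the helper symbols `X1`, `X2`, ... are hypothetical (the slide calls its helper D), the helper names are assumed not to clash with existing symbols, and only over-long right-hand sides are handled, not the other steps of a full normal-form conversion.

```python
def binarize(rules):
    """Split rules with more than two right-hand-side symbols into binary rules.

    rules: list of (lhs, rhs_tuple, prob).
    Helper non-terminals are named X1, X2, ... (hypothetical names).
    """
    out, fresh = [], 0
    for lhs, rhs, prob in rules:
        while len(rhs) > 2:
            fresh += 1
            helper = f"X{fresh}"
            # The first binary rule keeps the original probability ...
            out.append((lhs, (rhs[0], helper), prob))
            # ... and the helper expands with probability 1.0,
            # so the probability of the whole derivation is unchanged.
            lhs, rhs, prob = helper, rhs[1:], 1.0
        out.append((lhs, rhs, prob))
    return out


# The grammar from the slide.
rules = [("S", ("A", "B", "C"), 0.9), ("A", ("a",), 0.5),
         ("B", ("b",), 0.8), ("C", ("c",), 0.6)]
cnf = binarize(rules)
for lhs, rhs, prob in cnf:
    print(lhs, "->", " ".join(rhs), f"({prob})")
```

Giving the helper rule probability 1.0 is what keeps the two grammars equivalent as probability models: every derivation of a string has the same probability before and after the split.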
5
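Example Grammar

Grammar G:
A → BA (0.4)   A → CA (0.2)   A → a (0.3)   A → g (0.1)
B → CA (0.4)   B → g (0.3)    B → t (0.3)
C → CA (0.6)   C → t (0.2)    C → c (0.2)

Matrix form of G:

Emission matrix E (rows: terminals; columns: non-terminals)
      A    B    C
a    0.3  0.0  0.0
g    0.1  0.3  0.0
t    0.0  0.3  0.2
c    0.0  0.0  0.2

Transition matrix T, one slice per left child (rows: parent non-terminal; columns: right child)

Left child A:
      A    B    C
A    0.0  0.0  0.0
B    0.0  0.0  0.0
C    0.0  0.0  0.0

Left child B:
      A    B    C
A    0.4  0.0  0.0
B    0.0  0.0  0.0
C    0.0  0.0  0.0

Left child C:
      A    B    C
A    0.2  0.0  0.0
B    0.4  0.0  0.0
C    0.6  0.0  0.0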
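As a rough illustration of how the production list maps onto the two matrices, the sketch below fills E and T from the rules of the example grammar. The names `RULES` and `build_matrices` and the nested-list layout are assumptions made for this example; the slides do not say how the test system stores the grammar.

```python
NONTERMS = ["A", "B", "C"]
TERMS = ["a", "g", "t", "c"]
# (left-hand side, right-hand side, probability), copied from the example grammar.
RULES = [("A", "BA", 0.4), ("A", "CA", 0.2), ("A", "a", 0.3), ("A", "g", 0.1),
         ("B", "CA", 0.4), ("B", "g", 0.3), ("B", "t", 0.3),
         ("C", "CA", 0.6), ("C", "t", 0.2), ("C", "c", 0.2)]


def build_matrices(rules):
    """Return (E, T) with E[terminal][parent] and T[left][parent][right]."""
    n = {s: i for i, s in enumerate(NONTERMS)}
    a = {s: i for i, s in enumerate(TERMS)}
    E = [[0.0] * len(NONTERMS) for _ in TERMS]
    T = [[[0.0] * len(NONTERMS) for _ in NONTERMS] for _ in NONTERMS]
    for lhs, rhs, p in rules:
        if len(rhs) == 1:            # emission rule W -> a
            E[a[rhs]][n[lhs]] = p
        else:                        # binary rule W -> XY
            left, right = rhs
            T[n[left]][n[lhs]][n[right]] = p
    return E, T


E, T = build_matrices(RULES)

# Sanity check: the rule probabilities of each non-terminal should sum to 1.
totals = {v: 0.0 for v in NONTERMS}
for lhs, _, p in RULES:
    totals[lhs] += p

for term, row in zip(TERMS, E):
    print(term, row)
print(totals)
```

Indexing T by the left child first mirrors the slide, where each slice is labelled with one left child.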
6
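Inside Algorithm
Compute α(i, j, v)
– the probability that a subtree rooted at the non-terminal v derives the subword (x_i … x_j):
  $\alpha(i, j, v) = P(v \Rightarrow^{*} x_i \ldots x_j \mid G)$
– computed recursively in a bottom-up fashion, starting with subwords of length one:
  $\alpha(i, j, v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i, k, y)\, \alpha(k+1, j, z)\, t_v(y, z)$
  where M is the number of non-terminals and $t_v(y, z)$ is the probability of the production v → yz.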
7
Inside Algorithm
Basic idea:
– calculate P starting with subsequences of length 1
– then subsequences of length 2
– … …
– continue working on longer and longer subsequences
– until a probability is determined for the complete parse tree rooted at the start non-terminal
8
Example for Inside Algorithm
Input sequence: tag
Non-terminals: A B C
1. Calculate P for subsequences of length 1 (copy from matrix E):
   A ⇒ t   A ⇒ a   A ⇒ g
   B ⇒ t   B ⇒ a   B ⇒ g
   C ⇒ t   C ⇒ a   C ⇒ g
2. Calculate P for subsequences of length 2:
   A ⇒* ta   A ⇒* ag
   B ⇒* ta   B ⇒* ag
   C ⇒* ta   C ⇒* ag
   Example: B ⇒* ta via B → CA, C → t, A → a
3. Calculate P for subsequences of length 3:
   A ⇒* tag
   via A → BA, B ⇒* ta, A → g
   or via A → CA, C ⇒* ta, A → g
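The three steps for "tag" can be reproduced with a short Python sketch of the inside recursion. It is a minimal illustration, not the test system's implementation: the dictionaries `EMIT` and `BINARY` and the function `inside` are names invented here, positions are 0-based, and A is assumed to be the start non-terminal because the slide's final step derives "tag" from A.

```python
from collections import defaultdict

# Example grammar from the slides (Chomsky normal form).
EMIT = {("A", "a"): 0.3, ("A", "g"): 0.1, ("B", "g"): 0.3, ("B", "t"): 0.3,
        ("C", "t"): 0.2, ("C", "c"): 0.2}
# (v, y, z) -> probability of the production v -> y z
BINARY = {("A", "B", "A"): 0.4, ("A", "C", "A"): 0.2,
          ("B", "C", "A"): 0.4, ("C", "C", "A"): 0.6}
NONTERMS = "ABC"


def inside(x):
    """Return alpha[(i, j, v)] for 0-based, inclusive positions i <= j."""
    L = len(x)
    alpha = defaultdict(float)
    # Step 1: subsequences of length 1 come straight from the emission matrix.
    for i, sym in enumerate(x):
        for v in NONTERMS:
            alpha[i, i, v] = EMIT.get((v, sym), 0.0)
    # Steps 2, 3, ...: longer and longer subsequences.
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for (v, y, z), t in BINARY.items():
                for k in range(i, j):      # split point between y and z
                    alpha[i, j, v] += alpha[i, k, y] * alpha[k + 1, j, z] * t
    return alpha


alpha = inside("tag")
print("P(tag | G) with start symbol A:", alpha[0, 2, "A"])
```

For this grammar the result is about 0.00168: the derivation through A → BA contributes 0.4 × 0.024 × 0.1 = 0.00096 and the one through A → CA contributes 0.2 × 0.036 × 0.1 = 0.00072. The entries for "ag" are all zero, because every binary rule needs B or C as its left child and neither can emit "a".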
9
Outside Algorithm
Compute β(i, j, v)
– the probability that, starting from the start non-terminal S, the non-terminal v is generated and the string not dominated by it is (x_1 … x_{i-1}) to the left and (x_{j+1} … x_L) to the right:
  $\beta(i, j, v) = P(S \Rightarrow^{*} x_1 \ldots x_{i-1}\, v\, x_{j+1} \ldots x_L \mid G)$
10
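Outside Algorithm
Compute β(i, j, v)
– computed recursively, starting with the largest excluded subsequence:
  $\beta(i, j, v) = \sum_{y} \sum_{z} \sum_{k=1}^{i-1} \alpha(k, i-1, z)\, \beta(k, j, y)\, t_y(z, v) + \sum_{y} \sum_{z} \sum_{k=j+1}^{L} \alpha(j+1, k, z)\, \beta(i, k, y)\, t_y(v, z)$
  where α denotes the inside probabilities and $t_y(z, v)$ is the probability of the production y → zv.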
11
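Outside Algorithm
Basic idea:
[Figure: parse tree rooted at S over x_1 … x_L; the non-terminal v spans x_i … x_j, and β(i, j, v) covers the rest of the tree, x_1 … x_{i-1} and x_{j+1} … x_L]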
12
Outside Algorithm
Basic idea:
[Figure: first case of the recursion; v is the right child of y, and its sibling z spans x_k … x_{i-1} to the left of x_i … x_j]
13
Outside Algorithm
Basic idea:
[Figure: second case of the recursion; v is the left child of y, and its sibling z spans x_{j+1} … x_k to the right of x_i … x_j]
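The two cases of the outside recursion (v as right child, v as left child) can be sketched in Python as follows. This is an illustrative sketch under assumptions: the names `EMIT`, `BINARY`, `inside`, and `outside` are invented here, positions are 0-based, A is assumed to be the start non-terminal, and the grammar is the example grammar from the earlier slide. The inside table is recomputed because the outside recursion depends on it.

```python
from collections import defaultdict

EMIT = {("A", "a"): 0.3, ("A", "g"): 0.1, ("B", "g"): 0.3, ("B", "t"): 0.3,
        ("C", "t"): 0.2, ("C", "c"): 0.2}
# (parent, left child, right child) -> rule probability
BINARY = {("A", "B", "A"): 0.4, ("A", "C", "A"): 0.2,
          ("B", "C", "A"): 0.4, ("C", "C", "A"): 0.6}
NONTERMS = "ABC"


def inside(x):
    """Inside probabilities alpha[(i, j, v)], 0-based inclusive positions."""
    L = len(x)
    alpha = defaultdict(float)
    for i, sym in enumerate(x):
        for v in NONTERMS:
            alpha[i, i, v] = EMIT.get((v, sym), 0.0)
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for (v, y, z), t in BINARY.items():
                for k in range(i, j):
                    alpha[i, j, v] += alpha[i, k, y] * alpha[k + 1, j, z] * t
    return alpha


def outside(x, alpha, start="A"):
    """Outside probabilities beta[(i, j, v)], largest spans first."""
    L = len(x)
    beta = defaultdict(float)
    beta[0, L - 1, start] = 1.0      # nothing is excluded around the root
    for span in range(L - 1, 0, -1):
        for i in range(L - span + 1):
            j = i + span - 1
            for (y, left, right), t in BINARY.items():
                # Case 1: v = right child; sibling `left` spans x[k..i-1].
                for k in range(i):
                    beta[i, j, right] += alpha[k, i - 1, left] * beta[k, j, y] * t
                # Case 2: v = left child; sibling `right` spans x[j+1..k].
                for k in range(j + 1, L):
                    beta[i, j, left] += alpha[j + 1, k, right] * beta[i, k, y] * t
    return beta


x = "tag"
alpha = inside(x)
beta = outside(x, alpha)
# Consistency check: for every position i, summing beta(i, i, v) * e_v(x_i)
# over v should give back P(x | G) from the inside algorithm.
per_position = [sum(beta[i, i, v] * EMIT.get((v, x[i]), 0.0) for v in NONTERMS)
                for i in range(len(x))]
print(alpha[0, len(x) - 1, "A"], per_position)
```

The check at the end is a convenient way to catch index errors: every position of the sequence must be emitted by exactly one non-terminal in each parse, so each per-position sum has to agree with the inside result.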
14
CYK Algorithm
Basic idea:
– find the most likely generating path, starting with subsequences of length 1
– then subsequences of length 2
– … …
– continue working on longer and longer subsequences
– until a path is determined for the complete parse tree rooted at the start non-terminal
15
Example for CYK Algorithm
Input sequence: tag
Non-terminals: A B C
1. Find path for subsequences of length 1 (copy from matrix E):
   A ⇒ t   A ⇒ a   A ⇒ g
   B ⇒ t   B ⇒ a   B ⇒ g
   C ⇒ t   C ⇒ a   C ⇒ g
2. Find path for subsequences of length 2:
   A ⇒* ta   A ⇒* ag
   B ⇒* ta   B ⇒* ag
   C ⇒* ta   C ⇒* ag
   Best path for B ⇒* ta: B → CA, C → t, A → a
3. Find path for subsequences of length 3:
   A ⇒* tag
   Best path: A → BA, B ⇒* ta, A → g
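A small Python sketch of CYK with traceback shows how the best path for "tag" is recovered. As with the other examples, this is an illustration under assumptions rather than the test system's code: `EMIT`, `BINARY`, `cyk`, and `parse_tree` are names invented here, positions are 0-based, A is assumed to be the start non-terminal, and plain probabilities are used instead of log probabilities to keep the example short.

```python
EMIT = {("A", "a"): 0.3, ("A", "g"): 0.1, ("B", "g"): 0.3, ("B", "t"): 0.3,
        ("C", "t"): 0.2, ("C", "c"): 0.2}
# (v, y, z) -> probability of the production v -> y z
BINARY = {("A", "B", "A"): 0.4, ("A", "C", "A"): 0.2,
          ("B", "C", "A"): 0.4, ("C", "C", "A"): 0.6}
NONTERMS = "ABC"


def cyk(x):
    """Return (gamma, back): best-path probabilities and traceback pointers."""
    L = len(x)
    gamma, back = {}, {}
    for i, sym in enumerate(x):
        for v in NONTERMS:
            gamma[i, i, v] = EMIT.get((v, sym), 0.0)
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in NONTERMS:
                gamma[i, j, v] = 0.0
            for (v, y, z), t in BINARY.items():
                for k in range(i, j):
                    # Same recursion as the inside algorithm, with max for sum.
                    p = gamma[i, k, y] * gamma[k + 1, j, z] * t
                    if p > gamma[i, j, v]:
                        gamma[i, j, v] = p
                        back[i, j, v] = (y, z, k)
    return gamma, back


def parse_tree(x, back, i, j, v):
    """Rebuild the best parse as a bracketed string from the traceback table."""
    if i == j:
        return f"({v} {x[i]})"
    y, z, k = back[i, j, v]
    left = parse_tree(x, back, i, k, y)
    right = parse_tree(x, back, k + 1, j, z)
    return f"({v} {left} {right})"


x = "tag"
gamma, back = cyk(x)
best = gamma[0, len(x) - 1, "A"]
# Only trace back when the sequence can be derived at all.
tree = parse_tree(x, back, 0, len(x) - 1, "A") if best > 0 else None
print(best, tree)
```

For "tag" the best path has probability of about 0.00096 and the tree `(A (B (C t) (A a)) (A g))`, which is the path A → BA, B → CA, C → t, A → a, A → g from the slide. For long sequences, sums of log probabilities would be the safer choice, since products of many small probabilities underflow.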
17
A Test System
– user interfaces
– compute P(x | G)
– find Viterbi path
Primary target: modeling a small sequence using SCFG
20
Display Grammar Window
22
Calculating Probability for “tag”
23
Calculating Probability for “tgacg”
24
Finding Most Likely Path for “tag”
25
Finding Most Likely Path for “tgacg”
26
Finding Most Likely Path for “tgtacggta”
28
Training Algorithm
Basic idea:
– start with an original grammar G, which can model a family of sequences, but not well enough
– feed it a sequence x
– compute the probability P(x | G)
– adjust G to obtain a new grammar G′
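The slides do not spell out how G is adjusted. One common way to do it, assumed here because the deck names inside-outside as its training algorithm, is the standard inside-outside re-estimation step. In the equations below, α and β are the inside and outside probabilities, $t_v(y,z)$ is the probability of v → yz, and $e_v(a)$ (a symbol introduced for this sketch) is the probability of v → a.

```latex
% Expected number of times non-terminal v is used in a derivation of x
c(v) = \frac{1}{P(x \mid G)} \sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i,j,v)\, \beta(i,j,v)

% Expected number of times the production v -> yz is used
c(v \to yz) = \frac{1}{P(x \mid G)} \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \sum_{k=i}^{j-1}
    \beta(i,j,v)\, \alpha(i,k,y)\, \alpha(k+1,j,z)\, t_v(y,z)

% Expected number of times the production v -> a is used
c(v \to a) = \frac{1}{P(x \mid G)} \sum_{i \,:\, x_i = a} \beta(i,i,v)\, e_v(a)

% Re-estimated parameters of the adjusted grammar G'
\hat{t}_v(y,z) = \frac{c(v \to yz)}{c(v)}, \qquad
\hat{e}_v(a) = \frac{c(v \to a)}{c(v)}
```

Repeating this step with G′ in place of G gives an expectation-maximization loop, which converges to a local rather than a global optimum of P(x | G).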