1
Project 4: Information Discovery Using Stochastic Context-Free Grammars (SCFG)
Wei Du, Ranjan Santra
May 16, 2001
2
Outline
* Goals of the project
* Quick review of background material
* Data input and parsing
* The inside algorithm
* The Cocke-Younger-Kasami (CYK) algorithm
* Implementation details and results
3
Goals for this project
* Implement a stochastic context-free grammar
* Model a small sequence using a sample SCFG
* Build a user interface for easy definition of the grammar
* Read the grammar into memory and compute
–a. the probability that the specified grammar produced a sample sequence
–b. the most probable parse tree for that sequence
* Remaining issue: parameter re-estimation
4
Quick Review
* Context-free grammar: every production has the form W => β, with one non-terminal on the left and any number of terminals and non-terminals on the right
* The same CFG in Chomsky normal form has only productions of the form W_v => W_y W_z or W_v => a: one non-terminal yields either two non-terminals or one terminal
* Any CFG can be put into normal form by adding non-terminals
* Normal form is chosen for computational ease
5
Stochastic Context-Free Grammar in Chomsky Normal Form

Original grammar:
S => A B C (0.9)
A => a (0.5)
B => b (0.8)
C => c (0.6)

Same grammar in Chomsky normal form:
S => A D (0.9)
D => B C (1.0)
A => a (0.5)
B => b (0.8)
C => c (0.6)

* All productions have associated transition probabilities
* Given a grammar in this form and a sample sequence, we want to
–compute the probability that this grammar produced the sequence (inside algorithm)
–find the optimal parse tree through this grammar that yields the sample sequence (CYK algorithm)
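As a quick check of what these probabilities mean, consider the sequence abc. Under the normal-form grammar it has exactly one parse (S => A D, D => B C, then the three terminal rules), so the total probability and the probability of the best parse coincide. Assuming, as is usual for SCFGs, that rule probabilities multiply along a parse tree:

\[
P(abc \mid G) = \underbrace{0.9}_{S \Rightarrow AD} \times \underbrace{1.0}_{D \Rightarrow BC} \times \underbrace{0.5}_{A \Rightarrow a} \times \underbrace{0.8}_{B \Rightarrow b} \times \underbrace{0.6}_{C \Rightarrow c} = 0.216
\]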
6
Grammar stored in linked lists
[Figure: index of the non-terminals W1, W2, W3, W4, ..., WM with linked lists of productions. Labels: "Terminal symbols" (a, b, c); "Non-terminal productions, indexed for easy lookup".]
7
Solving the problem
1. Users input their grammar in normal form
2. Grammar written to file

SAD$0.9    (S => A D with probability 0.9)
DBC$1.0
Aa$0.5
Bb$0.8
Cc$0.6

* One line per production rule
* Grammar starts with the start symbol
* Each line denotes a transition between non-terminals or between a non-terminal and a terminal
* The probability of a transition is given after the $ symbol
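The file format above can be read with a few lines of code. The following is a minimal sketch, not the project's actual parser: it assumes every symbol is a single character, and the names `parse_rule` and `parse_grammar` are hypothetical.

```python
def parse_rule(line):
    """Parse one line such as 'SAD$0.9' or 'Aa$0.5' into (lhs, rhs, prob).

    Assumption: every symbol is exactly one character, so the text before
    '$' has length 3 (two non-terminals on the right) or 2 (one terminal).
    """
    symbols, sep, prob_text = line.strip().partition("$")
    if not sep or len(symbols) not in (2, 3):
        raise ValueError(f"malformed rule: {line!r}")
    lhs, rhs = symbols[0], tuple(symbols[1:])
    return lhs, rhs, float(prob_text)


def parse_grammar(text):
    """Return (start_symbol, rules); the first rule's left side is the start."""
    rules = [parse_rule(ln) for ln in text.splitlines() if ln.strip()]
    if not rules:
        raise ValueError("empty grammar")
    return rules[0][0], rules


SAMPLE = "SAD$0.9\nDBC$1.0\nAa$0.5\nBb$0.8\nCc$0.6\n"
start, rules = parse_grammar(SAMPLE)
```

Each rule comes back as a tuple such as `("S", ("A", "D"), 0.9)`, which keeps the left side, right side, and probability together for the later steps.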
8
The inside algorithm
Purpose: compute \(\alpha(i,j,v)\), the probability of a subtree rooted at the non-terminal \(W_v\) deriving the subword \(x_i \ldots x_j\) of the sequence \(x_1 \ldots x_L\), given the grammar \(G\):
\[ \alpha(i,j,v) = P(W_v \Rightarrow^{*} x_i \ldots x_j \mid G) \]
It is computed recursively from the bottom up, starting with subwords of length one.
9
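[Figure: non-terminal v spans x_i ... x_j within the sequence 1 ... L and splits into y over x_i ... x_k and z over x_{k+1} ... x_j.]
Notation: \(t_v(y,z)\) is the probability of \(W_v \Rightarrow W_y W_z\); \(e_v(a)\) is the probability of \(W_v \Rightarrow a\).
Initialisation: for i = 1 to L, v = 1 to M:
\[ \alpha(i,i,v) = e_v(x_i) \]
Iteration: for i = L-1 down to 1, j = i+1 to L, v = 1 to M:
\[ \alpha(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i,k,y)\,\alpha(k+1,j,z)\,t_v(y,z) \]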
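The recursion can be sketched directly in code. This is an illustrative version under stated assumptions, not the project's implementation: the tables `e` and `t` are plain dictionaries keyed by symbol names rather than by the indices 1..M, positions are 1-based as on the slide, and the function name `inside` is hypothetical. The probability of the whole sequence is then read off as \(\alpha(1, L, \text{start})\), the standard termination step.

```python
from collections import defaultdict


def inside(x, nonterminals, e, t):
    """Inside probabilities alpha[i, j, v] for the sequence x (1-based spans).

    e maps (v, a) to the probability of W_v => a.
    t maps (v, y, z) to the probability of W_v => W_y W_z.
    Missing entries count as probability 0.
    """
    L = len(x)
    alpha = defaultdict(float)
    # Initialisation: subwords of length one.
    for i in range(1, L + 1):
        for v in nonterminals:
            alpha[i, i, v] = e.get((v, x[i - 1]), 0.0)
    # Iteration: i runs downward so that alpha[k+1, j, z] is already filled.
    for i in range(L - 1, 0, -1):
        for j in range(i + 1, L + 1):
            for (v, y, z), p in t.items():
                split_sum = sum(alpha[i, k, y] * alpha[k + 1, j, z]
                                for k in range(i, j))
                alpha[i, j, v] += split_sum * p
    return alpha


# Sample grammar from the slides, in normal form.
E = {("A", "a"): 0.5, ("B", "b"): 0.8, ("C", "c"): 0.6}
T = {("S", "A", "D"): 0.9, ("D", "B", "C"): 1.0}
NTS = ["S", "A", "B", "C", "D"]

alpha = inside("abc", NTS, E, T)
p_abc = alpha[1, 3, "S"]  # probability that the grammar produced "abc"
```

Plain products are fine for a sequence this short; for long sequences a log-space or scaled variant would be needed to avoid underflow.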
10
References
1. [Brown and Wilson, 1995] Brown, M. P. S. and Wilson, C. RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. In Hunter, L. and Klein, T., editors, Pacific Symposium on Biocomputing, pages 109-125.
2. [Brown, 1999] Brown, M. P. S. RNA Modeling Using Stochastic Context-Free Grammars. Ph.D. thesis.
3. [Eddy and Durbin, 1994] Eddy, S. R. and Durbin, R. RNA sequence analysis using covariance models. NAR, 22:2079-2088.
4. [Krogh et al., 1994] Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235:1501-1531.
5. [Lowe and Eddy, 1999] Lowe, T. and Eddy, S. A computational screen for methylation guide snoRNAs in yeast. Science, 283:1168-1171.
6. [Sakakibara et al., 1994] Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjolander, K., Underwood, R. C., and Haussler, D. Stochastic context-free grammars for tRNA modeling. NAR, 22:5112-5120.
7. [Underwood, 1994] Underwood, R. C. Stochastic context-free grammars for modeling three spliceosomal small nuclear ribonucleic acids. Master's thesis, University of California, Santa Cruz.
11
END
12
Outside Algorithm
The outside probability \(\beta(i,j,v)\) is the probability that, starting from the start non-terminal \(S\), the non-terminal \(W_v\) is generated and the string not dominated by it is \(x_1 \ldots x_{i-1}\) to the left and \(x_{j+1} \ldots x_L\) to the right:
\[ \beta(i,j,v) = P(S \Rightarrow^{*} x_1 \ldots x_{i-1}\, W_v\, x_{j+1} \ldots x_L \mid G) \]
The outside variable can be computed recursively, starting from the full sequence, where \(\beta(1,L,S) = 1\), and working towards shorter subwords:
\[ \beta(i,j,v) = \sum_{y} \sum_{z} \sum_{k=1}^{i-1} \alpha(k,i-1,z)\,\beta(k,j,y)\,t_y(z,v) + \sum_{y} \sum_{z} \sum_{k=j+1}^{L} \alpha(j+1,k,z)\,\beta(i,k,y)\,t_y(v,z) \]
The probability that the non-terminal \(W_v\) derives the subword \(x_i \ldots x_j\) is given by
\[ \frac{\alpha(i,j,v)\,\beta(i,j,v)}{P(x \mid G)} \]
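The outside recursion can be sketched in the same style as the inside recursion it depends on. This is an illustrative version under assumptions, not the project's code: tables are dictionaries keyed by symbol names, positions are 1-based, and the function names are hypothetical. A compact inside routine is included so the block runs on its own.

```python
from collections import defaultdict


def inside(x, nonterminals, e, t):
    """Inside probabilities alpha[i, j, v] (needed by the outside pass)."""
    L = len(x)
    alpha = defaultdict(float)
    for i in range(1, L + 1):
        for v in nonterminals:
            alpha[i, i, v] = e.get((v, x[i - 1]), 0.0)
    for i in range(L - 1, 0, -1):
        for j in range(i + 1, L + 1):
            for (v, y, z), p in t.items():
                alpha[i, j, v] += p * sum(alpha[i, k, y] * alpha[k + 1, j, z]
                                          for k in range(i, j))
    return alpha


def outside(x, t, alpha, start):
    """Outside probabilities beta[i, j, v], computed from the full span inward."""
    L = len(x)
    beta = defaultdict(float)
    beta[1, L, start] = 1.0  # nothing is excluded for the full span
    for i in range(1, L + 1):
        for j in range(L, i - 1, -1):
            if (i, j) == (1, L):
                continue
            for (y, left, right), p in t.items():
                # v is the right child of W_y; sibling 'left' covers x_k..x_{i-1}.
                beta[i, j, right] += p * sum(alpha[k, i - 1, left] * beta[k, j, y]
                                             for k in range(1, i))
                # v is the left child of W_y; sibling 'right' covers x_{j+1}..x_k.
                beta[i, j, left] += p * sum(alpha[j + 1, k, right] * beta[i, k, y]
                                            for k in range(j + 1, L + 1))
    return beta


E = {("A", "a"): 0.5, ("B", "b"): 0.8, ("C", "c"): 0.6}
T = {("S", "A", "D"): 0.9, ("D", "B", "C"): 1.0}
NTS = ["S", "A", "B", "C", "D"]
x = "abc"

alpha = inside(x, NTS, E, T)
beta = outside(x, T, alpha, "S")
p_x = alpha[1, len(x), "S"]
# Probability that D derives the subword x_2..x_3, as in the last formula.
posterior_D = alpha[2, 3, "D"] * beta[2, 3, "D"] / p_x
```

A useful consistency check is that summing \(\beta(i,i,v)\,e_v(x_i)\) over v gives the same \(P(x \mid G)\) at every position i.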
14
Why SCFG?
* More powerful
–models the evolutionary processes of mutation, insertion, and deletion
–models the interaction between base pairs
[Figure: RNA base-pairing diagram; labelled sequences GACGCAAGUC and UCGGAAACGA.]
15
Some Applications of SCFG Modeling
* tRNA
–[Sakakibara et al., 1994]
–[Eddy and Durbin, 1994]
* snRNAs
–[Underwood, 1994]
* A pseudoknotted biotin binder
–[Brown and Wilson, 1995]
* snoRNA
–[Lowe and Eddy, 1999]
* Small subunit ribosomal RNA
–[Brown, 1999]
16
Working Procedure
3. Generate e and t
–e: probability for rules like W => a
–t: probability for rules like W => X Y
4. Compute \(\alpha(i,j,v)\)
–the probability of a subtree rooted at the non-terminal \(W_v\) deriving the subword \(x_i \ldots x_j\):
\[ \alpha(i,j,v) = P(W_v \Rightarrow^{*} x_i \ldots x_j \mid G) \]
–computed recursively in a bottom-up fashion, starting with subwords of length one:
\[ \alpha(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i,k,y)\,\alpha(k+1,j,z)\,t_v(y,z) \]
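Step 3 can be sketched as a single pass over the parsed rules. The second function below is a rough sketch of the CYK step named in the outline, which the slides do not spell out: it is assumed here to follow the same recursion as step 4 with a maximum in place of the sums, keeping the best subtree per cell. Rules are assumed to be `(lhs, rhs, prob)` tuples, and all function names are hypothetical.

```python
def build_tables(rules):
    """Split (lhs, rhs, prob) rules into emission table e and transition table t."""
    e, t = {}, {}
    for lhs, rhs, prob in rules:
        if len(rhs) == 1:
            e[lhs, rhs[0]] = prob          # W => a
        else:
            t[lhs, rhs[0], rhs[1]] = prob  # W => X Y
    return e, t


def cyk(x, e, t, start):
    """Return (probability, tree) of the most probable parse, or (0.0, None)."""
    L = len(x)
    best = {}  # (i, j, v) -> (probability, parse tree), 1-based spans
    for i in range(1, L + 1):
        for (v, a), p in e.items():
            if a == x[i - 1]:
                best[i, i, v] = (p, (v, a))
    for i in range(L - 1, 0, -1):
        for j in range(i + 1, L + 1):
            for (v, y, z), p in t.items():
                for k in range(i, j):
                    if (i, k, y) in best and (k + 1, j, z) in best:
                        p_left, tree_left = best[i, k, y]
                        p_right, tree_right = best[k + 1, j, z]
                        cand = p * p_left * p_right
                        if cand > best.get((i, j, v), (0.0, None))[0]:
                            best[i, j, v] = (cand, (v, tree_left, tree_right))
    return best.get((1, L, start), (0.0, None))


RULES = [("S", ("A", "D"), 0.9), ("D", ("B", "C"), 1.0),
         ("A", ("a",), 0.5), ("B", ("b",), 0.8), ("C", ("c",), 0.6)]
e, t = build_tables(RULES)
prob, tree = cyk("abc", e, t, "S")
```

The parse tree is returned as nested tuples, with each node written as `(non-terminal, left subtree, right subtree)` or `(non-terminal, terminal)`.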