Project 4: Information Discovery Using Stochastic Context-Free Grammars (SCFG)
Wei Du, Ranjan Santra
May 16, 2001
Outline
* Goals of the project
* Quick review of background material
* Data input and parsing
* The inside algorithm
* The Cocke-Younger-Kasami (CYK) algorithm
* Implementation details and results
Goals for this project
* Implement a stochastic context-free grammar
* Model a small sequence using a sample SCFG
* Build a user interface for easy definition of the grammar
* Read the grammar into memory and compute
  – a. the probability that the specified grammar produced a sample sequence
  – b. the most probable parse tree for that sequence
* Remaining issue: parameter re-estimation
Quick Review
* Context-Free Grammar (CFG): every production has the form (1 non-terminal) => (any number of terminals/non-terminals)
* The same CFG in Chomsky Normal Form allows only \( W_v \to W_y W_z \) or \( W_v \to a \), i.e. (1 non-terminal) => (two non-terminals or 1 terminal)
* Any CFG can be put into normal form by adding additional non-terminals
* We choose normal form for computational ease
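The step of adding extra non-terminals can be sketched as below. This is a minimal illustration, not the project's code: it assumes rules are stored as (lhs, rhs) pairs with rhs a tuple of symbol names, it invents fresh names X1, X2, ... (a hypothetical naming scheme), and it only splits right-hand sides longer than two symbols. The other steps of a full normal-form conversion (unit productions, terminals inside long rules) are not handled.

```python
def binarize(rules):
    """Split every rule with more than two right-hand-side symbols into binary rules.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbol names.
    Fresh non-terminals are named X1, X2, ... (hypothetical naming scheme).
    """
    out = []
    fresh = 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            fresh += 1
            new_nt = f"X{fresh}"
            # lhs keeps the first symbol; the new non-terminal covers the rest.
            out.append((lhs, (rhs[0], new_nt)))
            lhs, rhs = new_nt, rhs[1:]
        out.append((lhs, rhs))
    return out


rules = [("S", ("A", "B", "C")), ("A", ("a",)), ("B", ("b",)), ("C", ("c",))]
cnf = binarize(rules)
```

Here the rule S => A B C is replaced by S => A X1 and X1 => B C, while the terminal rules pass through unchanged.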
Stochastic Context-Free Grammar in Chomsky Normal Form
Example grammar: S => ABC (0.9), A => a (0.5), B => b (0.8), C => c (0.6)
Same grammar in normal form: S => AD (0.9), D => BC (1.0), A => a (0.5), B => b (0.8), C => c (0.6)
* All productions have associated transition probabilities
* Given a grammar in this form and a sample sequence, we want to
  – compute the probability that this grammar produced the sequence (inside algorithm)
  – find the optimal parse tree through this grammar that results in the sample sequence (CYK algorithm)
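As a quick check of what the two algorithms should return on this example: the sequence abc has exactly one parse under the normal-form grammar (S => A D, D => B C, then A => a, B => b, C => c). The inside probability and the probability of the best parse therefore coincide, and both equal the product of the probabilities of the rules used:

```latex
\[
P(abc \mid G) = \underbrace{0.9}_{S \to AD} \times \underbrace{1.0}_{D \to BC}
  \times \underbrace{0.5}_{A \to a} \times \underbrace{0.8}_{B \to b}
  \times \underbrace{0.6}_{C \to c} = 0.216
\]
```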
[Figure: grammar storage. Non-terminal productions W1 ... WM are indexed for easy lookup and stored in linked lists; terminal symbols a, b, c.]
Solving the problem
1. Users input their grammar in normal form
2. Grammar is written to a file:
   SAD$0.9   (S => A D with probability 0.9)
   DBC$1.0
   Aa$0.5
   Bb$0.8
   Cc$0.6
* One line per production rule
* The grammar starts with the start symbol
* Each line denotes a transition between non-terminals or between a non-terminal and a terminal
* The probability of a transition is given after the $ symbol
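Reading this file format back into memory might look like the sketch below. It is an illustration under assumptions, not the project's parser: it assumes every symbol is a single character, so the first character of a line is the left-hand side and the remaining characters before the $ are the right-hand side. The function name `parse_grammar` is hypothetical.

```python
def parse_grammar(lines):
    """Parse lines like 'SAD$0.9' into (start_symbol, rules).

    Assumes single-character symbols: the first character is the left-hand
    side, the remaining characters before '$' are the right-hand side.
    Each rule is returned as (lhs, rhs_tuple, probability).
    """
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        body, prob_text = line.split("$", 1)
        lhs, rhs = body[0], tuple(body[1:])
        if len(rhs) not in (1, 2):
            raise ValueError(f"not in normal form: {line!r}")
        rules.append((lhs, rhs, float(prob_text)))
    if not rules:
        raise ValueError("empty grammar")
    start = rules[0][0]  # the grammar starts with the start symbol
    return start, rules


example = ["SAD$0.9", "DBC$1.0", "Aa$0.5", "Bb$0.8", "Cc$0.6"]
start, rules = parse_grammar(example)
```

Rejecting right-hand sides that are not of length one or two gives an early error if a user enters a rule that is not in normal form.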
The inside algorithm
Purpose: compute \( \alpha(i,j,v) \), the probability of a subtree rooted at the non-terminal \( W_v \) deriving the subword \( x_i \ldots x_j \) of the sequence \( x_1 \ldots x_L \), given the grammar G:
\[ \alpha(i,j,v) = P(W_v \Rightarrow^{*} x_i \ldots x_j \mid G) \]
It is computed recursively from the bottom up, starting with subwords of length one.
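[Figure: within the sequence \( x_1 \ldots x_L \), the non-terminal \( W_v \) spanning \( x_i \ldots x_j \) splits into \( W_y \) over \( x_i \ldots x_k \) and \( W_z \) over \( x_{k+1} \ldots x_j \).]
Notation: \( t_v(y,z) \) is the probability of the rule \( W_v \to W_y W_z \); \( e_v(a) \) is the probability of the rule \( W_v \to a \).
Initialisation: for i = 1 to L, v = 1 to M:
\[ \alpha(i,i,v) = e_v(x_i) \]
Iteration: for i = L-1 down to 1, j = i+1 to L, v = 1 to M:
\[ \alpha(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i,k,y)\, \alpha(k+1,j,z)\, t_v(y,z) \]
Termination: \( P(x \mid G) = \alpha(1,L,1) \), taking \( W_1 \) as the start non-terminal.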
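The recursion can be sketched in Python as follows. This is a minimal illustration rather than the project's implementation: it assumes the grammar is held in two dictionaries (`binary` for the \( t_v(y,z) \) values and `unary` for the \( e_v(a) \) values) keyed by symbol names instead of the indices 1..M, and it loops only over rules that exist rather than over all (y, z) pairs.

```python
from collections import defaultdict


def inside(seq, binary, unary):
    """Inside probabilities alpha[(i, j, v)] for 1-based positions i <= j.

    binary: dict mapping (v, y, z) -> t_v(y, z) for rules W_v -> W_y W_z
    unary:  dict mapping (v, a)    -> e_v(a)    for rules W_v -> a
    """
    L = len(seq)
    alpha = defaultdict(float)  # missing entries count as probability 0
    # Initialisation: subwords of length one.
    for i in range(1, L + 1):
        for (v, a), p in unary.items():
            if a == seq[i - 1]:
                alpha[(i, i, v)] += p
    # Iteration: i runs downward so alpha(k+1, j, z) is already available.
    for i in range(L - 1, 0, -1):
        for j in range(i + 1, L + 1):
            for (v, y, z), t in binary.items():
                for k in range(i, j):
                    alpha[(i, j, v)] += alpha[(i, k, y)] * alpha[(k + 1, j, z)] * t
    return alpha


binary = {("S", "A", "D"): 0.9, ("D", "B", "C"): 1.0}
unary = {("A", "a"): 0.5, ("B", "b"): 0.8, ("C", "c"): 0.6}
alpha = inside("abc", binary, unary)
p_abc = alpha[(1, 3, "S")]  # probability that the grammar produced "abc"
```

Replacing the sum over y, z and k with a maximum, and recording which (y, z, k) achieved it, turns the same loop structure into the CYK recursion for the most probable parse.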
References
1. [Brown and Wilson, 1995] Brown, M.P.S. and Wilson, C. RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search. In Hunter, L. and Klein, T., editors, Pacific Symposium on Biocomputing.
2. [Brown, 1999] Brown, M.P.S. RNA Modeling Using Stochastic Context-Free Grammars. Ph.D. thesis.
3. [Eddy and Durbin, 1994] Eddy, S. R. and Durbin, R. (1994). RNA sequence analysis using covariance models. NAR, 22.
4. [Krogh et al., 1994] Krogh, A., Brown, M., Mian, I. S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: Applications to protein modeling. JMB, 235.
5. [Lowe and Eddy, 1999] Lowe, T. and Eddy, S. A computational screen for methylation guide snoRNAs in yeast. Science, 283.
6. [Sakakibara et al., 1994] Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjolander, K., Underwood, R. C., and Haussler, D. Stochastic context-free grammars for tRNA modeling. NAR, 22.
7. [Underwood, 1994] Underwood, R. C. Stochastic context-free grammars for modeling three spliceosomal small nuclear ribonucleic acids. Master's thesis, University of California, Santa Cruz.
END
Outside Algorithm
The outside probability \( \beta(i,j,v) \) is the probability that, starting from the start non-terminal, the non-terminal \( W_v \) is generated and the string not dominated by it is \( x_1 \ldots x_{i-1} \) to the left and \( x_{j+1} \ldots x_L \) to the right:
\[ \beta(i,j,v) = P(S \Rightarrow^{*} x_1 \ldots x_{i-1}\, W_v\, x_{j+1} \ldots x_L \mid G) \]
The outside variable can be computed recursively, starting with the largest excluded subsequence. With \( \alpha \) denoting the inside probability:
\[ \beta(i,j,v) = \sum_{y} \sum_{z} \sum_{k=1}^{i-1} \alpha(k,i-1,z)\, \beta(k,j,y)\, t_y(z,v) \;+\; \sum_{y} \sum_{z} \sum_{k=j+1}^{L} \alpha(j+1,k,z)\, \beta(i,k,y)\, t_y(v,z) \]
The probability that the non-terminal \( W_v \) derives the subword (i, j) is given by
\[ \frac{\alpha(i,j,v)\, \beta(i,j,v)}{P(x \mid G)} \]
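A sketch of the outside recursion in Python is given below, together with the inside pass it depends on. It is an illustration under assumptions rather than the project's code: rule probabilities are assumed to be stored in dictionaries keyed by symbol names, the start value \( \beta(1,L,S) = 1 \) is the usual initialisation (it is not stated on the slide), and the loop order (i ascending, j descending) is chosen so that every \( \beta \) value on the right-hand side is already final when it is read.

```python
from collections import defaultdict


def inside(seq, binary, unary):
    """alpha[(i, j, v)], 1-based; binary[(v, y, z)] = t_v(y,z), unary[(v, a)] = e_v(a)."""
    L = len(seq)
    alpha = defaultdict(float)
    for i in range(1, L + 1):
        for (v, a), p in unary.items():
            if a == seq[i - 1]:
                alpha[(i, i, v)] += p
    for i in range(L - 1, 0, -1):
        for j in range(i + 1, L + 1):
            for (v, y, z), t in binary.items():
                for k in range(i, j):
                    alpha[(i, j, v)] += alpha[(i, k, y)] * alpha[(k + 1, j, z)] * t
    return alpha


def outside(seq, binary, alpha, start):
    """beta[(i, j, v)], computed from wide spans inward using the inside values."""
    L = len(seq)
    beta = defaultdict(float)
    beta[(1, L, start)] = 1.0  # assumed initialisation: start symbol spans everything
    for i in range(1, L + 1):
        for j in range(L, i - 1, -1):
            for (y, left, right), t in binary.items():
                # First sum: v is the right child; its sibling covers x_k..x_{i-1}.
                for k in range(1, i):
                    beta[(i, j, right)] += alpha[(k, i - 1, left)] * beta[(k, j, y)] * t
                # Second sum: v is the left child; its sibling covers x_{j+1}..x_k.
                for k in range(j + 1, L + 1):
                    beta[(i, j, left)] += alpha[(j + 1, k, right)] * beta[(i, k, y)] * t
    return beta


binary = {("S", "A", "D"): 0.9, ("D", "B", "C"): 1.0}
unary = {("A", "a"): 0.5, ("B", "b"): 0.8, ("C", "c"): 0.6}
seq = "abc"
alpha = inside(seq, binary, unary)
beta = outside(seq, binary, alpha, "S")
p = alpha[(1, len(seq), "S")]
posterior_D = alpha[(2, 3, "D")] * beta[(2, 3, "D")] / p
```

Because abc has a single parse under this grammar, every non-terminal on that parse has posterior 1 for its span, which makes the example a convenient sanity check.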
Why SCFG?
* More powerful
  – evolutionary processes of mutation, insertion, deletion
  – interaction between base pairs
[Figure: RNA structure diagram showing paired bases; sequences GACGCAAGUC and UCGGAAACGA.]
Some Applications of SCFG Modeling
* tRNA – [Sakakibara et al., 1994a], [Eddy and Durbin, 1994]
* snRNAs – [Underwood, 1994]
* a pseudoknotted biotin binder – [Brown and Wilson, 1995]
* snoRNA – [Lowe and Eddy, 1999]
* small subunit ribosomal RNA – [Brown, 1999]
Working Procedure
3. Generate e and t
   – e: probability for rules like W => a
   – t: probability for rules like W => X Y
4. Compute \( \alpha(i,j,v) \)
   – the probability of a subtree rooted at the non-terminal \( W_v \) deriving the subword \( x_i \ldots x_j \):
\[ \alpha(i,j,v) = P(W_v \Rightarrow^{*} x_i \ldots x_j \mid G) \]
   – computed recursively in a bottom-up fashion, starting with subwords of length one:
\[ \alpha(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i,k,y)\, \alpha(k+1,j,z)\, t_v(y,z) \]
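Step 3 can be sketched as below. This is an illustration under assumptions, not the project's data structure (the slides describe linked lists): it assumes rules are already available as (lhs, rhs, prob) triples and numbers the non-terminals 1..M in order of first appearance on a left-hand side, so the start symbol receives index 1. The name `build_tables` is hypothetical.

```python
def build_tables(rules):
    """Split parsed rules into emission (e) and transition (t) tables.

    rules: list of (lhs, rhs, prob); rhs is a tuple holding either one
    terminal or two non-terminals. Non-terminals are numbered 1..M in order
    of first appearance as a left-hand side, so the start symbol is 1.
    """
    index = {}
    for lhs, _, _ in rules:
        index.setdefault(lhs, len(index) + 1)
    e = {}  # e[(v, a)]    = probability of W_v -> a
    t = {}  # t[(v, y, z)] = probability of W_v -> W_y W_z
    for lhs, rhs, prob in rules:
        v = index[lhs]
        if len(rhs) == 1:
            e[(v, rhs[0])] = prob
        else:
            # A KeyError here means a non-terminal has no production of its own.
            y, z = index[rhs[0]], index[rhs[1]]
            t[(v, y, z)] = prob
    return index, e, t


rules = [
    ("S", ("A", "D"), 0.9),
    ("D", ("B", "C"), 1.0),
    ("A", ("a",), 0.5),
    ("B", ("b",), 0.8),
    ("C", ("c",), 0.6),
]
index, e, t = build_tables(rules)
```

With integer indices in place, \( t_v(y,z) \) and \( e_v(a) \) become direct dictionary lookups inside the triple loop of step 4.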