Project 4: Information Discovery Using Stochastic Context-Free Grammars (SCFG). Wei Du, Ranjan Santra. May 16, 2001.



Outline
– Goals of the project
– Quick review of background material
– Data input and parsing
– The inside algorithm
– The Cocke-Younger-Kasami (CYK) algorithm
– Implementation details and results

Goals for this project
– Build a user interface for easy definition of the grammar
– Read the grammar into memory and compute
  a. the probability that the specified grammar produced a sample sequence
  b. the most probable parse tree for that sequence
– Implement a stochastic context-free grammar
– Model a small sequence using a sample SCFG
– Remaining issue: parameter re-estimation

Quick Review
– Context-free grammar: productions of the form W => β, i.e. one non-terminal rewrites to any string of terminals and non-terminals.
– The same CFG in Chomsky Normal Form allows only W_v => W_x W_y or W_v => a (a terminal), i.e. one non-terminal rewrites to either two non-terminals or one terminal.
– Any CFG can be put into Chomsky Normal Form by adding additional non-terminals.
– We choose Chomsky Normal Form for computational ease.

Stochastic Context-Free Grammar in Chomsky Normal Form
Example grammar:       S => ABC (0.9)   A => a (0.5)   B => b (0.8)   C => c (0.6)
Same grammar in CNF:   S => AD (0.9)    D => BC (1.0)   A => a (0.5)   B => b (0.8)   C => c (0.6)
– All productions have associated transition probabilities.
– Given a grammar in this form and a sample sequence, we want to compute:
  – the probability that this grammar produced the sequence (inside algorithm)
  – the optimal parse tree through this grammar that results in the sample sequence (CYK algorithm), sketched below
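The CYK variant used for SCFGs replaces the inside algorithm's sums with maximisations and keeps back-pointers so the most probable parse tree can be recovered. Below is a minimal sketch, assuming the grammar is held in dictionaries `emit` (probabilities of rules W_v => a) and `trans` (probabilities of rules W_v => W_y W_z); these names and the dictionary representation are illustrative assumptions, not the project's actual code.

```python
# A minimal sketch of the CYK algorithm for an SCFG in Chomsky Normal Form.
# The grammar layout (emit[v][a] for W_v => a, trans[v][(y, z)] for
# W_v => W_y W_z) is an illustrative assumption; indices are 0-based.
def cyk(seq, nonterminals, start, emit, trans):
    L = len(seq)
    gamma = {}  # gamma[(i, j, v)] = probability of the best parse of seq[i..j] from W_v
    back = {}   # back-pointers (y, z, k) for recovering the most probable parse tree

    # Initialisation: subwords of length one, W_v => x_i
    for i in range(L):
        for v in nonterminals:
            gamma[(i, i, v)] = emit.get(v, {}).get(seq[i], 0.0)

    # Iteration over increasing subword length: try every rule W_v => W_y W_z
    # and every split point k, keeping the best choice instead of summing.
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterminals:
                best, choice = 0.0, None
                for (y, z), p in trans.get(v, {}).items():
                    for k in range(i, j):
                        cand = (gamma.get((i, k, y), 0.0)
                                * gamma.get((k + 1, j, z), 0.0) * p)
                        if cand > best:
                            best, choice = cand, (y, z, k)
                gamma[(i, j, v)] = best
                back[(i, j, v)] = choice

    # Probability of the single most probable parse of the whole sequence
    return gamma.get((0, L - 1, start), 0.0), back
```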

[Figure: the grammar is stored in linked lists — non-terminal productions (W1, W2, W3, W4, …, WM) are indexed for easy lookup, together with the terminal symbols (a, b, c).]
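For illustration only, the same indexed lookup can be expressed with dictionaries keyed by the non-terminal; this is a sketch under that assumption, not the project's linked-list implementation.

```python
# Hedged sketch of an in-memory grammar with indexed lookup by non-terminal.
# The class and field names are illustrative assumptions for the examples below.
from collections import defaultdict

class Grammar:
    def __init__(self, start):
        self.start = start
        self.trans = defaultdict(dict)  # trans[v][(y, z)] = P(W_v => W_y W_z)
        self.emit = defaultdict(dict)   # emit[v][a]       = P(W_v => a)

    def add_rule(self, lhs, rhs, prob):
        if len(rhs) == 2:    # non-terminal production W_v => W_y W_z
            self.trans[lhs][(rhs[0], rhs[1])] = prob
        elif len(rhs) == 1:  # terminal production W_v => a
            self.emit[lhs][rhs[0]] = prob
        else:
            raise ValueError("grammar must be in Chomsky Normal Form")

    @property
    def nonterminals(self):
        return set(self.trans) | set(self.emit)
```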

Solving the problem
1. Users input their grammar in Chomsky Normal Form.
2. The grammar is written to a file, e.g.:
     SAD$0.9   (i.e. S => A D with probability 0.9)
     DBC$1.0
     Aa$0.5
     Bb$0.8
     Cc$0.6
   * One line per production rule.
   * The grammar starts with the start symbol.
   * Each line denotes a transition between non-terminals or between a non-terminal and a terminal.
   * The probability of the transition is given after the $ symbol.
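Reading this format back into memory is straightforward; here is a minimal sketch, assuming single-character symbols and the illustrative `Grammar` class above (not the project's actual parser).

```python
# Hedged sketch: parse the one-rule-per-line grammar file described above.
# Assumes single-character symbols and CNF rules; "path" and the Grammar
# class are illustrative assumptions.
def load_grammar(path):
    grammar = None
    with open(path) as f:
        for line in f:
            line = line.split("(")[0].strip()    # drop any trailing comment
            if not line:
                continue
            rule, prob = line.split("$")         # e.g. "SAD$0.9" -> "SAD", "0.9"
            lhs, rhs = rule[0], list(rule[1:])   # first symbol is the left-hand side
            if grammar is None:                  # the file starts with the start symbol
                grammar = Grammar(start=lhs)
            grammar.add_rule(lhs, rhs, float(prob))
    return grammar
```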

The inside algorithm
Purpose: compute α(i,j,v), the probability of a subtree rooted at the non-terminal W_v deriving the subword (x_i … x_j) of the sequence (x_1 … x_L), given the grammar G:
    α(i,j,v) = P(W_v =>* x_i … x_j | G)
It is computed in a recursive manner from the bottom up, starting with subwords of length one.

[Figure: the subword x_i … x_j is split at position k; W_v produces W_y, which derives x_i … x_k, and W_z, which derives x_(k+1) … x_j.]
Initialisation: for i = 1 to L, v = 1 to M:
    α(i,i,v) = e_v(x_i)
Iteration: over increasing subword length, for all i < j and v = 1 to M:
    α(i,j,v) = Σ_(y=1..M) Σ_(z=1..M) Σ_(k=i..j-1) α(i,k,y) · α(k+1,j,z) · t_v(y,z)
where t_v(y,z) is the probability of the production W_v => W_y W_z and e_v(a) is the probability of W_v => a.
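A minimal sketch of this recursion in Python, reusing the illustrative `emit`/`trans` dictionaries assumed earlier (0-based indices). In practice the sums would be accumulated in log space or with scaling to avoid underflow on long sequences.

```python
# Hedged sketch of the inside algorithm for an SCFG in Chomsky Normal Form.
# alpha[(i, j, v)] = P(W_v derives seq[i..j] | G); 0-based indices.
def inside(seq, nonterminals, start, emit, trans):
    L = len(seq)
    alpha = {}

    # Initialisation: subwords of length one, alpha(i, i, v) = e_v(x_i)
    for i in range(L):
        for v in nonterminals:
            alpha[(i, i, v)] = emit.get(v, {}).get(seq[i], 0.0)

    # Iteration over increasing subword length: sum over rules W_v => W_y W_z
    # and split points k, exactly as in the recursion above.
    for span in range(2, L + 1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterminals:
                total = 0.0
                for (y, z), p in trans.get(v, {}).items():
                    for k in range(i, j):
                        total += (alpha.get((i, k, y), 0.0)
                                  * alpha.get((k + 1, j, z), 0.0) * p)
                alpha[(i, j, v)] = total

    # P(x | G) is the inside value of the start non-terminal over the whole sequence
    return alpha, alpha.get((0, L - 1, start), 0.0)
```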

References
1. [Brown and Wilson, 1995] Brown, M.P.S. and Wilson, C. RNA pseudoknot modeling using intersections of stochastic context-free grammars with applications to database search. In Hunter, L. and Klein, T., editors, Pacific Symposium on Biocomputing.
2. [Brown, 1999] Brown, M.P.S. RNA Modeling Using Stochastic Context-Free Grammars. Ph.D. thesis.
3. [Eddy and Durbin, 1994] Eddy, S.R. and Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Research, 22.
4. [Krogh et al., 1994] Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 235.
5. [Lowe and Eddy, 1999] Lowe, T. and Eddy, S. A computational screen for methylation guide snoRNAs in yeast. Science, 283.
6. [Sakakibara et al., 1994] Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C., and Haussler, D. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 22.
7. [Underwood, 1994] Underwood, R.C. Stochastic context-free grammars for modeling three spliceosomal small nuclear ribonucleic acids. Master's thesis, University of California, Santa Cruz.

END

Outside Algorithm
The outside probability β(i,j,v) is the probability that, starting from the start non-terminal, the non-terminal W_v is generated and the string not dominated by it is (x_1 … x_(i-1)) to the left and (x_(j+1) … x_L) to the right:
    β(i,j,v) = P(S =>* x_1 … x_(i-1) W_v x_(j+1) … x_L | G)
The outside variable can be computed recursively, starting from the full sequence and working inwards:
    β(i,j,v) = Σ_y Σ_z Σ_(k=1..i-1) α(k,i-1,z) · β(k,j,y) · t_y(z,v)  +  Σ_y Σ_z Σ_(k=j+1..L) α(j+1,k,z) · β(i,k,y) · t_y(v,z)
The probability that the non-terminal W_v derives the subword (x_i … x_j) is then given as
    α(i,j,v) · β(i,j,v) / P(x|G)
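A hedged sketch of this recursion, reusing the `alpha` table returned by the inside-algorithm sketch and the same illustrative `trans` dictionary (0-based indices; not the project's actual code). The product α(i,j,v)·β(i,j,v)/P(x|G) then gives the posterior probability used for parameter re-estimation.

```python
# Hedged sketch of the outside algorithm; "alpha" is the table produced by the
# inside() sketch above and "trans" the illustrative rule-probability dictionary.
def outside(seq, nonterminals, start, trans, alpha):
    L = len(seq)
    beta = {}

    # Initialisation: over the whole sequence, only the start symbol has beta = 1
    for v in nonterminals:
        beta[(0, L - 1, v)] = 1.0 if v == start else 0.0

    # Iteration from the largest spans inwards to single positions
    for span in range(L - 1, 0, -1):
        for i in range(L - span + 1):
            j = i + span - 1
            for v in nonterminals:
                total = 0.0
                for y in nonterminals:
                    for (left, right), p in trans.get(y, {}).items():
                        if right == v:   # W_y => W_z W_v, sibling subtree on the left
                            for k in range(0, i):
                                total += (alpha.get((k, i - 1, left), 0.0)
                                          * beta[(k, j, y)] * p)
                        if left == v:    # W_y => W_v W_z, sibling subtree on the right
                            for k in range(j + 1, L):
                                total += (alpha.get((j + 1, k, right), 0.0)
                                          * beta[(i, k, y)] * p)
                beta[(i, j, v)] = total
    return beta
```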


Why SCFG?
More powerful, because it can model:
– evolutionary processes of mutation, insertion, and deletion
– the interaction between base pairs
[Figure: RNA base-pairing diagram involving the sequences GACGCAAGUC and UCGGAAACGA.]

Some applications of SCFG modeling
– tRNA: [Sakakibara et al., 1994], [Eddy and Durbin, 1994]
– snRNAs: [Underwood, 1994]
– a pseudoknotted biotin binder: [Brown and Wilson, 1995]
– snoRNA: [Lowe and Eddy, 1999]
– small subunit ribosomal RNA: [Brown, 1999]

Working Procedure
3. Generate e and t:
   – e: probabilities for rules of the form W => a
   – t: probabilities for rules of the form W => X Y
4. Compute α(i,j,v):
   – the probability of a subtree rooted at the non-terminal W_v deriving the subword (x_i … x_j)
   – α(i,j,v) = P(W_v =>* x_i … x_j | G)
   – computed recursively in a bottom-up fashion, starting with subwords of length one:
     α(i,j,v) = Σ_(y=1..M) Σ_(z=1..M) Σ_(k=i..j-1) α(i,k,y) · α(k+1,j,z) · t_v(y,z)
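Putting the earlier sketches together, steps 1–4 could be exercised end to end roughly as follows; the file name and helper functions are the illustrative ones assumed above, not the project's actual interface.

```python
# Hedged end-to-end usage sketch: load a grammar file in the format described
# earlier, then score a sample sequence with the inside and CYK sketches above.
# "toy_grammar.txt" and all helper names are illustrative assumptions.
if __name__ == "__main__":
    grammar = load_grammar("toy_grammar.txt")   # steps 1-3: build the e and t tables
    sequence = "abc"                            # sample sequence to be scored

    alpha, prob = inside(sequence, grammar.nonterminals, grammar.start,
                         grammar.emit, grammar.trans)
    best, back = cyk(sequence, grammar.nonterminals, grammar.start,
                     grammar.emit, grammar.trans)

    print("P(sequence | grammar) =", prob)      # inside algorithm (step 4)
    print("P(most probable parse) =", best)     # CYK / most probable parse tree
```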