Project No. 4: Information Discovery Using Stochastic Context-Free Grammars (SCFG). Wei Du, Ranjan Santra. May 16, 2001.

Why SCFG?
More powerful:
– models the evolutionary processes of mutation, insertion, and deletion
– captures the interaction between base pairs
[figure: RNA secondary-structure diagram built from the sequences GACGCAAGUC and UCGGAAACGA, showing paired bases in a stem-loop]

Problems that can be solved with SCFGs
Scoring problem
– compute P(x | G)
– inside algorithm
Alignment problem
– find the most likely generating path
– Cocke-Younger-Kasami (CYK) algorithm
Training problem
– parameter re-estimation
– inside-outside algorithm

Normal Form for SCFG
SCFGs can have an unlimited number of symbols on the right-hand side of a production:
S → ABC (0.9)  A → a (0.5)  B → b (0.8)  C → c (0.6)
Chomsky Normal Form
– requires all production rules to be of the form W → XY or W → a:
S → AD (0.9)  D → BC (1.0)  A → a (0.5)  B → b (0.8)  C → c (0.6)
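The conversion above is mechanical. The sketch below is our own illustration, not part of the original project; the function name binarize and the pool of fresh symbols are hypothetical. It splits one over-long rule into CNF binary rules, reproducing the S → ABC example:

```python
def binarize(lhs, rhs, prob, fresh_names=("D", "E", "F")):
    """Split one rule with more than two right-hand-side symbols into
    CNF binary rules, e.g. S -> A B C (0.9) becomes
    S -> A D (0.9) and D -> B C (1.0)."""
    fresh = iter(fresh_names)
    out = []
    while len(rhs) > 2:
        new = next(fresh)
        out.append((lhs, (rhs[0], new), prob))
        # the introduced non-terminal rewrites deterministically
        lhs, rhs, prob = new, rhs[1:], 1.0
    out.append((lhs, tuple(rhs), prob))
    return out

print(binarize("S", ["A", "B", "C"], 0.9))
# [('S', ('A', 'D'), 0.9), ('D', ('B', 'C'), 1.0)]
```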

Example Grammar
A → BA (0.4)  A → CA (0.2)  A → a (0.3)  A → g (0.1)
B → CA (0.4)  B → g (0.3)  B → t (0.3)
C → CA (0.6)  C → t (0.2)  C → c (0.2)
[figure: the same grammar displayed as matrices, an emission matrix E (rows A, B, C; columns a, g, t, c) and a transition matrix T (rows A, B, C; columns indexed by non-terminal pairs)]
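For the algorithm slides that follow, here is one possible Python encoding of this example grammar (a sketch; the names NONTERMS, T, and E are ours, chosen to mirror the transition and emission matrices in the figure):

```python
NONTERMS = ["A", "B", "C"]

T = {  # transition rules: T[v][(y, z)] = t_v(y, z) = P(v -> y z)
    "A": {("B", "A"): 0.4, ("C", "A"): 0.2},
    "B": {("C", "A"): 0.4},
    "C": {("C", "A"): 0.6},
}

E = {  # emission rules: E[v][a] = P(v -> a)
    "A": {"a": 0.3, "g": 0.1},
    "B": {"g": 0.3, "t": 0.3},
    "C": {"t": 0.2, "c": 0.2},
}
```

Each non-terminal's rule probabilities sum to one, as required for a proper SCFG.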

Inside Algorithm
Compute α(i, j, v)
– the probability of the subtree rooted at the non-terminal v deriving the subword (x_i ... x_j):
  $\alpha(i, j, v) = P(v \Rightarrow^* x_i \cdots x_j \mid G)$
– computed recursively in a bottom-up fashion, starting with subwords of length one:
  $$\alpha(i, j, v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i, k, y)\, \alpha(k+1, j, z)\, t_v(y, z)$$
  where M is the number of non-terminals and t_v(y, z) is the probability of the rule v → yz.

Inside Algorithm
Basic idea: calculate P
– start with subsequences of length 1
– then subsequences of length 2
– ...
– continue working on longer and longer subsequences
– until a probability is determined for the complete parse tree rooted at the start non-terminal

Example for the Inside Algorithm
Input sequence: tag. Non-terminals: A, B, C.
1. Calculate P for subsequences of length 1 (copied from the emission matrix E):
   A ⇒ t, A ⇒ a, A ⇒ g; B ⇒ t, B ⇒ a, B ⇒ g; C ⇒ t, C ⇒ a, C ⇒ g
2. Calculate P for subsequences of length 2:
   A ⇒ ta, A ⇒ ag; B ⇒ ta, B ⇒ ag; C ⇒ ta, C ⇒ ag
3. Calculate P for the subsequence of length 3: A ⇒ tag, summing over decompositions such as A → BA with B ⇒ ta and A ⇒ g, and A → CA with C ⇒ ta and A ⇒ g (B ⇒ ta in turn uses B → CA with C ⇒ t and A ⇒ a).
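A minimal Python sketch of the inside algorithm over the encoding introduced earlier (assuming those NONTERMS, T, and E definitions; we use 0-based, inclusive indices rather than the slides' 1-based ones):

```python
def inside(x, nonterms=NONTERMS, T=T, E=E):
    """Inside algorithm: alpha[(i, j, v)] = P(v =>* x[i..j] | G)."""
    L = len(x)
    alpha = {}
    # subwords of length 1: read straight from the emission matrix E
    for i in range(L):
        for v in nonterms:
            alpha[(i, i, v)] = E.get(v, {}).get(x[i], 0.0)
    # longer subwords: sum over rules v -> y z and split points k
    for length in range(2, L + 1):
        for i in range(L - length + 1):
            j = i + length - 1
            for v in nonterms:
                total = 0.0
                for (y, z), p in T.get(v, {}).items():
                    for k in range(i, j):
                        total += alpha[(i, k, y)] * alpha[(k + 1, j, z)] * p
                alpha[(i, j, v)] = total
    return alpha

alpha = inside("tag")
print(alpha[(0, 2, "A")])  # P("tag" | G) with start symbol A; about 0.00168
```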

Outside Algorithm
Compute β(i, j, v)
– the probability that, starting from the start non-terminal, the non-terminal v is generated, with the string not dominated by it being (x_1 ... x_{i-1}) to the left and (x_{j+1} ... x_L) to the right:
  $\beta(i, j, v) = P(S \Rightarrow^* x_1 \cdots x_{i-1}\, v\, x_{j+1} \cdots x_L \mid G)$

Outside Algorithm
Compute β(i, j, v)
– computed recursively, initialized at the full span with β(1, L, S) = 1 and working toward larger excluded subsequences:
  $$\beta(i, j, v) = \sum_{y} \sum_{z} \sum_{k=1}^{i-1} \alpha(k, i-1, z)\, \beta(k, j, y)\, t_y(z, v) + \sum_{y} \sum_{z} \sum_{k=j+1}^{L} \alpha(j+1, k, z)\, \beta(i, k, y)\, t_y(v, z)$$

Outside Algorithm
Basic idea: [figure: a parse tree rooted at S in which the non-terminal v spans x_i ... x_j; β(i, j, v) accounts for everything outside that span, x_1 ... x_{i-1} and x_{j+1} ... x_L]

Outside Algorithm
Basic idea (first recursion case): [figure: v is the right child of a parent y via y → zv; the sibling z derives x_k ... x_{i-1}, extending the excluded string on the left]

Outside Algorithm
Basic idea (second recursion case): [figure: v is the left child of a parent y via y → vz; the sibling z derives x_{j+1} ... x_k, extending the excluded string on the right]
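A matching sketch of the outside recursion (again our own illustration; it reuses the alpha table from inside() and the same grammar encoding). Spans are processed from largest to smallest so that every β value needed on the right-hand side already exists:

```python
def outside(x, alpha, start="A", nonterms=NONTERMS, T=T):
    """Outside algorithm: beta[(i, j, v)] = P(S =>* x[0..i-1] v x[j+1..] | G)."""
    L = len(x)
    beta = {(i, j, v): 0.0
            for i in range(L) for j in range(i, L) for v in nonterms}
    beta[(0, L - 1, start)] = 1.0  # full span, rooted at the start symbol
    for length in range(L - 1, 0, -1):
        for i in range(L - length + 1):
            j = i + length - 1
            for y in nonterms:                         # y: parent non-terminal
                for (c1, c2), p in T.get(y, {}).items():  # rule y -> c1 c2
                    # case 1: (i, j) is the right child c2; the sibling c1
                    # derives x[k..i-1] to the left
                    for k in range(i):
                        beta[(i, j, c2)] += alpha[(k, i - 1, c1)] * beta[(k, j, y)] * p
                    # case 2: (i, j) is the left child c1; the sibling c2
                    # derives x[j+1..k] to the right
                    for k in range(j + 1, L):
                        beta[(i, j, c1)] += alpha[(j + 1, k, c2)] * beta[(i, k, y)] * p
    return beta
```

As a sanity check, summing β(i, i, v) · P(v → x_i) over v should reproduce P(x | G) at every position i.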

CYK Algorithm
Basic idea: find the most likely generating path
– start with subsequences of length 1
– then subsequences of length 2
– ...
– continue working on longer and longer subsequences
– until a path is determined for the complete parse tree rooted at the start non-terminal

Example for the CYK Algorithm
Input sequence: tag. Non-terminals: A, B, C.
1. Find paths for subsequences of length 1 (copied from the emission matrix E):
   A ⇒ t, A ⇒ a, A ⇒ g; B ⇒ t, B ⇒ a, B ⇒ g; C ⇒ t, C ⇒ a, C ⇒ g
2. Find paths for subsequences of length 2:
   A ⇒ ta, A ⇒ ag; B ⇒ ta, B ⇒ ag; C ⇒ ta, C ⇒ ag
3. Find the path for the length-3 sequence: A ⇒ tag via the single most likely decomposition, A → BA with B ⇒ ta and A ⇒ g (B ⇒ ta via B → CA with C ⇒ t and A ⇒ a).
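The CYK sketch below is the inside algorithm with the sum replaced by a max, plus back-pointers from which the most likely parse can be traced (names and indexing conventions are ours):

```python
def cyk(x, start="A", nonterms=NONTERMS, T=T, E=E):
    """CYK: probability of the single best parse, with back-pointers."""
    L = len(x)
    gamma, back = {}, {}
    for i in range(L):
        for v in nonterms:
            gamma[(i, i, v)] = E.get(v, {}).get(x[i], 0.0)
    for length in range(2, L + 1):
        for i in range(L - length + 1):
            j = i + length - 1
            for v in nonterms:
                best, arg = 0.0, None
                for (y, z), p in T.get(v, {}).items():
                    for k in range(i, j):   # max instead of sum
                        s = gamma[(i, k, y)] * gamma[(k + 1, j, z)] * p
                        if s > best:
                            best, arg = s, (y, z, k)
                gamma[(i, j, v)], back[(i, j, v)] = best, arg
    return gamma[(0, L - 1, start)], back

prob, back = cyk("tag")
print(prob)  # about 0.00096, the A -> BA decomposition shown above
```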

References
1. Brown, M.P.S. and Wilson, C. (1995). RNA pseudoknot modeling using intersections of stochastic context-free grammars with applications to database search. In Hunter, L. and Klein, T., editors, Pacific Symposium on Biocomputing.
2. Brown, M.P.S. (1999). RNA Modeling Using Stochastic Context-Free Grammars. Ph.D. thesis.
3. Eddy, S.R. and Durbin, R. (1994). RNA sequence analysis using covariance models. NAR, 22.
4. Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994). Hidden Markov models in computational biology: applications to protein modeling. JMB, 235.
5. Lowe, T. and Eddy, S. (1999). A computational screen for methylation guide snoRNAs in yeast. Science, 283.
6. Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C., and Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. NAR, 22.
7. Underwood, R.C. (1994). Stochastic context-free grammars for modeling three spliceosomal small nuclear ribonucleic acids. Master's thesis, University of California, Santa Cruz.

A Test System
User interfaces:
– compute P(x | G)
– find the Viterbi path
Primary target: modeling a small sequence using an SCFG

Display Grammar Window

Calculating Probability for “tag”

Calculating Probability for “tgacg”

Finding Most Likely Path for “tag”

Finding Most Likely Path for “tgacg”

Finding Most Likely Path for “tgtacggta”

Training Algorithm
Basic idea:
– start with an original grammar G that can model a family of sequences, but not well enough
– feed it a sequence x
– compute the probability P(x | G)
– adjust G to G' by re-estimating its parameters (a sketch follows below)
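Here is a sketch of one re-estimation step of the inside-outside (EM) algorithm on a single training sequence, built on the inside() and outside() sketches above; expected rule-usage counts are normalized into the updated grammar G'. This is our illustration of the standard update, not code from the project, and it assumes P(x | G) > 0:

```python
def reestimate(x, start="A", nonterms=NONTERMS, T=T, E=E):
    """One inside-outside (EM) step: return re-estimated (T', E')."""
    L = len(x)
    alpha = inside(x, nonterms, T, E)
    beta = outside(x, alpha, start, nonterms, T)
    px = alpha[(0, L - 1, start)]                 # P(x | G), assumed > 0
    t_cnt = {v: {r: 0.0 for r in T.get(v, {})} for v in nonterms}
    e_cnt = {v: {a: 0.0 for a in E.get(v, {})} for v in nonterms}
    for v in nonterms:
        # expected number of times each transition rule v -> y z is used
        for (y, z), p in T.get(v, {}).items():
            for i in range(L):
                for j in range(i + 1, L):
                    for k in range(i, j):
                        t_cnt[v][(y, z)] += (beta[(i, j, v)] * alpha[(i, k, y)]
                                             * alpha[(k + 1, j, z)] * p) / px
        # expected number of times each emission rule v -> a is used
        for a in E.get(v, {}):
            for i in range(L):
                if x[i] == a:
                    e_cnt[v][a] += beta[(i, i, v)] * E[v][a] / px
    # normalize the expected counts into the new grammar G'
    new_T, new_E = {}, {}
    for v in nonterms:
        tot = sum(t_cnt[v].values()) + sum(e_cnt[v].values())
        new_T[v] = {r: c / tot for r, c in t_cnt[v].items()}
        new_E[v] = {a: c / tot for a, c in e_cnt[v].items()}
    return new_T, new_E
```

Iterating this step never decreases P(x | G), which is the sense in which G' improves on G.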