Stochastic Context Free Grammars for RNA Structure Modeling

Stochastic Context Free Grammars for RNA Structure Modeling
BMI/CS 776
www.biostat.wisc.edu/bmi776/
Spring 2018
Anthony Gitter
gitter@biostat.wisc.edu
These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Mark Craven, Colin Dewey, and Anthony Gitter

Goals for Lecture
Key concepts:
- transformational grammars
- the Chomsky hierarchy
- context free grammars
- stochastic context free grammars
- parsing ambiguity
Not covered, but here for your reference:
- the Inside and Outside algorithms
- parameter learning via the Inside-Outside algorithm

Modeling RNA with Stochastic Context Free Grammars
Consider tRNA genes:
- 274 in yeast genome, ~1500 in human genome
- get transcribed, like protein-coding genes
- don't get translated, therefore base statistics much different than protein-coding genes
- but secondary structure is conserved
To recognize new tRNA genes, model known ones using stochastic context free grammars [Eddy & Durbin, 1994; Sakakibara et al., 1994]
But what is a grammar?

Transformational Grammars
A transformational grammar characterizes a set of legal strings
The grammar consists of:
- a set of abstract nonterminal symbols
- a set of terminal symbols (those that actually appear in strings)
- a set of productions

A Grammar for Stop Codons
This grammar can generate the 3 stop codons: UAA, UAG, UGA
With a grammar we can ask questions like:
- what strings are derivable from the grammar?
- can a particular string be derived from the grammar?
- what sequence of productions can be used to derive a particular string from a given grammar?
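The productions of the stop-codon grammar appeared only as a figure in the original slides. One grammar consistent with the slide, generating exactly the three stop codons (the nonterminal names w1, w2, w3 are illustrative, not taken from the original), is:

    s  -> U w1
    w1 -> A w2 | G w3
    w2 -> A | G
    w3 -> A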

The Derivation for UAG
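Using the illustrative stop-codon grammar sketched above, a derivation of UAG applies one production per step:

    s => U w1 => U A w2 => U A G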

The Parse Tree for UAG

Some Shorthand

The Chomsky Hierarchy
A hierarchy of grammars defined by restrictions on productions: regular ⊂ context-free ⊂ context-sensitive ⊂ unrestricted

The Chomsky Hierarchy
- Regular grammars: productions of the form W → aW' or W → a
- Context-free grammars: productions of the form W → β
- Context-sensitive grammars: productions of the form αWγ → αβγ
- Unrestricted grammars: productions of the form αWγ → δ
where W, W' are nonterminals, a is a terminal, α, γ, δ are any sequence of terminals/nonterminals, and β is any non-null sequence of terminals/nonterminals

CFGs and RNA
Context free grammars are well suited to modeling RNA secondary structure because they can represent base pairing preferences
Example: a grammar for a 3-base stem with a loop of either GAAA or GCAA
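The stem-loop grammar itself was shown as a figure. A sketch of one such grammar, allowing any Watson-Crick pair at each of the three stem positions (the slide's figure may restrict the pairs differently; nonterminal names are illustrative):

    s  -> A w1 U | U w1 A | G w1 C | C w1 G     (outer base pair)
    w1 -> A w2 U | U w2 A | G w2 C | C w2 G     (middle base pair)
    w2 -> A w3 U | U w3 A | G w3 C | C w3 G     (inner base pair)
    w3 -> G A A A | G C A A                     (4-base loop)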

CFGs and RNA Figure from: Sakakibara et al. Nucleic Acids Research, 1994

Ambiguity in Parsing “I shot an elephant in my pajamas. How he got in my pajamas, I’ll never know.” – Groucho Marx

An Ambiguous RNA Grammar
With this grammar (shown as a figure in the slides), there are 3 parses for the string GGGAACC

A Probabilistic Version of the Stop Codon Grammar
[figure: the stop-codon productions annotated with probabilities 1.0, 1.0, 0.7, 0.2, 1.0, 0.3, 0.8]
Each production has an associated probability
Probabilities for productions with the same left-hand side sum to 1
This regular grammar has a corresponding Markov chain model
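The transcript preserves the probability values but not their mapping to productions. One illustrative assignment for the stop-codon grammar sketched earlier, satisfying the sum-to-1 constraint for each left-hand side (the actual mapping in the figure may differ):

    s  -> U w1    1.0
    w1 -> A w2    0.7
    w1 -> G w3    0.3
    w2 -> A       0.2
    w2 -> G       0.8
    w3 -> A       1.0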

Stochastic Context Free Grammars (a.k.a. Probabilistic Context Free Grammars)
[figure: an example SCFG with production probabilities such as 0.25/0.25/0.25/0.25, 0.1/0.4/0.4/0.1, and 0.8/0.2]

Stochastic Grammars?
"…the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." – Noam Chomsky (famed linguist)
"Every time I fire a linguist, the performance of the recognizer improves." – Fred Jelinek (former head of IBM speech recognition group)
Credit for pairing these quotes goes to Dan Jurafsky and James Martin, Speech and Language Processing

Three Key Questions
How likely is a given sequence?
- the Inside algorithm
What is the most probable parse for a given sequence?
- the Cocke-Younger-Kasami (CYK) algorithm
How can we learn the SCFG parameters given a grammar and a set of sequences?
- the Inside-Outside algorithm

Chomsky Normal Form
It is convenient to assume that our grammar is in Chomsky Normal Form; i.e. all productions are of one of two forms:
- W_v → W_y W_z   (right hand side consists of two nonterminals)
- W_v → a         (right hand side consists of a single terminal)
Any CFG can be put into Chomsky Normal Form

Converting a Grammar to CNF
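As an illustration (not from the slides), a base-pairing production such as w → G w C can be rewritten in Chomsky Normal Form by introducing nonterminals that emit single terminals and then binarizing:

    w   -> p_G w'       w'  -> w p_C
    p_G -> G            p_C -> C

Every right-hand side is now either two nonterminals or a single terminal, and the new grammar derives the same strings.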

Parameter Notation
For productions of the form W_v → W_y W_z and W_v → a, we'll denote the associated probability parameters:
- t_v(y, z)   transition
- e_v(a)      emission

Determining the Likelihood of a Sequence: The Inside Algorithm
A dynamic programming method, analogous to the Forward algorithm
Involves filling in a 3D matrix representing the probability of all parse subtrees rooted at nonterminal v for the subsequence from i to j

Determining the Likelihood of a Sequence: The Inside Algorithm
α(i, j, v): the probability of all parse subtrees rooted at nonterminal v for the subsequence from i to j
[figure: a parse subtree rooted at v, with child nonterminals y and z, spanning positions i..j of a sequence of length L]

Inside Calculation Example
[figure: an example parse of the sequence GGGAACC using nonterminals s, p, b_A, b_C, b_G, illustrating an inside probability calculation]

Determining the Likelihood of a Sequence: The Inside Algorithm
[figure: α(i, j, v) is computed by summing over splits of the subsequence at position k, with nonterminal y covering i..k and nonterminal z covering k+1..j]
M is the number of nonterminals in the grammar

The Inside Algorithm
Initialization (for i = 1 to L, v = 1 to M)
Iteration (for i = L-1 to 1, j = i+1 to L, v = 1 to M)
Termination (uses the start nonterminal)
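The equations on this slide were images. For reference, the standard Inside recursions for an SCFG in Chomsky Normal Form with nonterminals numbered 1..M (nonterminal 1 being the start), written in LaTeX and using the notation defined above, are:

    \text{Initialization: } \alpha(i, i, v) = e_v(x_i)

    \text{Iteration: } \alpha(i, j, v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} t_v(y, z)\, \alpha(i, k, y)\, \alpha(k+1, j, z)

    \text{Termination: } P(x \mid \theta) = \alpha(1, L, 1)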

Learning SCFG Parameters
If we know the parse tree for each training sequence, learning the SCFG parameters is simple:
- no hidden part of the problem during training
- count how often each parameter (i.e. production) is used
- normalize/smooth to get probabilities
More commonly, there are many possible parse trees per sequence – we don't know which one is correct
- thus, use an EM approach (Inside-Outside)
  - iteratively determine expected # times each production is used
    - consider all parses
    - weight each by its probability
  - set parameters to maximize likelihood given these counts

The Inside-Outside Algorithm
We can learn the parameters of an SCFG from training sequences using an EM approach called Inside-Outside
In the E-step, we determine:
- the expected number of times each nonterminal is used in parses
- the expected number of times each production is used in parses
In the M-step, we update our production probabilities

The Outside Algorithm
β(i, j, v): the probability of parse trees rooted at the start nonterminal S, excluding the probability of all subtrees rooted at nonterminal v covering the subsequence from i to j

Outside Calculation Example
[figure: an outside probability calculation for the sequence GGGAACC]

The Outside Algorithm
We can recursively calculate β(i, j, v) from β values we've already calculated for y
The first case we consider is where v is used in productions of the form W_y → W_z W_v, with z covering the subsequence from k to i-1 and y covering the subsequence from k to j

The Outside Algorithm
The second case we consider is where v is used in productions of the form W_y → W_v W_z, with z covering the subsequence from j+1 to k and y covering the subsequence from i to k

The Outside Algorithm
Initialization
Iteration (for i = 1 to L, j = L to i, v = 1 to M)
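Again, the equations themselves were images. The standard Outside recursions, covering the two cases above (with nonterminal 1 as the start nonterminal), are:

    \text{Initialization: } \beta(1, L, 1) = 1, \quad \beta(1, L, v) = 0 \text{ for } v \neq 1

    \text{Iteration: } \beta(i, j, v) = \sum_{y,z} \sum_{k=1}^{i-1} \alpha(k, i-1, z)\, \beta(k, j, y)\, t_y(z, v)
                     + \sum_{y,z} \sum_{k=j+1}^{L} \alpha(j+1, k, z)\, \beta(i, k, y)\, t_y(v, z)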

The Inside-Outside Algorithm
We can learn the parameters of an SCFG from training sequences using an EM approach called Inside-Outside
In the E-step, we determine:
- the expected number of times each nonterminal is used in parses
- the expected number of times each production is used in parses
In the M-step, we update our production probabilities

The Inside-Outside Algorithm
The EM re-estimation equations (for 1 sequence) divide the expected number of times each production is used (e.g. the cases where v is used to generate terminal a) by the expected number of times v is used to generate any subsequence
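The equations were shown as images; the standard single-sequence re-estimation formulas (the 1/P(x|θ) factors that turn these sums into expected counts cancel between numerator and denominator) are:

    \hat{t}_v(y, z) = \frac{\sum_{i=1}^{L-1} \sum_{j=i+1}^{L} \sum_{k=i}^{j-1} \beta(i, j, v)\, t_v(y, z)\, \alpha(i, k, y)\, \alpha(k+1, j, z)}
                           {\sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i, j, v)\, \beta(i, j, v)}

    \hat{e}_v(a) = \frac{\sum_{i:\, x_i = a} \beta(i, i, v)\, e_v(a)}
                        {\sum_{i=1}^{L} \sum_{j=i}^{L} \alpha(i, j, v)\, \beta(i, j, v)}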

Finding the Most Likely Parse: The CYK Algorithm
Involves filling in:
- a 3D matrix representing the most probable parse subtree rooted at nonterminal v for the subsequence from i to j
- a matrix for the traceback, storing information about the production at the top of this parse subtree

The CYK Algorithm
Initialization (for i = 1 to L, v = 1 to M)
Iteration (for i = L-1 to 1, j = i+1 to L, v = 1 to M)
Termination (uses the start nonterminal)
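The recursions on this slide were images; they mirror the Inside algorithm with max in place of sum. In the usual log-space formulation (matching the τ traceback matrix used on the next slide):

    \text{Initialization: } \gamma(i, i, v) = \log e_v(x_i), \quad \tau(i, i, v) = (0, 0, 0)

    \text{Iteration: } \gamma(i, j, v) = \max_{y, z} \max_{k=i}^{j-1} \left[ \gamma(i, k, y) + \gamma(k+1, j, z) + \log t_v(y, z) \right]

    \qquad\qquad\;\; \tau(i, j, v) = \operatorname*{argmax}_{(y, z, k)} \left[ \gamma(i, k, y) + \gamma(k+1, j, z) + \log t_v(y, z) \right]

    \text{Termination: } \log P(x, \hat{\pi} \mid \theta) = \gamma(1, L, 1)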

The CYK Algorithm Traceback
Initialization:
  push (1, L, 1) on the stack
Iteration:
  pop (i, j, v)               // pop subsequence/nonterminal pair
  (y, z, k) = τ(i, j, v)      // get best production identified by CYK
  if (y, z, k) == (0, 0, 0)   // indicating a leaf
    attach x_i as the child of v
  else
    attach y, z to parse tree as children of v
    push (i, k, y)
    push (k+1, j, z)
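To make the fill and traceback concrete, here is a minimal Python sketch of CYK for an SCFG in Chomsky Normal Form. The grammar representation (dicts t and e keyed by nonterminal number) and the recursive traceback are illustrative assumptions, not the course's implementation:

    import math

    def cyk(x, t, e, M, start=1):
        """Most probable parse of sequence x under an SCFG in Chomsky Normal Form.
        x    : the sequence (a string over the terminal alphabet)
        t[v] : dict mapping (y, z) -> P(W_v -> W_y W_z)
        e[v] : dict mapping terminal a -> P(W_v -> a)
        M    : number of nonterminals, numbered 1..M (nonterminal 1 is the start)
        Returns (log probability of the best parse, traceback matrix tau)."""
        L = len(x)
        NEG_INF = float("-inf")
        # gamma[i][j][v] = log prob of the best parse subtree rooted at v for x_i..x_j
        gamma = [[[NEG_INF] * (M + 1) for _ in range(L + 1)] for _ in range(L + 1)]
        tau = [[[(0, 0, 0)] * (M + 1) for _ in range(L + 1)] for _ in range(L + 1)]

        # initialization: length-1 subsequences use the emission parameters e_v
        for i in range(1, L + 1):
            for v in range(1, M + 1):
                p = e.get(v, {}).get(x[i - 1], 0.0)
                if p > 0:
                    gamma[i][i][v] = math.log(p)

        # iteration: i runs from L-1 down to 1 so that shorter subsequences
        # are always filled before the longer ones that depend on them
        for i in range(L - 1, 0, -1):
            for j in range(i + 1, L + 1):
                for v in range(1, M + 1):
                    for (y, z), p in t.get(v, {}).items():
                        if p <= 0:
                            continue
                        for k in range(i, j):
                            score = gamma[i][k][y] + gamma[k + 1][j][z] + math.log(p)
                            if score > gamma[i][j][v]:
                                gamma[i][j][v] = score
                                tau[i][j][v] = (y, z, k)

        # termination: best parse of the whole sequence rooted at the start nonterminal
        return gamma[1][L][start], tau

    def traceback(x, tau, i, j, v):
        """Recursive version of the slide's stack-based traceback.
        Returns the parse tree as nested tuples (nonterminal, children...)."""
        y, z, k = tau[i][j][v]
        if (y, z, k) == (0, 0, 0):                  # leaf: v emitted x_i
            return (v, x[i - 1])
        return (v, traceback(x, tau, i, k, y),      # left child covers i..k
                   traceback(x, tau, k + 1, j, z))  # right child covers k+1..j

Calling cyk(x, t, e, M) and then traceback(x, tau, 1, len(x), 1) reproduces the stack-based procedure above, just expressed recursively.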

Comparison of SCFG Algorithms to HMM Algorithms

                             HMM algorithm        SCFG algorithm
  optimal alignment          Viterbi              CYK
  probability of sequence    forward              inside
  EM parameter estimation    forward-backward     inside-outside
  memory complexity          O(LM)                O(L^2 M)
  time complexity            O(L M^2)             O(L^3 M^3)

(L = sequence length, M = number of states or nonterminals)