Natural Language Processing: Probabilistic Context-Free Grammars (updated 8/07)


Motivation: N-gram models and HMM tagging only allow us to process sentences linearly. However, even simple sentences require a nonlinear model, one that reflects the hierarchical structure of the sentence rather than the linear order of its words. For example, HMM taggers have difficulty with long-range dependencies such as the one in "The velocity of the seismic waves rises to …".

Hierarchical phrase structure: the verb rises agrees in number with the noun velocity, which is the head of the preceding noun phrase, not with the nearer plural noun waves.

PCFGs: Probabilistic Context-Free Grammars are the simplest and most natural probabilistic model for tree structures, and the algorithms for them are closely related to those for HMMs. Note, however, that there are other ways of building probabilistic models of syntactic structure (see Chapter 12).

Formal definition of PCFGs. A PCFG consists of:
- a set of terminals, {w^k}, k = 1, …, V
- a set of nonterminals, {N^i}, i = 1, …, n
- a designated start symbol, N^1
- a set of rules, {N^i → ζ^j}, where each ζ^j is a sequence of terminals and nonterminals
- a corresponding set of rule probabilities such that, for every i: Σ_j P(N^i → ζ^j) = 1
The probability of a sentence (according to grammar G) is given by P(w_1m) = Σ_t P(w_1m, t), where t ranges over the parse trees of the sentence.
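To make the definition concrete, here is a minimal sketch of one possible representation in Python; the class name PCFG and the rule encoding are illustrative choices, not part of the slides. It encodes the small example grammar used later in this deck and checks the normalization constraint Σ_j P(N^i → ζ^j) = 1.

```python
from collections import defaultdict

class PCFG:
    """A probabilistic context-free grammar: rules N^i -> zeta^j with probabilities."""

    def __init__(self, start="S"):
        self.start = start
        # rules[lhs] is a list of (rhs, probability) pairs,
        # where rhs is a tuple of terminals and/or nonterminals.
        self.rules = defaultdict(list)

    def add_rule(self, lhs, rhs, prob):
        self.rules[lhs].append((tuple(rhs), prob))

    def check_normalized(self, tol=1e-9):
        """For every nonterminal N^i, the probabilities of its rules must sum to 1."""
        for lhs, expansions in self.rules.items():
            total = sum(p for _, p in expansions)
            if abs(total - 1.0) > tol:
                raise ValueError(f"Rules for {lhs} sum to {total}, not 1")

# The example grammar used later in these slides.
g = PCFG(start="S")
g.add_rule("S", ["NP", "VP"], 1.0)
g.add_rule("PP", ["P", "NP"], 1.0)
g.add_rule("VP", ["V", "NP"], 0.7)
g.add_rule("VP", ["VP", "PP"], 0.3)
g.add_rule("P", ["with"], 1.0)
g.add_rule("V", ["saw"], 1.0)
g.add_rule("NP", ["NP", "PP"], 0.4)
g.add_rule("NP", ["astronomers"], 0.1)
g.add_rule("NP", ["ears"], 0.18)
g.add_rule("NP", ["saw"], 0.04)
g.add_rule("NP", ["stars"], 0.18)
g.add_rule("NP", ["telescopes"], 0.1)
g.check_normalized()   # e.g. NP: 0.4 + 0.1 + 0.18 + 0.04 + 0.18 + 0.1 = 1.0
```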

Notation:
- G: the grammar; L: the language generated by G; t: a parse tree
- {N^1, …, N^n}: nonterminals; {w^1, …, w^V}: terminals
- w_1 … w_m: the sentence to be parsed
- N^j_pq: the nonterminal N^j spanning (dominating) words w_p through w_q
- α_j(p,q), β_j(p,q): outside and inside probabilities (defined below)

Domination: a nonterminal N^j is said to dominate the words w_a … w_b if it is possible to rewrite N^j as the sequence of words w_a … w_b. Alternative notations: yield(N^j) = w_a … w_b, or N^j ⇒* w_a … w_b.

Assumptions of the model:
- Place invariance: the probability of a subtree does not depend on where in the string the words it dominates occur.
- Context freeness: the probability of a subtree does not depend on words not dominated by the subtree.
- Ancestor freeness: the probability of a subtree does not depend on nodes in the derivation outside the subtree.

Example: a simple PCFG
S → NP VP  1.0
PP → P NP  1.0
VP → V NP  0.7
VP → VP PP  0.3
P → with  1.0
V → saw  1.0
NP → NP PP  0.4
NP → astronomers  0.1
NP → ears  0.18
NP → saw  0.04
NP → stars  0.18
NP → telescopes  0.1
P(astronomers saw stars with ears) = ?

Example: parsing "astronomers saw stars with ears" with the simple grammar. The sentence has two parses: t1 attaches the PP "with ears" to the object NP ("stars with ears"), while t2 attaches it to the VP. Multiplying the probabilities of the rules used in each tree gives:
P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
P(w_1m) = P(t1) + P(t2) = 0.0015876
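As a quick check of the arithmetic above, here is a small sketch (plain Python, no dependencies) that multiplies the rule probabilities used in each of the two parse trees; the rule lists are read off the trees by hand.

```python
from math import prod

# Probabilities of the grammar rules used in each parse tree, top-down.
t1_rules = [  # PP attached to the object NP: [stars [with ears]]
    1.0,   # S  -> NP VP
    0.1,   # NP -> astronomers
    0.7,   # VP -> V NP
    1.0,   # V  -> saw
    0.4,   # NP -> NP PP
    0.18,  # NP -> stars
    1.0,   # PP -> P NP
    1.0,   # P  -> with
    0.18,  # NP -> ears
]
t2_rules = [  # PP attached to the VP: [[saw stars] [with ears]]
    1.0,   # S  -> NP VP
    0.1,   # NP -> astronomers
    0.3,   # VP -> VP PP
    0.7,   # VP -> V NP
    1.0,   # V  -> saw
    0.18,  # NP -> stars
    1.0,   # PP -> P NP
    1.0,   # P  -> with
    0.18,  # NP -> ears
]

p_t1 = prod(t1_rules)           # ~0.0009072
p_t2 = prod(t2_rules)           # ~0.0006804
print(p_t1, p_t2, p_t1 + p_t2)  # total ~0.0015876
```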

Some features of PCFGs:
- A PCFG gives some idea of the plausibility of different parses; however, the probabilities are based on structural factors, not lexical ones.
- PCFGs are good for grammar induction.
- PCFGs are robust.
- PCFGs give a probabilistic language model for English.
- The predictive power of a PCFG, as measured by entropy, tends to be greater than that of an HMM.
- PCFGs are not good models alone, but they can be combined with a trigram model.
- PCFGs have certain biases which may not be appropriate.

Improper probability & PCFGs. Is Σ_w P(w) = 1 always satisfied for PCFGs? Consider the grammar
S → a b   P = 1/3
S → S S   P = 2/3
which generates the language (ab)^n:
P(ab) = 1/3
P(abab) = 2/27
P(ababab) = 8/243
Σ_w P(w) = 1/3 + 2/27 + 8/243 + 40/2187 + … = 1/2
This is an improper probability distribution. However, PCFGs whose rule probabilities are learned from a parsed training corpus always give a proper probability distribution (Chi and Geman, 1998). A small numeric check of this sum is sketched below.
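A minimal numeric check of the sum above. It uses one fact not stated on the slide: the number of distinct derivations of (ab)^n under this grammar is the Catalan number C(n-1), since each derivation is a binary tree with n leaves.

```python
from math import comb
from fractions import Fraction

def catalan(n):
    """The n-th Catalan number: the number of binary trees with n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

def p_string(n):
    """P((ab)^n): each derivation uses S -> S S  (n-1) times and S -> a b  n times."""
    return catalan(n - 1) * Fraction(2, 3) ** (n - 1) * Fraction(1, 3) ** n

print(p_string(1), p_string(2), p_string(3), p_string(4))  # 1/3 2/27 8/243 40/2187
total = sum(p_string(n) for n in range(1, 200))
print(float(total))  # approaches 0.5, not 1.0: the distribution is improper
```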

Questions for PCFGs: just as for HMMs, there are three basic questions we wish to answer:
1. What is the probability of a sentence w_1m according to a grammar G: P(w_1m | G)?
2. What is the most likely parse for a sentence: argmax_t P(t | w_1m, G)?
3. How can we choose rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w_1m | G)?

Restriction: here we only consider grammars in Chomsky Normal Form, which have only unary and binary rules of the form
N^i → N^j N^k
N^i → w^j
The parameters of a PCFG in Chomsky Normal Form are:
- P(N^j → N^r N^s | G), an n^3 matrix of parameters
- P(N^j → w^k | G), nV parameters
(where n is the number of nonterminals and V is the number of terminals)
For every j: Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1

From HMMs to Probabilistic Regular Grammars (PRGs): a PRG has start state N^1 and rules of the form
N^i → w^j N^k
N^i → w^j
This is similar to what we had for an HMM, except that for an HMM we have Σ_{w_1n} P(w_1n) = 1 for each length n, whereas for a PCFG we have Σ_{w ∈ L} P(w) = 1, where L is the language generated by the grammar and w is a sentence in the language. To see the difference, consider the probability of the incomplete string "John decided to bake a" under an HMM and under a PCFG. PRGs are related to HMMs in that a PRG is essentially an HMM to which we add a start state and a finish (or sink) state.

From PRGs to PCFGs: for HMMs we were able to do calculations efficiently in terms of forward and backward probabilities. In a parse tree, the analogue of the forward probability corresponds to everything above and including a certain node, while the analogue of the backward probability corresponds to everything below a certain node. We therefore introduce outside (α_j) and inside (β_j) probabilities:
α_j(p,q) = P(w_1(p-1), N^j_pq, w_(q+1)m | G)
β_j(p,q) = P(w_pq | N^j_pq, G)

Inside and outside probabilities β_j(p,q) and α_j(p,q)

The probability of a string I: using inside probabilities. We use the inside algorithm, a dynamic-programming algorithm based on the inside probabilities:
P(w_1m | G) = P(N^1 ⇒* w_1m | G) = P(w_1m | N^1_1m, G) = β_1(1,m)
Base case: β_j(k,k) = P(w_k | N^j_kk, G) = P(N^j → w_k | G)
Induction: β_j(p,q) = Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p,d) β_s(d+1,q)
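Below is a sketch of the inside algorithm in Python for a grammar in Chomsky Normal Form, using the example grammar from these slides. The data layout (rule tables keyed by nonterminal) and the function name inside_chart are illustrative assumptions, not from the slides; the recursion itself follows the base case and induction step above.

```python
from collections import defaultdict

# The example CNF grammar from these slides, as two rule tables:
#   binary[lhs]  = list of (left_child, right_child, probability)
#   lexical[lhs] = dict mapping terminal -> probability
binary = {
    "S":  [("NP", "VP", 1.0)],
    "PP": [("P",  "NP", 1.0)],
    "VP": [("V",  "NP", 0.7), ("VP", "PP", 0.3)],
    "NP": [("NP", "PP", 0.4)],
}
lexical = {
    "P":  {"with": 1.0},
    "V":  {"saw": 1.0},
    "NP": {"astronomers": 0.1, "ears": 0.18, "saw": 0.04,
           "stars": 0.18, "telescopes": 0.1},
}

def inside_chart(words):
    """beta[(j, p, q)] = P(w_p ... w_q | N^j spans p..q), with 1-based positions."""
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words, start=1):
        for lhs, emissions in lexical.items():
            if w in emissions:
                beta[(lhs, k, k)] = emissions[w]
    # Induction on span length:
    # beta_j(p, q) = sum over rules N^j -> N^r N^s and split points d of
    #                P(N^j -> N^r N^s) * beta_r(p, d) * beta_s(d+1, q)
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for lhs, rules in binary.items():
                total = 0.0
                for r, s, prob in rules:
                    for d in range(p, q):
                        total += prob * beta[(r, p, d)] * beta[(s, d + 1, q)]
                if total > 0.0:
                    beta[(lhs, p, q)] = total
    return beta

words = "astronomers saw stars with ears".split()
beta = inside_chart(words)
print(beta[("S", 1, len(words))])   # ~0.0015876 = P(sentence | G) = beta_1(1, m)
```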

Inside probs - inductive step

The simple PCFG again:
S → NP VP  1.0
PP → P NP  1.0
VP → V NP  0.7
VP → VP PP  0.3
P → with  1.0
V → saw  1.0
NP → NP PP  0.4
NP → astronomers  0.1
NP → ears  0.18
NP → saw  0.04
NP → stars  0.18
NP → telescopes  0.1
P(astronomers saw stars with ears) = ?

The probability of a string II: example of inside probabilities for "astronomers saw stars with ears" (word positions 1-5). The nonzero β_j(p,q) values are:
- astronomers (1): β_NP(1,1) = 0.1, β_S(1,3) = 0.0126, β_S(1,5) = 0.0015876
- saw (2): β_NP(2,2) = 0.04, β_V(2,2) = 1.0, β_VP(2,3) = 0.126, β_VP(2,5) = 0.015876
- stars (3): β_NP(3,3) = 0.18, β_NP(3,5) = 0.01296
- with (4): β_P(4,4) = 1.0, β_PP(4,5) = 0.18
- ears (5): β_NP(5,5) = 0.18

The probability of a string III: using outside probabilities. We use the outside algorithm, based on the outside probabilities:
P(w_1m | G) = Σ_j P(w_1(k-1), w_k, w_(k+1)m, N^j_kk | G) = Σ_j α_j(k,k) P(N^j → w_k), for any k
Base case: α_1(1,m) = 1; α_j(1,m) = 0 for j ≠ 1
Inductive case: the node N^j_pq may be either the left branch or the right branch of its parent node (a code sketch of the full recursion follows the next two slides).

Outside probabilities, inductive step: we sum over the two contributions (the node as left child and as right child of its parent).

Outside probabilities, inductive step II.
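The two configurations on these slides combine into the standard outside recursion: α_j(p,q) collects the cases where N^j is the left child of its parent (with its sibling to the right) and where it is the right child (with its sibling to the left). Here is a sketch in Python; it expects an inside chart beta of the kind produced by the earlier inside-algorithm sketch, and the function and variable names are illustrative.

```python
from collections import defaultdict

def outside_chart(m, binary_rules, beta, start="S"):
    """alpha[(j, p, q)] for a sentence of length m.

    binary_rules: list of (parent, left, right, probability) tuples (CNF grammar).
    beta: an inside chart of the form produced by inside_chart() above.
    """
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0       # base case: alpha_1(1, m) = 1, all others 0
    # Work from longer spans down to shorter ones, so parents are ready first.
    for span in range(m - 1, 0, -1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for parent, left, right, prob in binary_rules:
                # N^j = left child: parent spans (p, e), right sibling spans (q+1, e)
                for e in range(q + 1, m + 1):
                    alpha[(left, p, q)] += alpha[(parent, p, e)] * prob * beta[(right, q + 1, e)]
                # N^j = right child: parent spans (e, q), left sibling spans (e, p-1)
                for e in range(1, p):
                    alpha[(right, p, q)] += alpha[(parent, e, q)] * prob * beta[(left, e, p - 1)]
    return alpha

# Example, using `binary` and `beta` from the inside-algorithm sketch:
#   rules = [(f, l, r, pr) for f, rs in binary.items() for l, r, pr in rs]
#   alpha = outside_chart(5, rules, beta)
#   For any single word position k, sum_j alpha_j(k,k) * beta_j(k,k) = P(sentence | G):
#   sum(alpha[(j, 3, 3)] * beta[(j, 3, 3)] for j in ("S", "NP", "VP", "PP", "V", "P"))
#   gives ~0.0015876.
```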

Combining inside and outside: similarly to the HMM case, we can combine the inside and outside probabilities: P(w_1m, N_pq | G) = Σ_j α_j(p,q) β_j(p,q), the probability of the sentence together with some constituent spanning words p through q.

Finding the most likely parse for a sentence: the algorithm works by finding the highest-probability partial parse tree spanning a given substring and rooted in a given nonterminal.
δ_i(p,q) = the highest inside probability of a parse of a subtree N^i_pq
Initialization: δ_i(p,p) = P(N^i → w_p)
Induction: δ_i(p,q) = max_{j,k, p≤r<q} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)
Store backtrace: ψ_i(p,q) = argmax_{(j,k,r)} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)
Termination: P(t̂) = δ_1(1,m)
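A sketch of this Viterbi-style (probabilistic CKY) computation in Python; it reuses the CNF rule tables from the inside-algorithm sketch, and the names delta, psi, and viterbi_parse are illustrative.

```python
def viterbi_parse(words, binary, lexical, start="S"):
    """Most likely parse under a CNF PCFG: delta = best probabilities, psi = backpointers."""
    m = len(words)
    delta, psi = {}, {}
    # Initialization: delta_i(p, p) = P(N^i -> w_p)
    for p, w in enumerate(words, start=1):
        for lhs, emissions in lexical.items():
            if w in emissions:
                delta[(lhs, p, p)] = emissions[w]
                psi[(lhs, p, p)] = w
    # Induction: as in the inside algorithm, but take the max instead of the sum.
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for lhs, rules in binary.items():
                best, back = 0.0, None
                for left, right, prob in rules:
                    for r in range(p, q):
                        score = (prob * delta.get((left, p, r), 0.0)
                                      * delta.get((right, r + 1, q), 0.0))
                        if score > best:
                            best, back = score, (left, right, r)
                if back is not None:
                    delta[(lhs, p, q)], psi[(lhs, p, q)] = best, back
    if (start, 1, m) not in delta:
        return 0.0, None                         # no parse for this sentence
    def build(lhs, p, q):
        """Follow backpointers to rebuild the best tree as nested tuples."""
        entry = psi[(lhs, p, q)]
        if isinstance(entry, str):               # preterminal -> word
            return (lhs, entry)
        left, right, r = entry
        return (lhs, build(left, p, r), build(right, r + 1, q))
    return delta[(start, 1, m)], build(start, 1, m)

# Example, with the `binary` and `lexical` tables from the inside-algorithm sketch:
#   prob, tree = viterbi_parse("astronomers saw stars with ears".split(), binary, lexical)
#   prob is ~0.0009072, and the tree attaches "with ears" to the object NP "stars".
```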

Training a PCFG. Restriction: we assume that the set of rules is given in advance, and we try to find the optimal probabilities to assign to the different grammar rules. As for HMMs, we use an EM training algorithm, the inside-outside algorithm, which allows us to train the parameters of a PCFG on unannotated sentences of the language. Basic assumption: a good grammar is one that makes the sentences in the training corpus likely to occur, so we seek the grammar that maximizes the likelihood of the training data.
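For reference, the M-step of the inside-outside algorithm re-estimates rule probabilities from expected rule counts computed with α and β. The formulas below state the standard updates for a single training sentence of length m in the notation used above; they follow the textbook treatment and do not appear on the slides themselves.

```latex
\hat{P}(N^j \to N^r N^s) =
  \frac{\displaystyle\sum_{p=1}^{m-1}\sum_{q=p+1}^{m}\sum_{d=p}^{q-1}
        \alpha_j(p,q)\,P(N^j \to N^r N^s)\,\beta_r(p,d)\,\beta_s(d+1,q)}
       {\displaystyle\sum_{p=1}^{m}\sum_{q=p}^{m}\alpha_j(p,q)\,\beta_j(p,q)}
\qquad
\hat{P}(N^j \to w^k) =
  \frac{\displaystyle\sum_{p=1}^{m}\alpha_j(p,p)\,\beta_j(p,p)\,[\,w_p = w^k\,]}
       {\displaystyle\sum_{p=1}^{m}\sum_{q=p}^{m}\alpha_j(p,q)\,\beta_j(p,q)}
```

In practice the expected counts in the numerators and denominators are summed over all training sentences before normalizing.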

Problems with the inside-outside algorithm:
- It is extremely slow: for each sentence, each iteration of training is O(m^3 n^3).
- Local maxima are much more of a problem than for HMMs.
- Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language.
- There is no guarantee that the learned nonterminals will be linguistically motivated.