Learning PCFGs: Estimating Parameters, Learning Grammar Rules Many slides are taken or adapted from slides by Dan Klein.

Treebanks An example tree from the Penn Treebank

The Penn Treebank
– 1 million tokens in 50,000 sentences, each labeled with:
  – A POS tag for each token
  – Labeled constituents
  – “Extra” information:
    – Phrase annotations like “TMP”
    – “Empty” constituents for wh-movement traces, empty subjects for raising constructions
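
A small illustration (not from the slides) of what this annotation looks like, using NLTK’s bracketed-tree reader; the sentence and tree here are invented for the example.

```python
# Read a Penn-Treebank-style bracketed tree and inspect its annotation layers.
from nltk import Tree

t = Tree.fromstring("(S (NP (DT The) (NN dog)) (VP (VBD barked)))")
print(t.pos())          # [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')] -- one POS tag per token
print(t.productions())  # the labeled constituents, viewed as CFG productions
```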

Supervised PCFG Learning
1. Preprocess the treebank:
   1. Remove all “extra” information (empties, extra annotations)
   2. Convert to Chomsky Normal Form
   3. Possibly prune some punctuation, lower-case all words, compute word “shapes”, and do other processing to combat sparsity
2. Count the occurrences of each nonterminal, c(N), and of each observed production rule, c(N -> N_L N_R) and c(N -> w)
3. Set the probability for each rule to the MLE (see the sketch below):
   P(N -> N_L N_R) = c(N -> N_L N_R) / c(N)
   P(N -> w) = c(N -> w) / c(N)
Easy, peasy, lemon-squeezy.
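
To make the counting and normalization concrete, here is a minimal sketch in Python. It assumes CNF trees represented as nested tuples, (label, left, right) for binary rules and (label, word) for lexical rules; the representation and names are illustrative, not from the slides.

```python
# Supervised PCFG estimation by relative frequency (MLE) over CNF trees.
from collections import defaultdict

def count_rules(tree, rule_counts, nt_counts):
    label = tree[0]
    nt_counts[label] += 1
    if len(tree) == 2 and isinstance(tree[1], str):   # lexical rule N -> w
        rule_counts[(label, tree[1])] += 1
    else:                                             # binary rule N -> N_L N_R
        left, right = tree[1], tree[2]
        rule_counts[(label, left[0], right[0])] += 1
        count_rules(left, rule_counts, nt_counts)
        count_rules(right, rule_counts, nt_counts)

def estimate_pcfg(treebank):
    rule_counts, nt_counts = defaultdict(int), defaultdict(int)
    for tree in treebank:
        count_rules(tree, rule_counts, nt_counts)
    # MLE: P(rule) = c(rule) / c(parent nonterminal)
    return {rule: c / nt_counts[rule[0]] for rule, c in rule_counts.items()}

toy_treebank = [("S", ("NP", "dogs"), ("VP", "bark"))]
print(estimate_pcfg(toy_treebank))
# {('S', 'NP', 'VP'): 1.0, ('NP', 'dogs'): 1.0, ('VP', 'bark'): 1.0}
```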

Complications
Smoothing
– Especially for lexicalized grammars, many test productions will never be observed during training
– We don’t necessarily want to assign these productions zero probability
– Instead, define backoff distributions, e.g.:
  P_final(VP[transmogrified] -> V[transmogrified] PP[into]) =
    α P(VP[transmogrified] -> V[transmogrified] PP[into]) + (1 - α) P(VP -> V PP[into])
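
As an illustration of the interpolation above, a minimal sketch; the rule encodings, the probability dictionaries, and the value of α are all assumptions made for the example.

```python
# Linear-interpolation backoff: mix a sparse lexicalized rule probability with
# an unlexicalized backoff probability so unseen rules keep nonzero mass.
def smoothed_rule_prob(lex_rule, backoff_rule, p_lex, p_backoff, alpha=0.7):
    """P_final = alpha * P(lexicalized rule) + (1 - alpha) * P(backoff rule)."""
    return alpha * p_lex.get(lex_rule, 0.0) + (1 - alpha) * p_backoff.get(backoff_rule, 0.0)

# e.g. P_final(VP[transmogrified] -> V[transmogrified] PP[into])
p = smoothed_rule_prob(
    ("VP[transmogrified]", "V[transmogrified]", "PP[into]"),
    ("VP", "V", "PP[into]"),
    p_lex={},                                  # lexicalized rule never observed
    p_backoff={("VP", "V", "PP[into]"): 0.2},
)
print(p)  # 0.06 -- nonzero even though the lexicalized rule was unseen in training
```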

Problems with Supervised PCFG Learning
Coming up with labeled data is hard!
– Time-consuming
– Expensive
– Hard to adapt to new domains, tasks, languages
– Corpus availability drives research (instead of tasks driving the research)
The Penn Treebank took many person-years to annotate.

Unsupervised Learning of PCFGs: Feasible?

Unsupervised Learning
Systems take raw data and automatically detect structure in it
Why?
– More data is available
– Kids learn (some aspects of) language with no supervision
– Insights into machine learning and clustering

Grammar Induction and Learnability
Some have argued that learning syntax from positive data alone is impossible:
– Gold, 1967: non-identifiability in the limit
– Chomsky, 1980: poverty of the stimulus
Surprising result: it’s possible to get entirely unsupervised parsing to work (reasonably) well.

Learnability
Learnability: formal conditions under which a class of languages can be learned
Setup:
– Class of languages Λ
– Algorithm H (the learner)
– H sees a sequence X of strings x_1 … x_n
– H maps sequences X to languages L in Λ
Question: for what classes Λ do learners H exist?

Learnability [Gold, 1967]
Criterion: identification in the limit
– A presentation of L is an infinite sequence of x’s from L in which each x occurs at least once
– A learner H identifies L in the limit if, for any presentation of L, from some point n onwards, H always outputs L
– A class Λ is identifiable in the limit if there is some single H which correctly identifies in the limit every L in Λ
Example: Λ = {{a}, {a,b}} is identifiable in the limit (output {a} until a b appears, then output {a,b}).
Theorem (Gold, 1967): Any Λ which contains all finite languages and at least one infinite language (i.e., is superfinite) is unlearnable in this sense.

Learnability [Gold, 1967]
Proof sketch:
– Assume Λ is superfinite and H identifies Λ in the limit
– There exists a chain L_1 ⊂ L_2 ⊂ … ⊂ L_∞
– Construct the following misleading sequence:
  – Present strings from L_1 until H outputs L_1
  – Present strings from L_2 until H outputs L_2
  – …
– This is a presentation of L_∞, but H never outputs L_∞

Learnability [Horning, 1969]
Problem: identification in the limit requires that H succeed on every presentation, even the weird ones
Another criterion: measure-one identification
– Assume a distribution P_L(x) for each L
– Assume P_L(x) puts non-zero probability on all and only the x in L
– Assume an infinite presentation of x drawn i.i.d. from P_L(x)
– H measure-one identifies L if the probability of drawing a sequence X from which H can identify L is 1
Theorem (Horning, 1969): PCFGs can be identified in this sense.
– Note: there can be misleading sequences, but they have to be (infinitely) unlikely

Learnability [Horning, 1969]
Proof sketch:
– Assume Λ is a recursively enumerable set of recursive languages (e.g., the set of all PCFGs)
– Assume an ordering on all strings: x_1 < x_2 < …
– Define: two sequences A and B agree through n iff for all x < x_n, x is in A ⇔ x is in B
– Define the error set E(L, n, m):
  – All sequences such that the first m elements do not agree with L through n
  – These are the sequences which contain early strings outside of L (can’t happen), or which fail to contain all of the early strings in L (happens less as m increases)
– Claim: P(E(L, n, m)) goes to 0 as m goes to ∞
– Let d_L(n) be the smallest m such that P(E) < 2^-n
– Let d(n) be the largest d_L(n) in the first n languages
– Learner: after d(n), pick the first L that agrees with the evidence through n
– This can only fail for sequences X if X keeps showing up in E(L, n, d(n)), which happens infinitely often with probability zero

Learnability
Gold’s results say little about real learners (the requirements are too strong)
Horning’s algorithm is completely impractical
– It needs astronomical amounts of data
Even measure-one identification doesn’t say anything about tree structures
– It only talks about learning grammatical sets
– Strong generative vs. weak generative capacity

Unsupervised POS Tagging
Some (discouraging) experiments [Merialdo 94]
Setup:
– You know the set of allowable tags for each word (but not the frequency of each tag)
– Learn a supervised model on k training sentences: learn P(w|t) and P(t_i | t_{i-1}, t_{i-2}) on these sentences (see the sketch below)
– On n > k sentences, re-estimate with EM
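
A minimal sketch of the supervised initialization step, assuming tagged sentences are given as lists of (word, tag) pairs; the EM re-estimation itself (forward-backward constrained by the tag dictionary) is omitted, and all names are illustrative.

```python
# Count-based estimates of P(w|t) and the trigram tag model P(t_i | t_{i-2}, t_{i-1}).
from collections import defaultdict

def supervised_init(tagged_sentences):
    emit, emit_tot = defaultdict(float), defaultdict(float)
    trans, trans_tot = defaultdict(float), defaultdict(float)
    for sent in tagged_sentences:                    # sent = [(word, tag), ...]
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            emit[(t, w)] += 1
            emit_tot[t] += 1
        for i in range(2, len(tags)):
            trans[(tags[i - 2], tags[i - 1], tags[i])] += 1
            trans_tot[(tags[i - 2], tags[i - 1])] += 1
    p_emit = {k: v / emit_tot[k[0]] for k, v in emit.items()}
    p_trans = {k: v / trans_tot[(k[0], k[1])] for k, v in trans.items()}
    return p_emit, p_trans

p_emit, p_trans = supervised_init([[("dogs", "N"), ("bark", "V")]])
print(p_emit[("N", "dogs")], p_trans[("<s>", "<s>", "N")])  # 1.0 1.0
```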

Merialdo: Results

Grammar Induction Unsupervised Learning of Grammars and Parameters

Right-branching Baseline
In English (but not necessarily in other languages), trees tend to be right-branching.
A simple, English-specific baseline is to choose the fully right-branching chain structure for each sentence (sketched below).
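
A minimal sketch of this baseline, assuming a sentence is just a list of words and a tree is a nested pair.

```python
# Right-branching baseline: always attach the remaining suffix to the right.
def right_branching(words):
    if len(words) == 1:
        return words[0]
    return (words[0], right_branching(words[1:]))

print(right_branching(["dogs", "bark", "at", "cats"]))
# ('dogs', ('bark', ('at', 'cats')))
```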

Distributional Clustering

Nearest Neighbors

Learn PCFGs with EM [Lari and Young, 1990]
Setup:
– Full binary grammar with n nonterminals {X_1, …, X_n} (that is, at the beginning, the grammar has all possible rules; sketched below)
– Parse uniformly/randomly at first
– Re-estimate rule expectations off of the parses
– Repeat
Their conclusion: it doesn’t really work
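
A minimal sketch of the “full binary grammar” starting point, with a uniform initialization; lexical rules are ignored here for simplicity, and the names are illustrative.

```python
# Every possible binary rule over the given nonterminals, initialized uniformly.
from itertools import product

def full_binary_grammar(nonterminals):
    rules = list(product(nonterminals, repeat=3))   # (parent, left, right) triples
    p = 1.0 / (len(nonterminals) ** 2)              # uniform over each parent's n^2 expansions
    return {rule: p for rule in rules}

g = full_binary_grammar(["X1", "X2"])
print(len(g), g[("X1", "X1", "X2")])  # 8 rules, each with probability 0.25
```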

EM for PCFGs: Details
1. Start with a “full” grammar, with all possible binary rules for our nonterminals N_1 … N_k. Designate one of them as the start symbol, say N_1
2. Assign some starting distribution to the rules, such as:
   1. Random
   2. Uniform
   3. Some “smart” initialization techniques (see assigned reading)
3. E-step: take an unannotated sentence S, and compute, for all nonterminals N, N_L, N_R and all terminals w:
   E(N | S), E(N -> N_L N_R, N is used | S), E(N -> w, N is used | S)
4. M-step: reset rule probabilities to the MLE:
   P(N -> N_L N_R) = E(N -> N_L N_R | S) / E(N | S)
   P(N -> w) = E(N -> w | S) / E(N | S)
5. Repeat 3 and 4 until rule probabilities stabilize, or “converge”

E-Step
For a sentence S = w_1 … w_m with start symbol N_1, let:
– β_j(p, q) = P(w_p … w_q | N_j spans p..q), the inside probability
– α_j(p, q) = P(w_1 … w_{p-1}, N_j spans p..q, w_{q+1} … w_m), the outside probability
– π = P(S) = β_1(1, m), the total probability of the sentence
We can define the expectations we want in terms of the π, α, β quantities:
– E(N_j | S) = (1/π) Σ_{p ≤ q} α_j(p, q) β_j(p, q)
– E(N_j -> N_l N_r, N_j is used | S) = (1/π) Σ_{p < q} Σ_{d = p..q-1} α_j(p, q) P(N_j -> N_l N_r) β_l(p, d) β_r(d+1, q)
– E(N_j -> w, N_j is used | S) = (1/π) Σ_{k: w_k = w} α_j(k, k) P(N_j -> w)

Inside Probabilities
Base case: β_j(k, k) = P(N_j -> w_k)
Induction: β_j(p, q) = Σ_{l, r} Σ_{d = p..q-1} P(N_j -> N_l N_r) β_l(p, d) β_r(d+1, q)
(Diagram: N_j spans w_p … w_q, with N_l covering w_p … w_d and N_r covering w_{d+1} … w_q.)
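
A minimal sketch of the inside recursion for a CNF grammar, with binary and lexical rules stored in plain dictionaries (all names illustrative); β values are indexed by 0-based, inclusive spans.

```python
# Inside (CKY-style) probabilities: inside[(p, q)][N] = beta_N(p, q) = P(w_p ... w_q | N).
from collections import defaultdict

def inside_probs(words, binary_rules, lexical_rules):
    n = len(words)
    inside = defaultdict(lambda: defaultdict(float))
    # Base case: beta_N(k, k) = P(N -> w_k)
    for k, w in enumerate(words):
        for (N, word), prob in lexical_rules.items():
            if word == w:
                inside[(k, k)][N] += prob
    # Induction: sum over rules N -> N_l N_r and split points d
    for span in range(2, n + 1):
        for p in range(n - span + 1):
            q = p + span - 1
            for (N, Nl, Nr), prob in binary_rules.items():
                for d in range(p, q):
                    inside[(p, q)][N] += prob * inside[(p, d)][Nl] * inside[(d + 1, q)][Nr]
    return inside

# Toy grammar: S -> NP VP, NP -> "dogs", VP -> "bark"
binary = {("S", "NP", "VP"): 1.0}
lexical = {("NP", "dogs"): 1.0, ("VP", "bark"): 1.0}
beta = inside_probs(["dogs", "bark"], binary, lexical)
print(beta[(0, 1)]["S"])  # 1.0 = P("dogs bark" | S)
```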

Outside Probabilities
Base case: α_1(1, m) = 1 for the start symbol; α_j(1, m) = 0 for j ≠ 1
Induction: α_j(p, q) = Σ_{l, r} Σ_{e > q} P(N_l -> N_j N_r) α_l(p, e) β_r(q+1, e) + Σ_{l, r} Σ_{e < p} P(N_l -> N_r N_j) α_l(e, q) β_r(e, p-1)
(Diagram: N_j sits inside a larger constituent N_l; the sibling N_r contributes an inside probability and the parent N_l contributes an outside probability.)
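
Continuing the sketch above, the outside recursion can reuse the inside chart; this block assumes the inside_probs function and the toy binary/lexical grammars from the previous block.

```python
# Outside probabilities: outside[(p, q)][N] = alpha_N(p, q), the probability of
# everything outside span (p, q) given that N covers (p, q).
from collections import defaultdict

def outside_probs(words, binary_rules, inside, start="S"):
    n = len(words)
    outside = defaultdict(lambda: defaultdict(float))
    outside[(0, n - 1)][start] = 1.0                  # base case: the root span
    for span in range(n, 1, -1):                      # largest spans first
        for p in range(n - span + 1):
            q = p + span - 1
            for (N, Nl, Nr), prob in binary_rules.items():
                a = outside[(p, q)][N]
                if a == 0.0:
                    continue
                for d in range(p, q):
                    # Nl covers (p, d): its outside combines the parent's outside
                    # with the sibling's inside, and symmetrically for Nr.
                    outside[(p, d)][Nl] += prob * a * inside[(d + 1, q)][Nr]
                    outside[(d + 1, q)][Nr] += prob * a * inside[(p, d)][Nl]
    return outside

alpha = outside_probs(["dogs", "bark"], binary, beta)
print(alpha[(0, 0)]["NP"], alpha[(1, 1)]["VP"])  # 1.0 1.0
```

With both charts, the expectations on the E-step slide follow directly; for example, the posterior probability that N_j spans (p, q) is α_j(p, q) β_j(p, q) / π, where π is the inside probability of the start symbol over the whole sentence.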

Problem: Model Symmetries

Distributional Syntax?

Problem: Identifying Constituents

A nested distributional model
We’d like a model that:
– Ties spans to linear contexts (like distributional clustering)
– Considers only proper tree structures (like PCFGs)
– Has no symmetries to break (like a dependency model)

Constituent Context Model (CCM)

Results: Constituency

Results: Dependencies

Results: Combined Models

Multilingual Results