Statistical NLP Winter 2009 Lecture 17: Unsupervised Learning IV Tree-structured learning Roger Levy [thanks to Jason Eisner and Mike Frank for some slides]
Inferring tree structure from raw text We've looked at a few types of unsupervised inference of linguistic structure: words from unsegmented utterances; semantic word classes from word bags (= documents); syntactic word classes from word sequences. But a hallmark of language is recursive structure; to a first approximation, this is trees. How might we learn tree structures from raw text?
Is this the right problem? Before proceeding… are we tackling the right problem? We've already seen how the structures giving high likelihood to raw text may not be the structures we want. Also, it's fairly clear that kids don't learn language structure from linguistic input alone… real learning is cross-modal. So why study unsupervised grammar induction from raw text? [cross-modal example: words "blue rings" / objects: rings, Big Bird; words "and green rings" / objects: rings, Big Bird; words "and yellow rings" / objects: rings, Big Bird; words "Big Bird! Do you want to hold the rings?" / objects: Big Bird]
Why induce grammars from raw text? I don't think this is a trivial question to answer. But there are some answers: Practical: if unsupervised grammar induction worked well, it'd save a whole lot of annotation effort. Practical: we need tree structure to get compositional semantics to work. Scientific: unsupervised grammar induction may help us place an upper bound on how much grammatical knowledge must be innate. Theoretical: the models and algorithms that make unsupervised grammar induction work well may help us with other, related tasks (e.g., machine translation).
Plan of today We'll start with a look at an early but influential clustering-based method (Clark, 2001). We'll then talk about EM for PCFGs. Then we'll talk about the next influential work (setting the bar for all modern work): Klein & Manning (2004).
Distributional clustering for syntax How do we evaluate unsupervised grammar induction? Basically like we evaluate supervised parsing: through bracketing accuracy. But this means that we don't have to focus on inducing a grammar; we can focus on inducing constituents in sentences instead. This contrast has been called structure search versus parameter search.
Distributional clustering for syntax (II) What makes a good constituent? Old insight from Zellig Harris (Chomsky's teacher): something can appear in a consistent set of contexts. Sarah saw a dog standing near a tree. / Sarah saw a big dog standing near a tree. / Sarah saw me standing near a tree. / Sarah saw three posts standing near a tree. Therefore we could try to characterize the space of possible constituents by the contexts they appear in. We could then do clustering in this space.
Distributional clustering for syntax (III) Clark, 2001: k-means clustering of common tag sequences, each represented by the distribution of contexts it occurs in. Applied to 12 million words from the British National Corpus. The results were promising: for example, a determiner-initial cluster containing "the big dog", "the very big dog", "the dog", "the dog near the tree".
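To make the clustering step concrete, here is a minimal Python sketch of the idea (an illustration in the spirit of Clark 2001, not his exact procedure; all function names and parameter values are invented for the example): represent each frequent tag sequence by its relative-frequency distribution over (preceding tag, following tag) contexts, then cluster those distributions with k-means.

from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

def context_vectors(tagged_sents, max_len=4, min_count=5):
    """tagged_sents: list of POS-tag sequences, one per sentence.
    Returns (seqs, X): the frequent candidate tag sequences and a matrix whose
    rows are their relative-frequency (left tag, right tag) context distributions."""
    contexts = defaultdict(Counter)
    for tags in tagged_sents:
        padded = ["<s>"] + list(tags) + ["</s>"]
        for i in range(1, len(padded) - 1):
            for j in range(i + 1, min(i + 1 + max_len, len(padded))):
                seq = tuple(padded[i:j])
                contexts[seq][(padded[i - 1], padded[j])] += 1
    # keep only frequent candidate sequences
    contexts = {s: c for s, c in contexts.items() if sum(c.values()) >= min_count}
    vocab = sorted({ctx for c in contexts.values() for ctx in c})
    index = {ctx: k for k, ctx in enumerate(vocab)}
    seqs = sorted(contexts)
    X = np.zeros((len(seqs), len(vocab)))
    for r, s in enumerate(seqs):
        for ctx, n in contexts[s].items():
            X[r, index[ctx]] = n
        X[r] /= X[r].sum()          # context distribution for this sequence
    return seqs, X

def cluster_sequences(seqs, X, k=20, seed=0):
    """Group candidate sequences into k clusters by their context distributions."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
    return dict(zip(seqs, labels))

Sequences that end up in the same cluster share a context distribution and are candidate constituent types; the next slides show why this alone is not enough.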
Distributional clustering for syntax (IV) But… there was a problem: clustering picks up both constituents and things that can't be constituents ("distituents"), e.g. "big dog the", "dog big the", "dog the", "dog and big". The method is good at distinguishing constituent types, but bad at weeding out distituents.
Distributional clustering for syntax (V) Clark 2001's solution: "with real constituents, there is high mutual information between the symbol occurring before the putative constituent and the symbol after…" Filter out distituents on the basis of low mutual information (controlling for distance between symbols). [figure: mutual information separates constituents from distituents]
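A rough sketch of the filtering criterion (my reconstruction of the idea, not Clark's exact estimator, and omitting the distance control mentioned on the slide): compute the mutual information between the symbol before and the symbol after the candidate sequence, and discard candidates for which it is low.

import math
from collections import Counter

def boundary_mi(context_counts):
    """context_counts: Counter mapping (left_tag, right_tag) context pairs to
    counts for one candidate sequence. Returns the mutual information (in nats)
    between the symbol before and the symbol after the candidate."""
    total = sum(context_counts.values())
    left, right = Counter(), Counter()
    for (l, r), n in context_counts.items():
        left[l] += n
        right[r] += n
    mi = 0.0
    for (l, r), n in context_counts.items():
        p_lr = n / total
        mi += p_lr * math.log(p_lr / ((left[l] / total) * (right[r] / total)))
    return mi

# Keep a candidate (or cluster) only if its boundary MI is high; low MI
# suggests a distituent.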
Distributional clustering for syntax (VI) The final step: train an SCFG using vanilla maximum-likelihood estimation (hand-label the clusters to get interpretability). [table: the ten most frequent rules expanding "NP"]
Structure vs. parameter search This did pretty well. But the clustering-based structure-search approach is throwing out a hugely important type of information (did you see what it is?) … bracketing consistency! In "Sarah gave the big dog the most enthusiastic hug", two crossing spans can't both be constituents at once. Parameter-estimation approaches automatically incorporate this type of information. This motivated the work of Klein & Manning (2002).
Interlude Before we discuss later work: how do we do parameter search for PCFGs? The EM algorithm is key. But as with HMM learning, there are exponentially many structures to consider. We need dynamic programming (but a different sort than for HMMs).
Review: EM for HMMs For the M-step, the fundamental quantity we need is the expected counts of state-to-state transitions. Each expected transition count is assembled from three pieces: α_i(t), the probability of getting from the beginning to state s_i at time t; β_j(t+1), the probability of getting from state s_j at time t+1 to the end; and a_ij b_ij(o_t), the probability of transitioning from state s_i at time t to state s_j at time t+1, emitting o_t.
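Written out (using the standard ξ notation for the posterior transition probability, which is my addition and not on the slide; the emission probability b_ij is attached to the transition, following the slide's convention):

\xi_t(i,j) \;=\; \frac{\alpha_i(t)\, a_{ij}\, b_{ij}(o_t)\, \beta_j(t+1)}{P(O)},
\qquad
\hat{c}(i \to j) \;=\; \sum_t \xi_t(i,j)

where P(O) is the total probability of the observation sequence; the M-step renormalizes the expected counts ĉ(i → j) over j to get the new transition probabilities.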
EM for PCFGs In PCFGs, for the M-step, the fundamental quantity is the expected counts of various rule rewrites. Computing these requires the familiar inside probability and something new: the outside probability.
Inside & outside probabilities In Manning & Schütze (1999), here's the picture: you're interested in a node N^j spanning part of the sentence; its inside probability (written β) is the probability of the words inside its span, given N^j, and its outside probability (written α) is the probability of generating N^j together with all the words outside its span.
"Inside Probabilities" Sum over all VP parses of "flies like an arrow": β_VP(1,5) = p(flies like an arrow | VP). Sum over all S parses of "time flies like an arrow": β_S(0,5) = p(time flies like an arrow | S). [example parse: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]]]] For that one parse, p(parse | S) = p(S → NP VP | S) * p(NP → time | NP) * p(VP → V PP | VP) * p(V → flies | V) * …
Compute Bottom-Up by CKY β_VP(1,5) = p(flies like an arrow | VP) = … β_S(0,5) = p(time flies like an arrow | S) = β_NP(0,1) * β_VP(1,5) * p(S → NP VP | S) + … [same example parse and rule probabilities as on the previous slide]
Compute Bottom-Up by CKY [CKY chart over "time 1 flies 2 like 3 an 4 arrow 5", filled bottom-up with weighted analyses; cell contents not reproduced] Weighted grammar for the example: 1 S → NP VP; 6 S → Vst NP; 2 S → S PP; 1 VP → V NP; 2 VP → VP PP; 1 NP → Det N; 2 NP → NP PP; 3 NP → NP NP; 0 PP → P NP
Compute Bottom-Up by CKY [same chart, with the weights converted to probabilities: 2^-1 S → NP VP; 2^-6 S → Vst NP; 2^-2 S → S PP; 2^-1 VP → V NP; 2^-2 VP → VP PP; 2^-1 NP → Det N; 2^-2 NP → NP PP; 2^-3 NP → NP NP; 2^-0 PP → P NP; one complete S analysis of the sentence gets probability 2^-22]
Compute Bottom-Up by CKY [same chart; a second complete S analysis of the sentence gets probability 2^-27]
The Efficient Version: Add as we go [same chart and grammar, but now each cell stores the summed probability of all analyses of that labeled span found so far, rather than listing every analysis separately]
The Efficient Version: Add as we go [same chart] Example of the summing step: when S(0,2) combines with PP(2,5), increment β_S(0,5) += β_S(0,2) * β_PP(2,5) * p(S → S PP | S)
Compute β probs bottom-up (CKY)
(need some initialization for the width-1 case: fill each width-1 cell from the lexical rules, β_X(i-1,i) = p(X → w_i | X))
for width := 2 to n (* build smallest first *)
  for i := 0 to n-width (* start *)
    let k := i + width (* end *)
    for j := i+1 to k-1 (* middle *)
      for all grammar rules X → Y Z
        β_X(i,k) += p(X → Y Z | X) * β_Y(i,j) * β_Z(j,k)
(what if you changed + to max? See the sketch below.)
[diagram: X spanning (i,k) built from Y spanning (i,j) and Z spanning (j,k)]
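Here is a compact Python rendering of that inside pass (the grammar representation, function name, and variable names are my own, for illustration only):

def inside_probs(words, lex_rules, bin_rules):
    """words: list of n tokens.
    lex_rules: dict mapping (X, word) -> p(X -> word | X)
    bin_rules: dict mapping (X, Y, Z) -> p(X -> Y Z | X)
    Returns beta, with beta[(X, i, k)] = total probability that X derives words[i:k]."""
    n = len(words)
    beta = {}
    # width-1 initialization from the lexical rules
    for i, w in enumerate(words):
        for (X, word), p in lex_rules.items():
            if word == w:
                beta[(X, i, i + 1)] = beta.get((X, i, i + 1), 0.0) + p
    # build wider spans from narrower ones, smallest first (CKY order)
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in bin_rules.items():
                    b_y = beta.get((Y, i, j), 0.0)
                    b_z = beta.get((Z, j, k), 0.0)
                    if b_y and b_z:
                        beta[(X, i, k)] = beta.get((X, i, k), 0.0) + p * b_y * b_z
    return beta

Replacing the accumulation by a max (and keeping backpointers) answers the slide's question about changing + to max: you would compute the probability of the single best parse (Viterbi CKY) instead of the sum over all parses.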
Inside & Outside Probabilities β_VP(1,5) = p(flies like an arrow | VP) is everything "inside" the VP; α_VP(1,5) = p(time VP today | S) is everything "outside" the VP. α_VP(1,5) * β_VP(1,5) = p(time [VP flies like an arrow] today | S). [example parse: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]] [NP today]]]
Inside & Outside Probabilities β_VP(1,5) = p(flies like an arrow | VP); α_VP(1,5) = p(time VP today | S). α_VP(1,5) * β_VP(1,5) = p(time flies like an arrow today & VP(1,5) | S). Dividing by β_S(0,6) = p(time flies like an arrow today | S) gives p(VP(1,5) | time flies like an arrow today, S). [same example parse]
Inside & Outside Probabilities β_VP(1,5) = p(flies like an arrow | VP); α_VP(1,5) = p(time VP today | S). So α_VP(1,5) * β_VP(1,5) / β_S(0,6) is the probability that there is a VP here, given all of the observed data (words). Strictly analogous to forward-backward in the finite-state case! [HMM trellis figure: Start, then states C and H at each time step]
Inside & Outside Probabilities β_V(1,2) = p(flies | V); α_VP(1,5) = p(time VP today | S); β_PP(2,5) = p(like an arrow | PP). So α_VP(1,5) * β_V(1,2) * β_PP(2,5) / β_S(0,6) is the probability that there is a VP → V PP here, given all of the observed data (words) … or is it? [same example parse]
Inside & Outside Probabilities β_V(1,2) = p(flies | V); α_VP(1,5) = p(time VP today | S); β_PP(2,5) = p(like an arrow | PP). So α_VP(1,5) * p(VP → V PP) * β_V(1,2) * β_PP(2,5) / β_S(0,6) is the probability that there is a VP → V PP here (at 1-2-5), given all of the observed data (words). Sum this probability over all position triples like (1,2,5) to get the expected count c(VP → V PP); re-estimate the PCFG! [same example parse]
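Spelled out in the α/β notation above (for a single sentence of length n; for a corpus, the same quantity is also summed over sentences), the expected count and the resulting M-step update are:

\hat{c}(\mathrm{VP} \to \mathrm{V\ PP}) \;=\;
  \sum_{i<j<k} \frac{\alpha_{\mathrm{VP}}(i,k)\; p(\mathrm{VP} \to \mathrm{V\ PP})\; \beta_{\mathrm{V}}(i,j)\; \beta_{\mathrm{PP}}(j,k)}{\beta_{\mathrm{S}}(0,n)},
\qquad
p_{\mathrm{new}}(\mathrm{VP} \to \mathrm{V\ PP}) \;=\;
  \frac{\hat{c}(\mathrm{VP} \to \mathrm{V\ PP})}{\sum_{\gamma} \hat{c}(\mathrm{VP} \to \gamma)}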
Compute β probs bottom-up p(flies like an arrow | VP) += p(V PP | VP) * p(flies | V) * p(like an arrow | PP). Summary: β_VP(1,5) += p(VP → V PP | VP) * β_V(1,2) * β_PP(2,5) (inside 1,5 built from inside 1,2 and inside 2,5). [figure: VP spanning (1,5) built from V "flies" spanning (1,2) and PP "like an arrow" spanning (2,5)]
Compute α probs top-down (uses β probs as well) p(time V like an arrow today | S) += p(time VP today | S) * p(V PP | VP) * p(like an arrow | PP). Summary: α_V(1,2) += p(VP → V PP | VP) * α_VP(1,5) * β_PP(2,5) (outside 1,2 built from outside 1,5 and inside 2,5). [figure: S dominating NP "time", the VP spanning (1,5), and NP "today"; the VP expanding as V(1,2) PP(2,5)]
Compute α probs top-down (uses β probs as well) p(time flies PP today | S) += p(time VP today | S) * p(V PP | VP) * p(flies | V). Summary: α_PP(2,5) += p(VP → V PP | VP) * α_VP(1,5) * β_V(1,2) (outside 2,5 built from outside 1,5 and inside 1,2). [same figure]
Compute β probs bottom-up When you build VP(1,5) from VP(1,2) and PP(2,5) during CKY, increment β_VP(1,5) by p(VP → VP PP) * β_VP(1,2) * β_PP(2,5). Why? β_VP(1,5) is the total probability of all derivations of p(flies like an arrow | VP), and we just found another. (See the earlier slide with the CKY chart.) [figure: VP(1,5) built from VP(1,2) "flies" and PP(2,5) "like an arrow"]
Compute β probs bottom-up (CKY)
for width := 2 to n (* build smallest first *)
  for i := 0 to n-width (* start *)
    let k := i + width (* end *)
    for j := i+1 to k-1 (* middle *)
      for all grammar rules X → Y Z
        β_X(i,k) += p(X → Y Z) * β_Y(i,j) * β_Z(j,k)
[diagram: X spanning (i,k) built from Y spanning (i,j) and Z spanning (j,k)]
Compute α probs top-down (reverse CKY)
for width := n downto 2 (* "unbuild" biggest first *)
  for i := 0 to n-width (* start *)
    let k := i + width (* end *)
    for j := i+1 to k-1 (* middle *)
      for all grammar rules X → Y Z
        α_Y(i,j) += ???
        α_Z(j,k) += ???
[diagram: X spanning (i,k) with children Y spanning (i,j) and Z spanning (j,k)]
Compute α probs top-down After computing the β probs during CKY, revisit constituents in reverse order (i.e., bigger constituents first). When you "unbuild" VP(1,5) from VP(1,2) and PP(2,5), increment α_VP(1,2) by α_VP(1,5) * p(VP → VP PP) * β_PP(2,5), and increment α_PP(2,5) by α_VP(1,5) * p(VP → VP PP) * β_VP(1,2). (The inside probs β_VP(1,2) and β_PP(2,5) were already computed on the bottom-up pass.) α_VP(1,2) is the total probability of all ways to generate VP(1,2) and all the outside words. [figure: S dominating NP "time", VP(1,5), and NP "today"; VP(1,5) expanding as VP(1,2) PP(2,5)]
Compute α probs top-down (reverse CKY)
for width := n downto 2 (* "unbuild" biggest first *)
  for i := 0 to n-width (* start *)
    let k := i + width (* end *)
    for j := i+1 to k-1 (* middle *)
      for all grammar rules X → Y Z
        α_Y(i,j) += α_X(i,k) * p(X → Y Z) * β_Z(j,k)
        α_Z(j,k) += α_X(i,k) * p(X → Y Z) * β_Y(i,j)
[diagram: X spanning (i,k) with children Y spanning (i,j) and Z spanning (j,k)]
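And a matching Python sketch of the outside pass plus the expected rule counts, continuing the illustrative inside_probs() sketch above (again, the data structures and names are invented for the example, not from the lecture):

def outside_probs(words, bin_rules, beta, start="S"):
    n = len(words)
    alpha = {(start, 0, n): 1.0}    # outside probability of the root span is 1
    for width in range(n, 1, -1):   # "unbuild" the biggest constituents first
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in bin_rules.items():
                    a_x = alpha.get((X, i, k), 0.0)
                    if not a_x:
                        continue
                    alpha[(Y, i, j)] = alpha.get((Y, i, j), 0.0) + a_x * p * beta.get((Z, j, k), 0.0)
                    alpha[(Z, j, k)] = alpha.get((Z, j, k), 0.0) + a_x * p * beta.get((Y, i, j), 0.0)
    return alpha

def expected_rule_counts(words, bin_rules, beta, alpha, start="S"):
    """Posterior expected count of each binary rule in this one sentence."""
    n = len(words)
    sent_prob = beta.get((start, 0, n), 0.0)
    counts = {}
    for (X, Y, Z), p in bin_rules.items():
        total = 0.0
        for i in range(n):
            for k in range(i + 2, n + 1):
                for j in range(i + 1, k):
                    total += (alpha.get((X, i, k), 0.0) * p *
                              beta.get((Y, i, j), 0.0) * beta.get((Z, j, k), 0.0))
        counts[(X, Y, Z)] = total / sent_prob if sent_prob else 0.0
    return counts

An EM iteration sums these expected counts (together with the analogous counts for lexical rules) over the whole corpus and renormalizes per left-hand-side symbol to obtain the re-estimated PCFG.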
Back to the actual history Early approaches: Pereira & Schabes (1992), Carroll & Charniak (1994). Initialize a simple PCFG using some "good guesses", then use EM to re-estimate. Horrible failure.
EM on a constituent-context model Klein & Manning (2002) used a generative model that constructed sentences through bracketings: (1) choose a bracketing of a sentence; (2) for each span in the bracketing, (independently) generate a yield and a context. This is horribly deficient, but it worked well! [figure: a span's yield and its context]
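As a toy illustration of the generative story on this slide (my simplification for exposition, not Klein & Manning's full model, which also scores non-bracketed spans; all names below are hypothetical): given a bracketing, each bracketed span independently emits its yield and its context.

import math

def ccm_log_prob(tags, bracketing, p_yield, p_context, p_bracketing):
    """tags: list of POS tags for the sentence; bracketing: set of (i, j) spans.
    p_yield / p_context: dicts giving the probability of emitting a given yield
    (tag tuple) or context (pair of boundary tags); assumed smoothed elsewhere.
    p_bracketing: prior probability of this bracketing."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    logp = math.log(p_bracketing)
    for (i, j) in bracketing:
        span_yield = tuple(tags[i:j])
        span_context = (padded[i], padded[j + 1])   # symbols just before and after the span
        logp += math.log(p_yield[span_yield]) + math.log(p_context[span_context])
    return logp

The deficiency the slide mentions comes from the same words being generated over and over by nested spans, so the model's probability mass does not sum to one over sentences; EM on this kind of model nevertheless finds good bracketings.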
Adding a dependency component Klein & Manning (2004) added a dependency component, which wasn’t deficient – a nice generative model! The dependency component introduced valence – the idea that a governor’s dependent-seeking behavior changes once it has a dependent
Constituent-context + dependency Some really nifty factorized-inference techniques allowed these models to be combined. The results are very good.
More recent developments Improved parameter search (Smith, 2006); Bayesian methods (Liang et al., 2007); unsupervised all-subtrees parsing (Bod, 2007).