Statistical NLP Spring 2010 Lecture 14: PCFGs Dan Klein – UC Berkeley

Treebank PCFGs
- Use PCFGs for broad-coverage parsing
- Can take a grammar right off the trees (doesn't work well):
    ROOT → S (1)
    S → NP VP . (1)
    NP → PRP (1)
    VP → VBD ADJP (1)
    …

    Model                  | F1
    Baseline [Charniak 96] | 72.0
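As an illustration (not from the lecture), a minimal Python sketch of reading a PCFG off treebank trees with relative-frequency estimates; the (label, children…) tuple format and function names are assumptions:

    from collections import Counter, defaultdict

    def count_rules(tree, counts):
        # A tree is (label, child, child, ...); a leaf is a plain string,
        # so a preterminal like ("PRP", "He") yields the rule PRP -> 'He'.
        label, children = tree[0], tree[1:]
        if all(isinstance(c, str) for c in children):
            counts[(label, tuple(children))] += 1
            return
        counts[(label, tuple(c[0] for c in children))] += 1
        for child in children:
            count_rules(child, counts)

    def estimate_pcfg(trees):
        # Relative-frequency estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
        counts = Counter()
        for t in trees:
            count_rules(t, counts)
        totals = defaultdict(int)
        for (lhs, _), c in counts.items():
            totals[lhs] += c
        return {rule: c / totals[rule[0]] for rule, c in counts.items()}

    tree = ("S", ("NP", ("PRP", "He")),
                 ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
    print(estimate_pcfg([tree]))   # one tree: every rule comes out with probability 1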

Conditional Independence?
- Not every NP expansion can fill every NP slot
- A grammar with symbols like "NP" won't be context-free
- Statistically, conditional independence is too strong

Non-Independence
- Independence assumptions are often too strong.
- Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
- Also: the subject and object expansions are correlated!
(Figure: expansion distributions for all NPs, NPs under S, and NPs under VP.)

Grammar Refinement
- Example: PP attachment

Grammar Refinement
- Structure Annotation [Johnson '98, Klein & Manning '03]
- Lexicalization [Collins '99, Charniak '00]
- Latent Variables [Matsuzaki et al. '05, Petrov et al. '06]

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Structural annotation

Typical Experimental Setup
- Corpus: Penn Treebank, WSJ
- Accuracy – F1: harmonic mean of per-node labeled precision and recall
- Here: also size – number of symbols in the grammar
  - Passive / complete symbols: NP, NP^S
  - Active / incomplete symbols: NP → NP CC •
- Training: sections 02-21
- Development: section 22 (here, first 20 files)
- Test: section 23
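A short sketch (not from the lecture) of the labeled-F1 computation over (label, start, end) brackets; evalb-style details are simplified, and the bracket extraction is assumed done upstream:

    from collections import Counter

    def labeled_f1(gold_brackets, test_brackets):
        # Brackets are (label, start, end) triples; matches are counted with
        # multiplicity, roughly as in evalb-style evaluation.
        gold, test = Counter(gold_brackets), Counter(test_brackets)
        matched = sum((gold & test).values())
        precision = matched / sum(test.values())
        recall = matched / sum(gold.values())
        return 2 * precision * recall / (precision + recall)

    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
    test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 2, 5)]
    print(labeled_f1(gold, test))   # 0.75: three of four brackets match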

Vertical Markovization
- Vertical Markov order: rewrites depend on past k ancestor nodes (cf. parent annotation)
(Figure: the same tree annotated at order 1 and at order 2.)
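A minimal sketch of order-2 vertical markovization (parent annotation), assuming the same (label, children…) tuple trees as above; leaving preterminals unsplit is a common choice, not something the slide specifies:

    def parent_annotate(tree, parent="ROOT"):
        # Relabel each phrasal node X as X^parent; a preterminal (a tag over
        # a word) is left as-is.
        label, children = tree[0], tree[1:]
        if all(isinstance(c, str) for c in children):
            return tree
        return (f"{label}^{parent}",) + tuple(parent_annotate(c, label) for c in children)

    tree = ("S", ("NP", ("PRP", "He")),
                 ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
    print(parent_annotate(tree))
    # ('S^ROOT', ('NP^S', ('PRP', 'He')),
    #            ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))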

Horizontal Markovization
- Horizontal Markov order: rewrites depend on past k sibling nodes, i.e., how much of each rule's right-hand side the binarized symbols remember
(Figure: the same rule binarized at order 1 and at order ∞.)
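A sketch (not from the lecture) of order-h horizontal markovization via left-to-right binarization; the "@"-prefixed intermediate-symbol naming is an illustrative assumption:

    def binarize(lhs, rhs, h=1):
        # Replace lhs -> rhs with binary rules whose intermediate symbols
        # remember only the last h generated siblings.
        if len(rhs) <= 2:
            return [(lhs, tuple(rhs))]
        rules, prev = [], lhs
        for i in range(len(rhs) - 2):
            context = rhs[max(0, i + 1 - h):i + 1]        # last h siblings so far
            inter = f"@{lhs}->..._" + "_".join(context)   # intermediate symbol
            rules.append((prev, (rhs[i], inter)))
            prev = inter
        rules.append((prev, (rhs[-2], rhs[-1])))
        return rules

    for r in binarize("VP", ["VBD", "NP", "NP", "PP"], h=1):
        print(r)
    # ('VP', ('VBD', '@VP->..._VBD'))
    # ('@VP->..._VBD', ('NP', '@VP->..._NP'))
    # ('@VP->..._NP', ('NP', 'PP'))

With h=1, the two NP-context intermediate symbols collapse into one, which is exactly the parameter tying that markovization buys.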

Unary Splits
- Problem: unary rewrites are used to transmute categories so a high-probability rule can be used
- Solution: mark unary rewrite sites with -U
(Table: F1 and grammar size for Base vs. UNARY.)

Tag Splits
- Problem: Treebank tags are too coarse
- Example: sentential, PP, and other prepositions are all marked IN
- Partial solution: subdivide the IN tag
(Table: F1 and grammar size for Previous vs. SPLIT-IN.)

Other Tag Splits
- UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")
- UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")
- TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)
- SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
- SPLIT-CC: separate "but" and "&" from other conjunctions
- SPLIT-%: "%" gets its own tag
(Table: cumulative F1 and grammar size after each split.)

A Fully Annotated (Unlex) Tree

Some Test Set Results
- Beats "first generation" lexicalized parsers
- Lots of room to improve – more complex models next
(Table: LP, LR, F1, CB, and 0 CB for Magerman, Collins, Unlexicalized, Charniak, and Collins.)

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Structural annotation [Johnson '98, Klein & Manning '03]
- Head lexicalization [Collins '99, Charniak '00]

Problems with PCFGs
- If we do no annotation, these trees differ only in one rule:
  - VP → VP PP
  - NP → NP PP
- Parse will go one way or the other, regardless of words
- We addressed this in one way with unlexicalized grammars (how?)
- Lexicalization allows us to be sensitive to specific words

Problems with PCFGs
- What's different between basic PCFG scores here?
- What (lexical) correlations need to be scored?

Lexicalized Trees
- Add "head words" to each phrasal node
  - Syntactic vs. semantic heads
  - Headship not in (most) treebanks
  - Usually use head rules, e.g.:
    - NP: take leftmost NP; else take rightmost N*; else take rightmost JJ; else take right child
    - VP: take leftmost VB*; else take leftmost VP; else take left child
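A toy Python sketch of rule-based head finding in the spirit of Collins-style head rules; only the two rule lists from the slide are implemented, and the default fallback is an assumption:

    def head_child(label, children):
        # `children` is a list of child category labels; return the index of
        # the head child according to the slide's NP and VP rules.
        if label == "NP":
            for i, c in enumerate(children):              # leftmost NP
                if c == "NP": return i
            for i in range(len(children) - 1, -1, -1):    # else rightmost N*
                if children[i].startswith("N"): return i
            for i in range(len(children) - 1, -1, -1):    # else rightmost JJ
                if children[i] == "JJ": return i
            return len(children) - 1                      # else right child
        if label == "VP":
            for i, c in enumerate(children):              # leftmost VB*
                if c.startswith("VB"): return i
            for i, c in enumerate(children):              # else leftmost VP
                if c == "VP": return i
            return 0                                      # else left child
        return 0  # default for other categories: an assumption, not from the slide

    print(head_child("VP", ["VBD", "NP", "PP"]))  # 0 -> VBD heads the VP
    print(head_child("NP", ["DT", "JJ", "NN"]))   # 2 -> rightmost N*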

Lexicalized PCFGs?
- Problem: we now have to estimate probabilities of whole lexicalized rules, e.g. P(VP[saw] → VBD[saw] NP[her] PP[on])
- Never going to get these atomically off of a treebank
- Solution: break up the derivation into smaller steps

Lexical Derivation Steps
- A derivation of a local tree [Collins 99]:
  1. Choose a head tag and word
  2. Choose a complement bag
  3. Generate children (incl. adjuncts)
  4. Recursively derive children

Lexicalized CKY

    bestScore(X, i, j, h)
      if (j == i + 1)
        return tagScore(X, s[i])
      else
        return max over k, h', X → Y Z of:
          score(X[h] → Y[h] Z[h']) * bestScore(Y, i, k, h) * bestScore(Z, k, j, h')
          score(X[h] → Y[h'] Z[h]) * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)

(Figure: combining Y[h] over [i,k] with Z[h'] over [k,j] into X[h]; e.g. (VP → VBD •)[saw] + NP[her] ⇒ (VP → VBD…NP •)[saw].)

Pruning with Beams
- The Collins parser prunes with per-cell beams [Collins 99]
  - Essentially, run the O(n^5) CKY
  - Remember only a few hypotheses for each span
  - If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?)
  - Keeps things more or less cubic
- Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed)
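A minimal sketch of the per-cell beam, assuming each chart cell maps (symbol, head index) pairs to Viterbi log-scores; the data layout is illustrative, not the Collins parser's actual structures:

    import heapq

    def prune_to_beam(cell, K):
        # Keep only the K highest-scoring hypotheses for one span.
        return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

    cell = {("VP", 2): -2.3, ("VP", 4): -9.1, ("S", 2): -3.0,
            ("NP", 4): -1.2, ("FRAG", 2): -12.0}
    print(prune_to_beam(cell, K=3))
    # {('NP', 4): -1.2, ('VP', 2): -2.3, ('S', 2): -3.0}
    # With K hypotheses per span, combining two sub-spans touches at most
    # K * K pairs, which is what keeps the work per span near O(nK^2).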

Pruning with a PCFG
- The Charniak parser prunes using a two-pass approach [Charniak 97+]
  - First, parse with the base grammar
  - For each X:[i,j], calculate P(X | i, j, s)
    - This isn't trivial, and there are clever speed-ups
  - Second, do the full O(n^5) CKY
    - Skip any X:[i,j] which had a low posterior (below a small threshold)
  - Avoids almost all work in the second phase!
- Charniak et al. '06: can use more passes
- Petrov et al. '07: can use many more passes
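A sketch of the posterior-threshold test, assuming inside and outside probabilities from the coarse pass are precomputed; the dictionary layout and the threshold value are illustrative assumptions:

    def prune_chart(inside, outside, sentence_prob, threshold=1e-4):
        # Keep X:[i,j] only if
        # P(X over [i,j] | sentence) = outside * inside / P(sentence) >= threshold.
        keep = set()
        for item, in_p in inside.items():
            if outside.get(item, 0.0) * in_p / sentence_prob >= threshold:
                keep.add(item)
        return keep

    inside = {("NP", 0, 2): 0.010, ("VP", 2, 5): 0.020, ("X", 0, 2): 1e-9}
    outside = {("NP", 0, 2): 0.200, ("VP", 2, 5): 0.100, ("X", 0, 2): 0.100}
    print(prune_chart(inside, outside, sentence_prob=0.004))
    # {('NP', 0, 2), ('VP', 2, 5)} -- the low-posterior X item is skipped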

Pruning with A*
- You can also speed up the search without sacrificing optimality
- For agenda-based parsers:
  - Can select which items to process first
  - Can do so with any "figure of merit" [Charniak 98]
  - If your figure of merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]
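An illustrative skeleton of agenda-based best-first search over chart items, not the lecture's code: items carry Viterbi inside log-probabilities, and the pop order adds an outside estimate supplied by `heuristic`; `combine` is the parser's item-combination step:

    import heapq

    def astar_parse(initial_items, combine, heuristic, goal):
        # If `heuristic` never underestimates the best outside score (a valid
        # A* heuristic), the first time the goal is popped it is optimal
        # [Klein and Manning 03].
        agenda = [(-(s + heuristic(it)), s, it) for it, s in initial_items]
        heapq.heapify(agenda)
        best = {}
        while agenda:
            _, score, item = heapq.heappop(agenda)
            if item in best:
                continue                    # already popped with a better score
            best[item] = score
            if item == goal:
                return score
            for new_item, new_score in combine(item, best):
                heapq.heappush(agenda,
                               (-(new_score + heuristic(new_item)), new_score, new_item))
        return None

    # Degenerate usage: a heuristic of 0 gives plain uniform-cost ordering.
    print(astar_parse([("NP:0-2", -1.5), ("S:0-5", -0.2)],
                      combine=lambda item, best: [],
                      heuristic=lambda item: 0.0,
                      goal="S:0-5"))        # -0.2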

Projection-Based A*
(Figure: the sentence "Factory payrolls fell in Sept." with three projections of its lexicalized parse: the constituent projection (S, VP, NP, PP), the head-word projection (fell, payrolls, in), and the combined symbols (S:fell, VP:fell, NP:payrolls, PP:in).)

A* Speedup
- Total time dominated by calculation of A* tables in each projection… O(n^3)

Results
- Some results:
  - Collins 99 – 88.6 F1 (generative lexical)
  - Charniak and Johnson 05 – 89.7 / 91.3 F1 (generative lexical / reranked)
  - Petrov et al. 06 – 90.7 F1 (generative unlexicalized)
  - McClosky et al. 06 – 92.1 F1 (gen + rerank + self-train)
- However:
  - Bilexical counts rarely make a difference (why?)
  - Gildea 01 – removing bilexical counts costs < 0.5 F1

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve statistical fit of the grammar
- Structural annotation
- Head lexicalization
- Automatic clustering?

Latent Variable Grammars
(Figure: a sentence, its parse tree, the derivations over latent subcategories, and the grammar parameters.)

Learning Latent Annotations
- EM algorithm:
  - Brackets are known
  - Base categories are known
  - Only induce subcategories
- Just like Forward-Backward for HMMs
(Figure: the tree over "He was right." with latent subcategories X1…X7 and forward/backward passes.)
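To make one piece of this concrete, here is a toy M-step sketch (not from the lecture): the E-step's expected rule counts, computed with inside-outside over the known treebank brackets, are assumed given, and the "NP-3"-style subcategory naming is illustrative:

    from collections import defaultdict

    def m_step(expected_counts):
        # Renormalize expected rule counts per parent subcategory.
        totals = defaultdict(float)
        for (lhs, _), c in expected_counts.items():
            totals[lhs] += c
        return {rule: c / totals[rule[0]] for rule, c in expected_counts.items()}

    counts = {("NP-0", ("DT-1", "NN-0")): 3.2, ("NP-0", ("PRP-0",)): 0.8,
              ("NP-1", ("PRP-1",)): 2.0}
    print(m_step(counts))
    # {('NP-0', ('DT-1', 'NN-0')): 0.8, ('NP-0', ('PRP-0',)): 0.2,
    #  ('NP-1', ('PRP-1',)): 1.0}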

Refinement of the DT tag
(Figure: the DT tag split into subcategories DT-1 through DT-4.)

Hierarchical refinement

Hierarchical Estimation Results

    Model                 | F1
    Flat Training         | 87.3
    Hierarchical Training | 88.4

Refinement of the, tag  Splitting all categories equally is wasteful:

Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, roll back the splits which were least useful

Adaptive Splitting Results

    Model            | F1
    Previous         | 88.4
    With 50% Merging | 89.5

Number of Phrasal Subcategories

Number of Lexical Subcategories

Learned Splits
- Proper nouns (NNP):

    NNP-14 | Oct.  | Nov.      | Sept.
    NNP-12 | John  | Robert    | James
    NNP-2  | J.    | E.        | L.
    NNP-1  | Bush  | Noriega   | Peters
    NNP-15 | New   | San       | Wall
    NNP-3  | York  | Francisco | Street

- Personal pronouns (PRP):

    PRP-0 | It | He   | I
    PRP-1 | it | he   | they
    PRP-2 | it | them | him

Learned Splits
- Relative adverbs (RBR):

    RBR-0 | further | lower   | higher
    RBR-1 | more    | less    | More
    RBR-2 | earlier | Earlier | later

- Cardinal numbers (CD):

    CD-7  | one     | two     | Three
    CD-11 | million | billion | trillion
    (plus several other CD subcategories of numerals)

Coarse-to-Fine Inference
- Example: PP attachment

Prune?
- For each chart item X[i,j], compute the posterior probability:

    P(X over [i,j] | sentence) = inside(X, i, j) · outside(X, i, j) / P(sentence)

- Prune items whose posterior falls below a threshold
(Figure: coarse and refined charts; e.g., consider the span 5 to 12.)

Bracket Posteriors

Hierarchical Pruning

    coarse:          … QP NP VP …
    split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
    split in four:   … QP1 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
    split in eight:  …

(Items pruned at one level, like QP2 above, are not refined at the next.)
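A small sketch of how a refined symbol can be projected back to a coarser level of the split hierarchy for pruning; the "NP-13" naming and the binary-split index scheme (indices halve at each coarser level) are illustrative assumptions:

    def project(symbol, level, max_level):
        # Map a symbol refined at `max_level` down to its ancestor at `level`.
        base, _, idx = symbol.partition("-")
        if not idx:
            return base                   # unsplit symbol: same at every level
        coarse_idx = int(idx) >> (max_level - level)
        return base if level == 0 else f"{base}-{coarse_idx}"

    print(project("NP-13", level=2, max_level=4))  # NP-3
    print(project("NP-13", level=0, max_level=4))  # NP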

Final Results (Accuracy)
(Table: F1 for sentences of ≤ 40 words and for all sentences — ENG: Charniak & Johnson '05 (generative) vs. Split/Merge; GER: Dubey '05 vs. Split/Merge; CHN: Chiang et al. '02 vs. Split/Merge.)
- Still higher numbers from reranking / self-training methods

Bracket Posteriors (after G0)

Bracket Posteriors (after G1)

Bracket Posteriors (Final)

Bracket Posteriors (Best Tree)

(Figure: the grammar hierarchy X-Bar = G0, G1, …, G6 = G used for learning and for coarse-to-fine parsing, with per-level shares of 60%, 12%, 7%, 6%, 5%, 4%.)

(Figure: total parsing time falls from 1621 min to 111 min, 35 min, and finally 15 min, with no search error.)

Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols:
- What are the most useful "features" to encode?
(Examples: parent annotation [Johnson 98]; marking possessive NPs.)

Non-Independence II
- Who cares?
  - NB, HMMs, all make false assumptions!
  - For generation, consequences would be obvious
  - For parsing, does it impact accuracy?
- Symptoms of overly strong assumptions:
  - Rewrites get used where they don't belong
  - Rewrites get used too often or too rarely
(Figure: in the PTB, this construction is for possessives.)

Lexicalization
- Lexical heads are important for certain classes of ambiguities (e.g., PP attachment)
- Lexicalizing the grammar creates a much larger grammar (cf. next week)
  - Sophisticated smoothing needed
  - Smarter parsing algorithms
  - More data needed
- How necessary is lexicalization?
  - Bilexical vs. monolexical selection
  - Closed vs. open class lexicalization

Lexical Derivation Steps
- Derivation of a local tree [simplified Charniak 97]: VP[saw] → VBD[saw] NP[her] NP[today] PP[on]
- Generate the children head-outward from VBD[saw], updating the dotted state:
    (VP → VBD •)[saw] ⇒ + NP[her] ⇒ (VP → VBD…NP •)[saw] ⇒ + NP[today] ⇒ (VP → VBD…NP •)[saw] ⇒ + PP[on] ⇒ (VP → VBD…PP •)[saw] ⇒ VP[saw]
- Still have to smooth with mono- and non-lexical backoffs

Manual Annotation
- Manually split categories:
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
- Advantages:
  - Fairly compact grammar
  - Linguistic motivations
- Disadvantages:
  - Performance leveled out
  - Manually annotated

    Model                  | F1
    Naïve Treebank Grammar | 72.6
    Klein & Manning '03    | 86.3

Automatic Annotation Induction
- Advantages:
  - Automatically learned: label all nodes with latent variables
  - Same number k of subcategories for all categories
- Disadvantages:
  - Grammar gets too large
  - Most categories are oversplit while others are undersplit

    Model                | F1
    Klein & Manning '03  | 86.3
    Matsuzaki et al. '05 | 86.7

Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:

    loss(split) = data likelihood with the split reversed / data likelihood with the split

- No loss in accuracy when 50% of the splits are reversed
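Written out (notation ours, not from the slides), the merge criterion is the likelihood ratio

    \mathrm{loss}(s) \;=\; \frac{\mathcal{L}\left(D \mid G_{\setminus s}\right)}{\mathcal{L}\left(D \mid G\right)}

where G_{\setminus s} is the grammar with split s merged back. Splits whose ratio is closest to 1 are the cheapest to reverse, so those are the ones rolled back first.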

Final Results (Accuracy)
(Table: F1 for sentences of ≤ 40 words and for all sentences — ENG: Charniak & Johnson '05 (generative) vs. Petrov and Klein; GER: Dubey '05 vs. Petrov and Klein; CHN: Chiang et al. '02 vs. Petrov and Klein.)

Final Results
(Table: F1 at ≤ 40 words and on all words for Klein & Manning '03, Matsuzaki et al. '05, Collins '99, Charniak & Johnson '05, and Petrov et al. '06.)

Vertical and Horizontal
- Examples:
  - Raw treebank: v=1, h=∞
  - Johnson 98: v=2, h=∞
  - Collins 99: v=2, h=2
  - Best F1: v=3, h=2
(Table: F1 and grammar size for the base model v=h=2.)

Problems with PCFGs
- Another example of PCFG indifference
  - Left structure is far more common
  - How to model this?
  - Really structural: "chicken with potatoes with salad"
  - Lexical parsers model this effect, but not by virtue of being lexical

Naïve Lexicalized Parsing
- Can, in principle, use CKY on lexicalized PCFGs
  - O(Rn^3) time and O(Sn^2) memory
  - But R = rV^2 and S = sV
  - Result is completely impractical (why?)
  - Memory: 10K rules * 50K words * (40 words)^2 * 8 bytes ≈ 6TB
- Can modify CKY to exploit lexical sparsity
  - Lexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary word
  - Result: O(rn^5) time, O(sn^3) memory
  - Memory: 10K rules * (40 words)^3 * 8 bytes ≈ 5GB
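The slide's back-of-the-envelope arithmetic, checked in a few lines:

    rules, vocab, n, bytes_per = 10_000, 50_000, 40, 8
    print(rules * vocab * n**2 * bytes_per / 1e12)  # 6.4 -> ~6 TB, naive version
    print(rules * n**3 * bytes_per / 1e9)           # 5.12 -> ~5 GB, with lexical sparsity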

Quartic Parsing
- Turns out, you can do (a little) better [Eisner 99]
- Gives an O(n^4) algorithm
- Still prohibitive in practice if not pruned
(Figure: item combination with and without tracking the second head h'.)