Statistical NLP Spring 2010 Lecture 13: Parsing II Dan Klein – UC Berkeley

Classical NLP: Parsing
- Write symbolic or logical rules:
  Grammar (CFG): ROOT → S, S → NP VP, NP → DT NN, NP → NN NNS, NP → NP PP, VP → VBP NP, VP → VBP NP PP, PP → IN NP, …
  Lexicon: NN → interest, NNS → raises, VBP → interest, VBZ → raises, …
- Use deduction systems to prove parses from words
- Minimal grammar on the “Fed raises” sentence: 36 parses
- Simple 10-rule grammar: 592 parses
- Real-size grammar: many millions of parses
- This scaled very badly and didn’t yield broad-coverage tools

Probabilistic Context-Free Grammars
- A context-free grammar is a tuple (N, T, S, R):
  - N: the set of non-terminals
    - Phrasal categories: S, NP, VP, ADJP, etc.
    - Parts-of-speech (pre-terminals): NN, JJ, DT, VB
  - T: the set of terminals (the words)
  - S: the start symbol
    - Often written as ROOT or TOP
    - Not usually the sentence non-terminal S
  - R: the set of rules
    - Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
    - Examples: S → NP VP, VP → VP CC VP
    - Also called rewrites, productions, or local trees
- A PCFG adds a top-down production probability per rule, P(Y1 Y2 … Yk | X)
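To make the definition concrete, here is a minimal sketch of a toy PCFG as plain Python data; the dict layout and the rule set are illustrative, not the lecture's code.

  from collections import defaultdict

  pcfg = defaultdict(list)          # X -> list of ((Y1, ..., Yk), P(Y1..Yk | X))
  pcfg["S"].append((("NP", "VP"), 1.0))
  pcfg["NP"].append((("DT", "NN"), 0.6))
  pcfg["NP"].append((("NP", "PP"), 0.4))
  pcfg["VP"].append((("VBP", "NP"), 1.0))
  pcfg["PP"].append((("IN", "NP"), 1.0))

  # Sanity check: a proper PCFG's expansion probabilities sum to 1 per parent.
  for parent, rules in pcfg.items():
      assert abs(sum(p for _, p in rules) - 1.0) < 1e-9, parent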

Treebank Sentences

Treebank Grammars
- Need a PCFG for broad-coverage parsing.
- Can take a grammar right off the trees (doesn’t work well):
  ROOT → S 1, S → NP VP . 1, NP → PRP 1, VP → VBD ADJP 1, …
- Better results by enriching the grammar (e.g., lexicalization).
- Can also get reasonable parsers without lexicalization.

Treebank Grammar Scale
- Treebank grammars can be enormous
- As FSAs, the raw grammar has ~10K states, excluding the lexicon
- Better parsers usually make the grammars larger, not smaller
(Figure: an FSA fragment over NP expansions, e.g. DET ADJ NOUN, PLURAL NOUN, NP CONJ NP, NP PP)

Chomsky Normal Form
- Chomsky normal form: all rules of the form X → Y Z or X → w
- In principle, this is no limitation on the space of (P)CFGs
  - N-ary rules introduce new non-terminals: VP → VBD NP PP PP becomes VP → [VP → VBD NP PP •] PP, [VP → VBD NP PP •] → [VP → VBD NP •] PP, and so on
  - Unaries / empties are “promoted”
- In practice it’s kind of a pain:
  - Reconstructing n-aries is easy
  - Reconstructing unaries is trickier
  - The straightforward transformations don’t preserve tree scores
- Makes parsing algorithms simpler! (See the binarization sketch below.)
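A hedged sketch of the n-ary-to-binary conversion described above, using "dotted" intermediate symbols so the original rule stays recoverable; the binarize function and the symbol naming are illustrative, not the lecture's implementation.

  def binarize(parent, children):
      """Replace X -> Y1 ... Yk (k > 2) by binary rules over dotted intermediates."""
      children = list(children)
      if len(children) <= 2:
          return [(parent, tuple(children))]
      rules = []
      left = "[%s -> %s .]" % (parent, " ".join(children[:2]))
      rules.append((left, (children[0], children[1])))
      for i in range(2, len(children) - 1):
          new = "[%s -> %s .]" % (parent, " ".join(children[:i + 1]))
          rules.append((new, (left, children[i])))
          left = new
      rules.append((parent, (left, children[-1])))
      return rules

  print(binarize("VP", ["VBD", "NP", "PP", "PP"]))
  # [('[VP -> VBD NP .]', ('VBD', 'NP')),
  #  ('[VP -> VBD NP PP .]', ('[VP -> VBD NP .]', 'PP')),
  #  ('VP', ('[VP -> VBD NP PP .]', 'PP'))]

Undoing the transform (reconstructing the n-ary rule) just means splicing out every dotted symbol, which is why the slide calls that direction easy.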

A Recursive Parser
- Will this parser work? Why or why not?
- Memory requirements?

  bestScore(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max over rules X → Y Z and split points k of
        score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
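The pseudocode can be turned into a runnable sketch; the toy GRAMMAR, LEXICON, and function names below are illustrative, not the lecture's implementation. Without memoization the same spans get re-derived over and over, which is the answer to the question above.

  GRAMMAR = {"S": [("NP", "VP", 0.9)], "NP": [("DT", "NN", 0.5)], "VP": [("VBP", "NP", 0.7)]}
  LEXICON = {("DT", "the"): 1.0, ("NN", "cat"): 0.5, ("NN", "dog"): 0.5, ("VBP", "saw"): 1.0}

  def tag_score(X, word):
      return LEXICON.get((X, word), 0.0)

  def best_score(X, i, j, s):
      if j == i + 1:                       # width-1 span: look up the tag
          return tag_score(X, s[i])
      best = 0.0
      for Y, Z, p in GRAMMAR.get(X, []):   # every binary rule X -> Y Z
          for k in range(i + 1, j):        # every split point k
              best = max(best, p * best_score(Y, i, k, s) * best_score(Z, k, j, s))
      return best

  print(best_score("S", 0, 5, "the cat saw the dog".split()))   # ~0.039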

A Memoized Parser
- One small change:

  bestScore(X, i, j, s)
    if (scores[X][i][j] == null)
      if (j == i+1)
        score = tagScore(X, s[i])
      else
        score = max over rules X → Y Z and split points k of
          score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
      scores[X][i][j] = score
    return scores[X][i][j]
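The same sketch with the one-line change: functools.lru_cache plays the role of the scores[X][i][j] table (again a toy illustration, not the lecture's code).

  from functools import lru_cache

  GRAMMAR = {"S": [("NP", "VP", 0.9)], "NP": [("DT", "NN", 0.5)], "VP": [("VBP", "NP", 0.7)]}
  LEXICON = {("DT", "the"): 1.0, ("NN", "cat"): 0.5, ("NN", "dog"): 0.5, ("VBP", "saw"): 1.0}
  SENT = tuple("the cat saw the dog".split())

  @lru_cache(maxsize=None)
  def best_score(X, i, j):
      if j == i + 1:
          return LEXICON.get((X, SENT[i]), 0.0)
      return max([p * best_score(Y, i, k) * best_score(Z, k, j)
                  for Y, Z, p in GRAMMAR.get(X, [])
                  for k in range(i + 1, j)], default=0.0)

  print(best_score("S", 0, len(SENT)))   # same value as before, polynomial time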

A Bottom-Up Parser (CKY)
- Can also organize things bottom-up:

  bestScore(s)
    for (i : [0, n-1])
      for (X : tags[s[i]])
        score[X][i][i+1] = tagScore(X, s[i])
    for (diff : [2, n])
      for (i : [0, n-diff])
        j = i + diff
        for (X → Y Z : rule)
          for (k : [i+1, j-1])
            score[X][i][j] = max(score[X][i][j],
                                 score(X → Y Z) * score[Y][i][k] * score[Z][k][j])
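A direct bottom-up transcription of the CKY pseudocode, again as a toy sketch; the grammar, lexicon, and dict-based score table are illustrative rather than the lecture's optimized parser.

  from collections import defaultdict

  GRAMMAR = [("S", "NP", "VP", 0.9), ("NP", "DT", "NN", 0.5), ("VP", "VBP", "NP", 0.7)]
  LEXICON = {"the": [("DT", 1.0)], "cat": [("NN", 0.5)], "dog": [("NN", 0.5)], "saw": [("VBP", 1.0)]}

  def cky(words):
      n = len(words)
      score = defaultdict(float)                          # missing entries = 0.0
      for i, w in enumerate(words):                       # width-1 spans
          for X, p in LEXICON.get(w, []):
              score[X, i, i + 1] = p
      for diff in range(2, n + 1):                        # widen spans
          for i in range(0, n - diff + 1):
              j = i + diff
              for X, Y, Z, p in GRAMMAR:                  # binary rules
                  for k in range(i + 1, j):
                      cand = p * score[Y, i, k] * score[Z, k, j]
                      if cand > score[X, i, j]:
                          score[X, i, j] = cand
      return score

  chart = cky("the cat saw the dog".split())
  print(chart["S", 0, 5])                                 # best root score, ~0.039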

Unary Rules
- Unary rules?

  bestScore(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max of
        max over rules X → Y Z and split points k of
          score(X → Y Z) * bestScore(Y, i, k, s) * bestScore(Z, k, j, s)
        max over unary rules X → Y of
          score(X → Y) * bestScore(Y, i, j, s)

CNF + Unary Closure
- We need unaries to be non-cyclic
- Can address this by pre-calculating the unary closure
- Rather than having zero or more unaries, always have exactly one
- Alternate unary and binary layers
- Reconstruct unary chains afterwards

Alternating Layers

  bestScoreU(X, i, j, s)
    if (j == i+1)
      return tagScore(X, s[i])
    else
      return max over unary rules X → Y of
        score(X → Y) * bestScoreB(Y, i, j, s)

  bestScoreB(X, i, j, s)
    return max over rules X → Y Z and split points k of
      score(X → Y Z) * bestScoreU(Y, i, k, s) * bestScoreU(Z, k, j, s)
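One way to precompute the unary closure is to repeatedly relax chains of unaries until no score improves. The sketch below is a hedged illustration with a made-up unary rule set; the empty chain gets score 1 so that "exactly one unary" subsumes the no-unary case.

  UNARIES = [("S", "VP", 0.1), ("VP", "VB", 0.3), ("NP", "NN", 0.8)]
  SYMBOLS = {"S", "VP", "VB", "NP", "NN"}

  closure = {(X, X): 1.0 for X in SYMBOLS}       # empty chain: score 1.0
  for X, Y, p in UNARIES:                        # single unary steps
      closure[X, Y] = max(closure.get((X, Y), 0.0), p)

  changed = True
  while changed:                                 # relax chains X => Y => Z
      changed = False
      for (X, Y), p1 in list(closure.items()):
          for (Y2, Z), p2 in list(closure.items()):
              if Y == Y2 and p1 * p2 > closure.get((X, Z), 0.0):
                  closure[X, Z] = p1 * p2
                  changed = True

  print(closure["S", "VB"])    # ~0.03: best chain S -> VP -> VB

The backpointer for each closed pair would record the intermediate symbol, which is what lets the parser reconstruct the unary chains afterwards.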

Memory
- How much memory does this require?
  - Have to store the score cache
  - Cache size: |symbols| * n² doubles
  - For the plain treebank grammar: |symbols| ~ 20K, n = 40, double ~ 8 bytes → ~256 MB
  - Big, but workable.
- Pruning: Beams
  - score[X][i][j] can get too large (when?)
  - Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i,j]
- Pruning: Coarse-to-Fine
  - Use a smaller grammar to rule out most X[i,j]
  - Much more on this later…
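A quick arithmetic check of the cache-size estimate above:

  symbols, n, bytes_per_double = 20000, 40, 8
  print(symbols * n * n * bytes_per_double / 1e6, "MB")   # 256.0 MB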

Time: Theory
- How much time will it take to parse?
  - For each diff (≤ n)
    - For each i (≤ n)
      - For each rule X → Y Z
        - For each split point k: do constant work
- Total time: |rules| * n³
- Something like 5 sec for an unoptimized parse of a 20-word sentence

Time: Practice
- Parsing with the vanilla treebank grammar (~20K rules, not an optimized parser): observed exponent ≈ 3.6
- Why is it worse in practice?
  - Longer sentences “unlock” more of the grammar
  - All kinds of systems issues don’t scale

Same-Span Reachability
(Diagram of which symbols can be reached from which over the same span; categories shown include ADJP, ADVP, FRAG, INTJ, NP, PP, PRN, QP, S, SBAR, UCP, VP, WHNP, TOP, LST, CONJP, WHADJP, WHADVP, WHPP, NX, NAC, SBARQ, SINV, RRC, SQ, X, PRT)

Rule State Reachability
- Many states are more likely to match larger spans!
- Example: the state NP CC • — the NP must cover 0 to n-1 and the CC n-1 to n: 1 alignment
- Example: the state NP CC • NP — the remaining NP can still take any length: n alignments

Agenda-Based Parsing
- Agenda-based parsing is like graph search (but over a hypergraph)
- Concepts:
  - Numbering: we number fenceposts between words
  - “Edges” or items: spans with labels, e.g. PP[3,5], represent the sets of trees over those words rooted at that label (cf. search states)
  - A chart: records edges we’ve expanded (cf. closed set)
  - An agenda: a queue which holds edges (cf. a fringe or open set)
(Running example: “critics write reviews with computers”, with a PP over [3,5])

Word Items
- Building an item for the first time is called discovery. Items go into the agenda on discovery.
- To initialize, we discover all word items (with score 1.0).
  AGENDA: critics[0,1], write[1,2], reviews[2,3], with[3,4], computers[4,5]
  CHART: [EMPTY]

Unary Projection
- When we pop a word item, the lexicon tells us the tag item successors (and scores), which go on the agenda:
  critics[0,1] → NNS[0,1], write[1,2] → VBP[1,2], reviews[2,3] → NNS[2,3], with[3,4] → IN[3,4], computers[4,5] → NNS[4,5]

Item Successors
- When we pop items off of the agenda:
  - Graph successors: unary projections (NNS → critics, NP → NNS)
  - Hypergraph successors: combine with items already in our chart
    - Y[i,j] with X → Y forms X[i,j]
    - Y[i,j] and Z[j,k] with X → Y Z form X[i,k]
  - Enqueue / promote resulting items (if not in chart already)
  - Record backtraces as appropriate
  - Stick the popped edge in the chart (closed set)
- Queries a chart must support (see the sketch below):
  - Is edge X:[i,j] in the chart? (With what score?)
  - What edges with label Y end at position j?
  - What edges with label Z start at position i?
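Here is a sketch of a chart structure that supports those three queries, indexing edges by (label, end position) and (label, start position); the class and method names are hypothetical, not the lecture's code.

  from collections import defaultdict

  class Chart:
      def __init__(self):
          self.score = {}                       # (X, i, j) -> best score so far
          self.ending_at = defaultdict(list)    # (Y, j) -> edges (Y, i, j)
          self.starting_at = defaultdict(list)  # (Z, i) -> edges (Z, i, j)

      def add(self, X, i, j, score):
          if (X, i, j) not in self.score:
              self.ending_at[X, j].append((X, i, j))
              self.starting_at[X, i].append((X, i, j))
          self.score[X, i, j] = score

      def combine_right(self, Y, i, j, rules):
          """Hypergraph successors of Y[i,j] used as the left child of X -> Y Z."""
          return [(X, i, k, p * self.score[Y, i, j] * self.score[Z, j, k])
                  for X, Yr, Z, p in rules if Yr == Y
                  for (_, _, k) in self.starting_at[Z, j]]

  chart = Chart()
  chart.add("NP", 0, 1, 0.5)
  chart.add("VP", 1, 3, 0.2)
  print(chart.combine_right("NP", 0, 1, [("S", "NP", "VP", 0.9)]))  # [('S', 0, 3, ~0.09)]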

An Example
(Worked trace on “critics write reviews with computers”: the tag items NNS[0,1], VBP[1,2], NNS[2,3], IN[3,4], NNS[4,5] project to NP[0,1], NP[2,3], NP[4,5]; combinations then yield VP[1,2], S[0,2], PP[3,5], VP[1,3], ROOT[0,2], S[0,3], VP[1,5], NP[2,5], ROOT[0,3], S[0,5], and finally ROOT[0,5].)

Empty Elements
- Sometimes we want to posit nodes in a parse tree that don’t contain any pronounced words:
  “I want you to parse this sentence” vs. “I want [ ] to parse this sentence”
- These are easy to add to a chart parser!
  - For each position i, add the “word” edge ε:[i,i]
  - Add rules like NP → ε to the grammar
  - That’s it!

UCS / A*
- With weighted edges, order matters
  - Must expand the optimal parse from the bottom up (subparses first)
  - CKY does this by processing smaller spans before larger ones
  - UCS pops items off the agenda in order of decreasing Viterbi score
  - A* search is also well defined
- You can also speed up the search without sacrificing optimality
  - Can select which items to process first
  - Can do so with any “figure of merit” [Charniak 98]
  - If your figure-of-merit is a valid A* heuristic, no loss of optimality [Klein and Manning 03]

(Speech) Lattices
- There was nothing magical about words spanning exactly one position.
- When working with speech, we generally don’t know how many words there are, or where they break.
- We can represent the possibilities as a lattice and parse these just as easily.
(Example lattice: alternative paths through “I / Ivan / saw / ’ve / awe / of / van / a / an / eyes”)

Treebank PCFGs [Charniak 96]
- Use PCFGs for broad-coverage parsing
- Can take a grammar right off the trees (doesn’t work well):
  ROOT → S 1, S → NP VP . 1, NP → PRP 1, VP → VBD ADJP 1, …
  Baseline model F1: 72.0

Conditional Independence?
- Not every NP expansion can fill every NP slot
- A grammar with symbols like “NP” won’t be context-free
- Statistically, conditional independence is too strong

Non-Independence
- Independence assumptions are often too strong.
- Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
- Also: the subject and object expansions are correlated!
(Chart comparing expansion distributions for all NPs, NPs under S, and NPs under VP)

Grammar Refinement  Example: PP attachment

Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols, e.g. parent annotation [Johnson 98] or marking possessive NPs.
- What are the most useful “features” to encode? (A parent-annotation sketch follows below.)
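As a concrete illustration of parent annotation, here is a small sketch over tuple-encoded trees; the tree encoding and function name are assumptions for the example, not the lecture's code.

  def parent_annotate(tree, parent="ROOT"):
      """Mark every phrasal node with its parent's label, e.g. NP under S -> NP^S."""
      label, children = tree
      if isinstance(children, str):                 # pre-terminal over a word: leave as-is
          return (label, children)
      new_label = "%s^%s" % (label, parent)
      return (new_label, [parent_annotate(c, label) for c in children])

  t = ("S", [("NP", [("PRP", "He")]),
             ("VP", [("VBD", "was"), ("ADJP", [("JJ", "right")])])])
  print(parent_annotate(t))
  # ('S^ROOT', [('NP^S', [('PRP', 'He')]), ('VP^S', [('VBD', 'was'), ('ADJP^VP', [('JJ', 'right')])])])

Reading a grammar off the annotated trees then gives separate statistics for NP^S (subjects) and NP^VP (objects), which is exactly the dependency the previous slide showed.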

Grammar Refinement
- Structure annotation [Johnson ’98, Klein & Manning ’03]
- Lexicalization [Collins ’99, Charniak ’00]
- Latent variables [Matsuzaki et al. ’05, Petrov et al. ’06]

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Structural annotation

Typical Experimental Setup
- Corpus: Penn Treebank, WSJ
  - Training: sections 02-21
  - Development: section 22 (here, first 20 files)
  - Test: section 23
- Accuracy – F1: harmonic mean of per-node labeled precision and recall.
- Here, also size: number of symbols in the grammar.
  - Passive / complete symbols: NP, NP^S
  - Active / incomplete symbols: NP → NP CC •
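A simplified sketch of the labeled-bracket F1 described here, comparing multisets of (label, start, end) constituents; the spans are toy data, and real evalb applies additional conventions this sketch ignores.

  from collections import Counter

  def labeled_f1(gold_spans, guess_spans):
      gold, guess = Counter(gold_spans), Counter(guess_spans)
      matched = sum((gold & guess).values())        # labeled brackets in both trees
      precision = matched / sum(guess.values())
      recall = matched / sum(gold.values())
      return 2 * precision * recall / (precision + recall)

  gold  = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
  guess = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
  print(round(labeled_f1(gold, guess), 3))          # 0.75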

Vertical Markovization
- Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation).
(Figures: example annotated symbols and accuracy/size results for order 1 vs. order 2)

Horizontal Markovization
- Horizontal Markov order: how many already-generated sibling nodes the binarized intermediate symbols remember.
(Figures: example binarized symbols and accuracy/size results for order 1 vs. order ∞)
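A hedged sketch of horizontal markovization applied during binarization: the intermediate symbol keeps only the last h generated siblings, so intermediates collapse and the grammar shrinks; the function and symbol naming are illustrative, not the lecture's implementation.

  def binarize_markov(parent, children, h=1):
      """Binarize X -> Y1 ... Yk, remembering only the last h generated siblings."""
      children = list(children)
      if len(children) <= 2:
          return [(parent, tuple(children))]

      def inter(i):   # intermediate symbol after generating children[:i]
          seen = children[:i] if h is None else children[:i][-h:]
          return "%s|<%s>" % (parent, " ".join(seen))

      rules = [(inter(2), (children[0], children[1]))]
      for i in range(2, len(children) - 1):
          rules.append((inter(i + 1), (inter(i), children[i])))
      rules.append((parent, (inter(len(children) - 1), children[-1])))
      return rules

  for rule in binarize_markov("NP", ["DT", "JJ", "JJ", "NN"], h=1):
      print(rule)
  # ('NP|<JJ>', ('DT', 'JJ')), ('NP|<JJ>', ('NP|<JJ>', 'JJ')), ('NP', ('NP|<JJ>', 'NN'))

With h=1 the two JJ-generating intermediates share the symbol NP|<JJ>; with h=None (order ∞) they stay distinct, reproducing the full dotted symbols from the CNF slide.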

Vertical and Horizontal
- Examples:
  - Raw treebank: v=1, h=∞
  - Johnson 98: v=2, h=∞
  - Collins 99: v=2, h=2
  - Best F1: v=3, h=2v
(Table: model vs. F1 vs. grammar size; the base setting used from here on is v=h=2v)

Unary Splits
- Problem: unary rewrites are used to transmute categories so that a high-probability rule can be used.
- Solution: mark unary rewrite sites with -U.
(Table: F1 and grammar size for Base vs. UNARY)

Tag Splits
- Problem: treebank tags are too coarse.
- Example: sentential, PP, and other prepositions are all marked IN.
- Partial solution: subdivide the IN tag.
(Table: F1 and grammar size for Previous vs. SPLIT-IN)

Other Tag Splits
- UNARY-DT: mark demonstratives as DT^U (“the X” vs. “those”)
- UNARY-RB: mark phrasal adverbs as RB^U (“quickly” vs. “very”)
- TAG-PA: mark tags with non-canonical parents (“not” is an RB^VP)
- SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]
- SPLIT-CC: separate “but” and “&” from other conjunctions
- SPLIT-%: “%” gets its own tag.
(Table: cumulative F1 and grammar size after each split)

A Fully Annotated (Unlex) Tree

Some Test Set Results
(Table: LP, LR, F1, CB, and 0 CB for Magerman, Collins, the unlexicalized parser, Charniak, and Collins)
- Beats “first generation” lexicalized parsers.
- Lots of room to improve – more complex models next.

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Parent annotation [Johnson ’98]

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Parent annotation [Johnson ’98]
  - Head lexicalization [Collins ’99, Charniak ’00]

The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Parent annotation [Johnson ’98]
  - Head lexicalization [Collins ’99, Charniak ’00]
  - Automatic clustering?

Manual Annotation
- Manually split categories:
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
- Advantages:
  - Fairly compact grammar
  - Linguistic motivations
- Disadvantages:
  - Performance leveled out
  - Manually annotated
- Results: Naïve Treebank Grammar F1 72.6; Klein & Manning ’03 F1 86.3

Automatic Annotation Induction
- Advantages:
  - Automatically learned: label all nodes with latent variables, with the same number k of subcategories for all categories.
- Disadvantages:
  - Grammar gets too large
  - Most categories are oversplit while others are undersplit.
- Results: Klein & Manning ’03 F1 86.3; Matsuzaki et al. ’05 F1 86.7

Learning Latent Annotations
- EM algorithm (just like Forward-Backward for HMMs):
  - Brackets are known
  - Base categories are known
  - Only induce subcategories
(Figure: a tree over “He was right.” with latent nodes X1–X7 and forward/backward passes)

Refinement of the DT tag
(Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4)

Hierarchical refinement

Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, then roll back the splits which were least useful

Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)
- No loss in accuracy when 50% of the splits are reversed.

Adaptive Splitting Results
- Previous: F1 88.4
- With 50% merging: F1 89.5

Number of Phrasal Subcategories

Number of Lexical Subcategories

Final Results
(Table: F1 on sentences ≤ 40 words and on all sentences for Klein & Manning ’03, Matsuzaki et al. ’05, Collins ’99, Charniak & Johnson ’05, and Petrov et al.)

Learned Splits
- Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-2: J., E., L.
  NNP-1: Bush, Noriega, Peters
  NNP-15: New, San, Wall
  NNP-3: York, Francisco, Street
- Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him

Learned Splits
- Relative adverbs (RBR):
  RBR-0: further, lower, higher
  RBR-1: more, less, More
  RBR-2: earlier, Earlier, later
- Cardinal numbers (CD):
  CD-7: one, two, Three
  CD-11: million, billion, trillion

Exploiting Substructure
- Each edge records all the ways it was built (locally)
- Can recursively extract trees
- A chart may represent too many parses to enumerate (how many?)
(Figure: packed chart over “new art critics write reviews with computers”)

Order Independence
- A nice property: it doesn’t matter what policy we use to order the agenda (FIFO, LIFO, random).
- Why? Invariant: before popping an edge, any edge X[i,j] that can be directly built from chart edges and a single grammar rule is either in the chart or in the agenda.
- Convince yourselves this invariant holds!
- This will not be true once we get to weighted parsers.

Efficient CKY
- Lots of tricks make CKY efficient
- Most of them are little engineering details:
  - E.g., first choose k, then enumerate the Y:[i,k] which are non-zero, then loop through rules by left child.
  - The optimal layout of the dynamic program depends on the grammar, the input, even system details.
- Another kind is more critical:
  - Many X:[i,j] can be suppressed on the basis of the input string
  - We’ll see this later as figures-of-merit or A* heuristics

Dark Ambiguities
- Dark ambiguities: most analyses are shockingly bad (meaning, they don’t have an interpretation you can get your mind around)
  - Example: the bad analysis shown corresponds to the correct parse of “This will panic buyers!”
- Unknown words and new usages
- Solution: we need mechanisms to focus attention on the best analyses; probabilistic techniques do this

The Parsing Problem
(Figure: a chart over “critics write reviews with computers” showing the competing NP and VP attachments of the PP)

Non-Independence II
- Who cares?
  - NB, HMMs, all make false assumptions!
  - For generation, the consequences would be obvious.
  - For parsing, does it impact accuracy?
- Symptoms of overly strong assumptions:
  - Rewrites get used where they don’t belong (in the PTB, the construction shown is for possessives).
  - Rewrites get used too often or too rarely.

Lexicalization
- Lexical heads are important for certain classes of ambiguities (e.g., PP attachment).
- Lexicalizing the grammar creates a much larger grammar (cf. next week):
  - Sophisticated smoothing needed
  - Smarter parsing algorithms needed
  - More data needed
- How necessary is lexicalization?
  - Bilexical vs. monolexical selection
  - Closed vs. open class lexicalization