Statistical NLP, Spring 2010
Lecture 14: PCFGs
Dan Klein – UC Berkeley
Treebank PCFGs

Use PCFGs for broad-coverage parsing. You can take a grammar right off the trees (though this alone doesn't work well):

    ROOT → S          1
    S → NP VP .       1
    NP → PRP          1
    VP → VBD ADJP     1
    ...

Model                  | F1
Baseline [Charniak 96] | 72.0
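As a concrete illustration, here is a minimal sketch of reading off such a grammar by relative frequency. It assumes NLTK is installed and its bundled 10% sample of the Penn Treebank has been downloaded (nltk.download('treebank')):

    from collections import Counter
    from nltk.corpus import treebank

    # Count every local tree (production) in the treebank.
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank.parsed_sents():
        for prod in tree.productions():
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1

    # Relative-frequency (maximum-likelihood) estimate:
    # P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
    pcfg = {rule: n / lhs_counts[rule.lhs()] for rule, n in rule_counts.items()}

    # The five most probable expansions of NP:
    for rule in sorted((r for r in pcfg if str(r.lhs()) == 'NP'),
                       key=pcfg.get, reverse=True)[:5]:
        print(rule, round(pcfg[rule], 3))

This is exactly the "grammar right off the trees" baseline: easy to build, but it tops out around 72 F1.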
Conditional Independence?

Not every NP expansion can fill every NP slot. A grammar with symbols like "NP" won't be context-free: statistically, the conditional independence it assumes is too strong.
Non-Independence

Independence assumptions are often too strong. Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects). Also, the subject and object expansions are correlated!

[Charts: NP expansion distributions for all NPs, NPs under S, and NPs under VP]
Grammar Refinement Example: PP attachment
Grammar Refinement

- Structure annotation [Johnson '98, Klein & Manning '03]
- Lexicalization [Collins '99, Charniak '00]
- Latent variables [Matsuzaki et al. '05, Petrov et al. '06]
The Game of Designing a Grammar

Annotation refines base treebank symbols to improve the statistical fit of the grammar. First up: structural annotation.
Typical Experimental Setup

Corpus: Penn Treebank, WSJ.
    Training:     sections 02-21
    Development:  section 22 (here, first 20 files)
    Test:         section 23

Accuracy – F1: harmonic mean of per-node labeled precision and recall. Here, also size: the number of symbols in the grammar.
    Passive / complete symbols: NP, NP^S
    Active / incomplete symbols: NP → NP CC ·
Vertical Markovization

Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation).

[Trees and chart: order-1 vs. order-2 vertical Markovization]
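A minimal sketch of the order-2 case (parent annotation), using NLTK's tree class; leaving preterminal POS tags unannotated is one common choice made here:

    from nltk import Tree

    def parent_annotate(tree, parent='ROOT'):
        # A word: nothing to annotate.
        if isinstance(tree, str):
            return tree
        label = tree.label()
        children = [parent_annotate(child, parent=label) for child in tree]
        # A preterminal (POS tag over a word): leave unannotated.
        if all(isinstance(child, str) for child in tree):
            return Tree(label, children)
        # A phrasal node: remember the immediate parent's base category.
        return Tree(f'{label}^{parent}', children)

    t = Tree.fromstring('(S (NP (PRP He)) (VP (VBD was) (ADJP (JJ right))))')
    print(parent_annotate(t))
    # (S^ROOT (NP^S (PRP He)) (VP^S (VBD was) (ADJP^VP (JJ right))))

Higher vertical orders just thread more ancestors through the same recursion (e.g., NP^VP^S for order 3).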
Horizontal Markovization

Horizontal Markov order: when rules are binarized, the intermediate symbols remember only the last k sibling children already generated, rather than the whole rule history.

[Trees and chart: order-1 vs. order-∞ horizontal Markovization]
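A minimal sketch of horizontal Markovization as it shows up in rule binarization: with small k, different full histories collapse into the same intermediate state.

    def binarize(lhs, rhs, k=1):
        # Right-binarize lhs -> rhs, keeping only the last k already-generated
        # children in each intermediate symbol (horizontal Markov order k).
        if len(rhs) <= 2:
            return [(lhs, tuple(rhs))]
        rules, prev = [], lhs
        for i in range(len(rhs) - 2):
            remembered = rhs[max(0, i + 1 - k):i + 1]
            mid = f"{lhs}|<{','.join(remembered)}>"
            rules.append((prev, (rhs[i], mid)))
            prev = mid
        rules.append((prev, tuple(rhs[-2:])))
        return rules

    for rule in binarize('VP', ['VBD', 'NP', 'NP', 'PP'], k=1):
        print(rule)
    # ('VP',       ('VBD', 'VP|<VBD>'))
    # ('VP|<VBD>', ('NP',  'VP|<NP>'))
    # ('VP|<NP>',  ('NP',  'PP'))

With k=1, the state VP|<NP> is shared by every VP rule whose last generated child was an NP, which is precisely how Markovization pools statistics across rules. (NLTK's Tree.chomsky_normal_form offers horzMarkov and vertMarkov parameters that perform both transformations on treebank trees.)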
Unary Splits

Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
Solution: mark unary rewrite sites with -U.

Annotation | F1   | Size
Base       | 77.8 | 7.5K
UNARY      | 78.3 | 8.0K
Tag Splits

Problem: treebank tags are too coarse. Example: sentential, PP, and other prepositions are all marked IN.
Partial solution: subdivide the IN tag.

Annotation | F1   | Size
Previous   | 78.3 | 8.0K
SPLIT-IN   | 80.3 | 8.1K
Other Tag Splits

Annotation                                                        | F1   | Size
UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")       | 80.4 | 8.1K
UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")     | 80.5 | 8.1K
TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)  | 81.2 | 8.5K
SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]       | 81.6 | 9.0K
SPLIT-CC: separate "but" and "&" from other conjunctions          | 81.7 | 9.1K
SPLIT-%: "%" gets its own tag                                     | 81.8 | 9.3K
A Fully Annotated (Unlex) Tree
Some Test Set Results

Parser        | LP   | LR   | F1   | CB   | 0 CB
Magerman 95   | 84.9 | 84.6 | 84.7 | 1.26 | 56.6
Collins 96    | 86.3 | 85.8 | 86.0 | 1.14 | 59.9
Unlexicalized | 86.9 | 85.7 | 86.3 | 1.10 | 60.3
Charniak 97   | 87.4 | 87.5 | 87.4 | 1.00 | 62.1
Collins 99    | 88.7 | 88.6 | 88.6 | 0.90 | 67.1

Beats "first generation" lexicalized parsers. Lots of room to improve – more complex models next.
The Game of Designing a Grammar

Annotation refines base treebank symbols to improve the statistical fit of the grammar:
- Structural annotation [Johnson '98, Klein & Manning '03]
- Head lexicalization [Collins '99, Charniak '00]
Problems with PCFGs

If we do no annotation, these trees differ only in one rule: VP → VP PP vs. NP → NP PP. The parse will go one way or the other, regardless of the words. We addressed this in one way with unlexicalized grammars (how?). Lexicalization allows us to be sensitive to specific words.
Problems with PCFGs

What's different between the basic PCFG scores here? What (lexical) correlations need to be scored?
Lexicalized Trees

Add "head words" to each phrasal node. Syntactic vs. semantic heads; headship is not annotated in (most) treebanks. Usually use head rules, e.g.:

NP: take the leftmost NP; else the rightmost N*; else the rightmost JJ; else the right child.
VP: take the leftmost VB*; else the leftmost VP; else the left child.
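A hedged sketch of head finding with exactly the two rule lists above (this is the flavor of Collins-style head tables, not his full table; the labels and priorities here are illustrative):

    from nltk import Tree

    # For each parent category: (scan direction, label prefix), tried in
    # priority order; the empty prefix matches anything (the default child).
    HEAD_RULES = {
        'NP': [('leftmost', 'NP'), ('rightmost', 'N'),
               ('rightmost', 'JJ'), ('rightmost', '')],
        'VP': [('leftmost', 'VB'), ('leftmost', 'VP'), ('leftmost', '')],
    }

    def head_child(tree):
        children = list(tree)
        for direction, prefix in HEAD_RULES.get(tree.label(),
                                                [('leftmost', '')]):
            scan = children if direction == 'leftmost' else reversed(children)
            for child in scan:
                if child.label().startswith(prefix):
                    return child
        return children[0]

    t = Tree.fromstring('(VP (VBD saw) (NP (PRP her)) (NP (NN today)))')
    print(head_child(t))   # (VBD saw)

Applying head_child bottom-up decorates every phrasal node with its head word.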
Lexicalized PCFGs?

Problem: we now have to estimate probabilities like P(VP[saw] → VBD[saw] NP[her] NP[today] PP[on]). We are never going to get these atomically off of a treebank. Solution: break the derivation up into smaller steps.
Lexical Derivation Steps

A derivation of a local tree [Collins 99]:
1. Choose a head tag and word.
2. Choose a complement bag.
3. Generate children (incl. adjuncts).
4. Recursively derive children.
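Schematically, a simplified version of this factorization (Collins' actual models also condition on distance, adjacency, and subcategorization frames) looks like:

    P(\mathrm{VP}[saw] \to \mathrm{VBD}[saw]\,\mathrm{NP}[her]\,\mathrm{NP}[today]\,\mathrm{PP}[on])
      \approx P_h(\mathrm{VBD} \mid \mathrm{VP}, saw)
      \cdot \prod_i P_L(L_i[l_i] \mid \mathrm{VP}, \mathrm{VBD}, saw)
      \cdot \prod_j P_R(R_j[r_j] \mid \mathrm{VP}, \mathrm{VBD}, saw)

where the left and right dependents L_i and R_j are generated outward from the head until a STOP symbol is produced on each side. Each factor is small enough to estimate (with smoothing) from treebank counts.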
Lexicalized CKY

    bestScore(X, i, j, h):
      if j == i + 1:
        return tagScore(X, s[i])
      else:
        return max over k, h', X -> Y Z of:
          score(X[h] -> Y[h] Z[h'])  * bestScore(Y, i, k, h)  * bestScore(Z, k, j, h'),
          score(X[h] -> Y[h'] Z[h])  * bestScore(Y, i, k, h') * bestScore(Z, k, j, h)

(The two cases cover the head coming from the left or the right child. E.g., combining (VP → VBD ·)[saw] with NP[her] yields (VP → VBD ... NP ·)[saw].)
Pruning with Beams

The Collins parser prunes with per-cell beams [Collins 99]. Essentially, run the O(n^5) lexicalized CKY, but remember only a few hypotheses for each span. If we keep K hypotheses at each span, then we do at most O(nK^2) work per span (why?), which keeps things more or less cubic. Also: certain spans are forbidden entirely on the basis of punctuation (crucial for speed).
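A minimal sketch of a per-cell beam (toy scores; in the real parser, items carry full lexicalized signatures):

    import heapq

    def prune_to_beam(cell, K=5):
        # Keep only the K highest-scoring items in one chart cell,
        # i.e. one span (i, j); cell maps (category, head) -> log-prob.
        return dict(heapq.nlargest(K, cell.items(), key=lambda kv: kv[1]))

    cell = {('NP', 1): -2.3, ('NP', 4): -7.1, ('VP', 2): -3.0,
            ('S', 2): -9.5, ('PP', 3): -4.2}
    print(prune_to_beam(cell, K=3))
    # {('NP', 1): -2.3, ('VP', 2): -3.0, ('PP', 3): -4.2}

The answer to the "(why?)": for each of the O(n) split points of a span, at most K items on the left pair with K on the right, giving O(nK^2) per span instead of a search over all heads.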
Pruning with a PCFG

The Charniak parser prunes using a two-pass approach [Charniak 97+]:
- First, parse with the base grammar. For each X:[i,j], calculate P(X | i, j, s). This isn't trivial, and there are clever speed-ups.
- Second, do the full O(n^5) CKY, skipping any X:[i,j] which had a low (say, < 0.0001) posterior. This avoids almost all work in the second phase!

Charniak et al. 06: can use more passes. Petrov et al. 07: can use many more passes.
Pruning with A*

You can also speed up the search without sacrificing optimality. For agenda-based parsers, you can select which items to process first, and can do so with any "figure of merit" [Charniak 98]. If your figure of merit is a valid A* heuristic, there is no loss of optimality [Klein and Manning 03].
Projection-Based A*

[Figure: the sentence "Factory payrolls fell in Sept." parsed under three projections – the structural projection (S, VP, NP, PP), the lexical projection (payrolls, fell, in), and the full lexicalized items (S:fell, VP:fell, NP:payrolls, PP:in)]
A* Speedup

Total time is dominated by the calculation of the A* tables in each projection... O(n^3).
Results

Some results:
- Collins 99: 88.6 F1 (generative lexical)
- Charniak and Johnson 05: 89.7 / 91.3 F1 (generative lexical / reranked)
- Petrov et al. 06: 90.7 F1 (generative unlexicalized)
- McClosky et al. 06: 92.1 F1 (gen + rerank + self-train)

However, bilexical counts rarely make a difference (why?). Gildea 01: removing bilexical counts costs < 0.5 F1.
The Game of Designing a Grammar

Annotation refines base treebank symbols to improve the statistical fit of the grammar:
- Structural annotation
- Head lexicalization
- Automatic clustering?
Latent Variable Grammars

[Figure: a parse tree over a sentence, the derivations over refined symbols, and the grammar parameters]
Learning Latent Annotations

EM algorithm:
- Brackets are known
- Base categories are known
- Only induce subcategories

Just like Forward-Backward for HMMs.

[Figure: forward/backward-style scores on a tree over "He was right."]
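For reference, a sketch of the E-step's expected rule counts (standard inside-outside, here restricted to the observed tree because the brackets are fixed; I is the inside score, O the outside score, s the sentence):

    \mathbb{E}[\mathrm{count}(A_x \to B_y\,C_z)]
      = \sum_{\text{nodes } A \to B\,C \text{ with spans } (i,k,j)}
        \frac{O(A_x, i, j)\; P(A_x \to B_y C_z)\; I(B_y, i, k)\; I(C_z, k, j)}{P(s)}

The M-step renormalizes these counts into new rule probabilities, exactly as Forward-Backward does for HMM transitions.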
Refinement of the DT tag

[Chart: the DT tag split into subcategories DT-1, DT-2, DT-3, DT-4]
Hierarchical refinement
Hierarchical Estimation Results

Model                 | F1
Flat training         | 87.3
Hierarchical training | 88.4
Refinement of the "," tag

Splitting all categories equally is wasteful:
Adaptive Splitting

Want to split complex categories more. Idea: split everything, then roll back the splits which were least useful.
Adaptive Splitting Results

Model            | F1
Previous         | 88.4
With 50% merging | 89.5
Number of Phrasal Subcategories
Number of Lexical Subcategories
Learned Splits

Proper nouns (NNP):
NNP-14 | Oct.  | Nov.      | Sept.
NNP-12 | John  | Robert    | James
NNP-2  | J.    | E.        | L.
NNP-1  | Bush  | Noriega   | Peters
NNP-15 | New   | San       | Wall
NNP-3  | York  | Francisco | Street

Personal pronouns (PRP):
PRP-0 | It | He   | I
PRP-1 | it | he   | they
PRP-2 | it | them | him
Learned Splits

Relative adverbs (RBR):
RBR-0 | further | lower   | higher
RBR-1 | more    | less    | More
RBR-2 | earlier | Earlier | later

Cardinal numbers (CD):
CD-7  | one     | two     | Three
CD-4  | 1989    | 1990    | 1988
CD-11 | million | billion | trillion
CD-0  | 1       | 50      | 100
CD-3  | 1       | 30      | 31
CD-9  | 78      | 58      | 34
Coarse-to-Fine Inference

Example: PP attachment.
Prune?

For each chart item X[i,j], compute the posterior probability that X spans (i, j) given the sentence – e.g., consider the span 5 to 12 – and prune the item if this posterior falls below a threshold:

    P(X spans (i, j) | s) = inside(X, i, j) · outside(X, i, j) / P(s)

[Figure: coarse chart items ... QP NP VP ... and the refined items they license]
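A minimal sketch of the pruning test, with toy inside/outside tables (the real parser fills these with a coarse-pass chart):

    def keep(item, inside, outside, sent_prob, threshold=1e-4):
        # Keep X:[i,j] only if P(X spans (i, j) | s) is above threshold.
        posterior = inside[item] * outside[item] / sent_prob
        return posterior >= threshold

    inside  = {('NP', 5, 12): 1.2e-9}   # made-up numbers, for illustration
    outside = {('NP', 5, 12): 2.0e-4}
    print(keep(('NP', 5, 12), inside, outside, sent_prob=1.0e-11))
    # True: posterior = 0.024 >= 1e-4

In practice these scores are kept in log space (or as max-marginals) to avoid underflow.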
44
Bracket Posteriors
Hierarchical Pruning

    coarse:         ... QP NP VP ...
    split in two:   ... QP1 QP2 NP1 NP2 VP1 VP2 ...
    split in four:  ... QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 ...
    split in eight: ...
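Each refined symbol projects deterministically onto its ancestor in the split hierarchy, so posteriors computed with a coarser grammar can prune all refinements of a symbol at once. A sketch, assuming subcategories are named by their split path (one character per binary split):

    def project(symbol, level):
        # NP-011 at level 2 -> NP-01; at level 0 -> the unsplit NP.
        base, _, path = symbol.partition('-')
        path = path[:level]
        return f'{base}-{path}' if path else base

    for level in range(4):
        print(level, project('NP-011', level))
    # 0 NP
    # 1 NP-0
    # 2 NP-01
    # 3 NP-011

Parsing then proceeds level by level: compute posteriors at one level, prune, and refine only the survivors.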
Final Results (Accuracy)

     Parser                               | F1 (≤ 40 words) | F1 (all)
ENG  Charniak & Johnson '05 (generative)  | 90.1            | 89.6
ENG  Split / Merge                        | 90.6            | 90.1
GER  Dubey '05                            | 76.3            | –
GER  Split / Merge                        | 80.8            | 80.1
CHN  Chiang et al. '02                    | 80.0            | 76.6
CHN  Split / Merge                        | 86.3            | 83.4

Still higher numbers from reranking / self-training methods.
Bracket Posteriors (after G0)
Bracket Posteriors (after G1)
Bracket Posteriors (Final)
Bracket Posteriors (Best Tree)
Learning / Parsing Times

[Chart: fraction of total parsing time spent in each coarse-to-fine pass, from X-Bar = G0 through the refined grammars G1–G6 = G: 60%, 12%, 7%, 6%, 5%, 4%]
52
1621 min 111 min 35 min 15 min (no search error)
Breaking Up the Symbols

We can relax independence assumptions by encoding dependencies into the PCFG symbols. What are the most useful "features" to encode?
- Parent annotation [Johnson 98]
- Marking possessive NPs
Non-Independence II

Who cares? NB, HMMs – all make false assumptions! For generation, the consequences would be obvious; for parsing, does it impact accuracy? Symptoms of overly strong assumptions:
- Rewrites get used where they don't belong.
- Rewrites get used too often or too rarely.

In the PTB, this construction is for possessives.
Lexicalization

Lexical heads are important for certain classes of ambiguities (e.g., PP attachment). But lexicalizing the grammar creates a much larger grammar (cf. next week):
- Sophisticated smoothing needed
- Smarter parsing algorithms needed
- More data needed

How necessary is lexicalization?
- Bilexical vs. monolexical selection
- Closed- vs. open-class lexicalization
Lexical Derivation Steps

Derivation of a local tree [simplified Charniak 97]: starting from VP[saw], generate the head VBD[saw], giving (VP → VBD ·)[saw]; then generate the dependents NP[her], NP[today], and PP[on] one at a time, each conditioned on the state so far: (VP → VBD ... NP ·)[saw], ..., (VP → VBD ... PP ·)[saw]. Still have to smooth with mono- and non-lexical backoffs.
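The backoff the slide mentions is typically a linear interpolation of progressively less-conditioned estimates, schematically (the λ's are tuned on held-out data and sum to one):

    \hat{P}(d \mid \mathrm{VP}, \mathrm{VBD}, saw)
      = \lambda_1\, P(d \mid \mathrm{VP}, \mathrm{VBD}, saw)
      + \lambda_2\, P(d \mid \mathrm{VP}, \mathrm{VBD})
      + \lambda_3\, P(d \mid \mathrm{VP})

where d is the next dependent to generate. The first term is fully lexical; the last is the plain unlexicalized estimate.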
Manual Annotation

Manually split categories:
- NP: subject vs. object
- DT: determiners vs. demonstratives
- IN: sentential vs. prepositional

Advantages: fairly compact grammar; linguistic motivations.
Disadvantages: performance leveled out; manually annotated.

Model                  | F1
Naïve Treebank Grammar | 72.6
Klein & Manning '03    | 86.3
Automatic Annotation Induction

Label all nodes with latent variables; same number k of subcategories for all categories.

Advantages: automatically learned.
Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit.

Model                | F1
Klein & Manning '03  | 86.3
Matsuzaki et al. '05 | 86.7
Adaptive Splitting

Evaluate the loss in likelihood from removing each split:

    loss = (data likelihood with split reversed) / (data likelihood with split)

There is no loss in accuracy when 50% of the splits are reversed.
Final Results (Accuracy)

     Parser                               | F1 (≤ 40 words) | F1 (all)
ENG  Charniak & Johnson '05 (generative)  | 90.1            | 89.6
ENG  Petrov and Klein '07                 | 90.6            | 90.1
GER  Dubey '05                            | 76.3            | –
GER  Petrov and Klein '07                 | 80.8            | 80.1
CHN  Chiang et al. '02                    | 80.0            | 76.6
CHN  Petrov and Klein '07                 | 86.3            | 83.4
Final Results

Parser                 | F1 (≤ 40 words) | F1 (all words)
Klein & Manning '03    | 86.3            | 85.7
Matsuzaki et al. '05   | 86.7            | 86.1
Collins '99            | 88.6            | 88.2
Charniak & Johnson '05 | 90.1            | 89.6
Petrov et al. '06      | 90.2            | 89.7
Vertical and Horizontal

Examples:
- Raw treebank: v=1, h=∞
- Johnson 98: v=2, h=∞
- Collins 99: v=2, h=2
- Best F1: v=3, h≤2

Model       | F1   | Size
Base: v=h=2 | 77.8 | 7.5K
Problems with PCFGs

Another example of PCFG indifference:
- The left structure is far more common.
- How to model this? It's really structural: "chicken with potatoes with salad".
- Lexical parsers capture this effect, but not by virtue of being lexical.
Naïve Lexicalized Parsing

Can, in principle, use CKY on lexicalized PCFGs: O(Rn^3) time and O(Sn^2) memory. But R = rV^2 and S = sV, so the result is completely impractical (why?).
    Memory: 10K rules × 50K words × (40 words)^2 × 8 bytes ≈ 6 TB

Can modify CKY to exploit lexical sparsity: lexicalized symbols are a base grammar symbol and a pointer into the input sentence, not any arbitrary word. Result: O(rn^5) time, O(sn^3) memory.
    Memory: 10K rules × (40 words)^3 × 8 bytes ≈ 5 GB
Quartic Parsing

Turns out, you can do (a little) better [Eisner 99]: an O(n^4) algorithm. Still prohibitive in practice if not pruned.

[Diagrams: the O(n^5) combination Y[h] + Z[h'] → X[h] over (i, h, k, h', j) vs. Eisner's O(n^4) item, which drops the head index of the non-head child]