Exponential Decay Pruning for Bottom-Up Beam-Search Parsing
Nathan Bodenstab, Brian Roark, Aaron Dunlop, and Keith Hall
April 2010
Talk Outline
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Intro to Syntactic Parsing
Hierarchically cluster and label syntactic word groups (constituents)
Provides structure and meaning
Intro to Syntactic Parsing
Why Parse?
–Machine Translation
  Synchronous Grammars
–Language Understanding
  Semantic Role Labeling
  Word Sense Disambiguation
  Question-Answering
  Document Summarization
–Language Modeling
  Long-distance dependencies
–Because it’s fun
Intro to Syntactic Parsing
What you (usually) need to parse:
–Supervised data: a treebank of sentences with annotated parse structure
  WSJ treebank: 50k sentences
–A binarized Probabilistic Context-Free Grammar induced from the treebank
–A parsing algorithm
Example grammar rules:
–S → NP VP  prob=0.2
–NP → NP NN  prob=0.1
–NP → JJ NN  prob=0.06
–Binarize: VP → PP VB NN  (prob=0.5) becomes VP → PP @VP and @VP → VB NN
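To make the binarization step concrete, here is a minimal sketch (not the authors' code; the @-symbol naming scheme is an assumption) of right-binarizing an n-ary PCFG rule into binary rules:

```python
# A minimal sketch (assumption: @-prefixed intermediate symbols) of
# right-binarizing a PCFG rule. The original probability stays on the
# first binary rule; intermediate rules carry probability 1.0.

def binarize(lhs, rhs, prob):
    """Right-binarize an n-ary rule into a list of binary rules."""
    rules = []
    while len(rhs) > 2:
        mid = "@" + lhs + "_" + "_".join(rhs[1:])  # hypothetical naming scheme
        rules.append((lhs, (rhs[0], mid), prob))
        lhs, rhs, prob = mid, rhs[1:], 1.0         # probability kept on first rule
    rules.append((lhs, tuple(rhs), prob))
    return rules

print(binarize("VP", ["PP", "VB", "NN"], 0.5))
# [('VP', ('PP', '@VP_VB_NN'), 0.5), ('@VP_VB_NN', ('VB', 'NN'), 1.0)]
```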
Parsing Accuracy

                              Non-terminals   Grammar Size   Sec/Sent   F-Score
Baseline                      2,500           64,—           —          —
Parent Annotation (Johnson)   6,000           75,—           —          —
Manual Refinement (Klein)     15,000          —              —          86%
Latent Variable (Petrov)      1,100           4,000,—        —          —
Lexical (Collins, Charniak)   Lots            Implicit       —          89%

Accuracy improvements from grammar refinement:
–Split original non-terminal categories (Subject-NP vs. Object-NP)
–Accuracy at the cost of speed: the solution space becomes impractical to search exhaustively
Berkeley Grammar & Parser
Petrov et al. automatically split non-terminals using latent variables
Example grammar rules:
–S_3 → NP_12 VP_6  prob=0.2
–NP_12 → NP_9 NN_7  prob=0.1
–NN_7 → house  prob=0.06
The Berkeley Coarse-to-Fine parser uses six latent-variable grammars:
–Parse the input sentence once with each grammar
–Posterior probabilities from pass n are used to prune pass n+1
–Must know the mapping between non-terminals of different grammars:
  Grammar(2) { NP_1, NP_6 } → Grammar(3) { NP_2, NP_9, NP_14 }
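A small sketch of the coarse-to-fine pruning idea described above; the split_map contents, the threshold, and the function names are illustrative assumptions, not the Berkeley parser's API:

```python
# Sketch: each pass-n non-terminal maps to the pass-(n+1) symbols it was
# split into, so posteriors from pass n can prune pass n+1.
split_map = {"NP_1": ["NP_2", "NP_9"], "NP_6": ["NP_14"]}  # hypothetical splits

def allowed_fine_symbols(coarse_posteriors, threshold=1e-4):
    """Keep only fine symbols whose coarse parent survived posterior pruning."""
    allowed = set()
    for coarse, posterior in coarse_posteriors.items():
        if posterior >= threshold:
            allowed.update(split_map.get(coarse, []))
    return allowed

print(allowed_fine_symbols({"NP_1": 0.3, "NP_6": 1e-7}))  # {'NP_2', 'NP_9'}
```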
Research Goals
Our research goals:
–Find good solutions very quickly in this LARGE grammar space (not ML)
–Algorithms should be grammar agnostic
–Consider practical implications (speed, memory)
This talk: Exponential Decay Pruning
–Beam-search parsing for efficient search
–Searches the final grammar space directly
–Balances the overhead of targeted exploration (best-first) against the memory and cache benefits of local exploration (CYK)
Parsing Algorithms: CYK
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Parsing Algorithms: CYK
Exhaustive population of all parse trees permitted by the grammar
The dynamic programming algorithm gives the maximum-likelihood solution
Parsing Algorithms: CYK
Fill in cells for SPAN = 1, 2, 3, 4, …
Grammar:
–S → NP VP  (p=0.7)
–NP → NP NP  (p=0.2)
–NP → NP VP  (p=0.1)
–NN → court  (p=0.4)
–VB → court  (p=0.1)
–…
Parsing Algorithms: CYK
Grammar:
–S → NP VP  (p=0.7)
–NP → NP NP  (p=0.2)
–NP → NP VP  (p=0.1)
–NN → court  (p=0.4)
–VB → court  (p=0.1)
–…
N iterations through the grammar at each chart cell to consider all possible midpoints
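A minimal Viterbi CYK sketch over a binarized PCFG, assuming log-probability grammar and lexicon lookups (illustrative, not the authors' implementation):

```python
import math
from collections import defaultdict

# grammar: dict mapping (B, C) -> list of (A, log_prob) binary rules
# lexicon: dict mapping word -> list of (A, log_prob) lexical rules

def cyk(words, grammar, lexicon):
    n = len(words)
    chart = defaultdict(lambda: defaultdict(lambda: -math.inf))
    for i, w in enumerate(words):                    # span = 1: lexical entries
        for nt, lp in lexicon[w]:
            chart[(i, i + 1)][nt] = lp
    for span in range(2, n + 1):                     # span = 2, 3, ..., n
        for start in range(n - span + 1):
            end = start + span
            cell = chart[(start, end)]
            for mid in range(start + 1, end):        # all possible midpoints
                for b, lp_b in chart[(start, mid)].items():
                    for c, lp_c in chart[(mid, end)].items():
                        for a, lp_rule in grammar.get((b, c), []):
                            score = lp_rule + lp_b + lp_c
                            if score > cell[a]:      # keep the Viterbi-best score
                                cell[a] = score
    return chart[(0, n)].get("S", -math.inf)         # log prob of best S parse
```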
Parsing Algorithms: Best-First
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Parsing Algorithms: Best-First
Grammar:
–S → NP VP  (p=0.7)
–VB → court  (p=0.1)
–…
Frontier PQ:
–[try][shooting,defendant]  VP → VB NP  fom=28.1
–[try,shooting][defendant]  VP → VB NP  fom=14.7
–[Juvenile][court]  NP → ADJ NN  fom=13
The Frontier is a priority queue of all potentially buildable entries
Add the best entry from the Frontier to the chart; expand the Frontier with all possible chart + grammar extensions
Parsing Algorithms: Best-First
How do we rank Frontier entries?
–Figure-of-Merit (FOM)
–FOM = Inside (grammar) * Outside (heuristic)
–Caraballo and Charniak, 1997 (C&C)
–Problem: comparing entries that cover different spans
Grammar:
–S → NP VP  (p=0.7)
–VB → court  (p=0.1)
–…
Frontier PQ:
–[try][shooting,defendant]  VP → VB NP  fom=28.1
–[try,shooting][defendant]  VP → VB NP  fom=14.7
–[Juvenile][court]  NP → ADJ NN  fom=13
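A sketch of the agenda loop described on these slides; `fom` and `extensions` are assumed helpers standing in for the FOM model and the chart + grammar expansion step:

```python
import heapq

# Edges are (start, end, non_terminal) tuples. heapq is a min-heap, so FOM
# scores are negated to pop the best-scoring frontier entry first.

def best_first_parse(n, initial_edges, fom, extensions):
    frontier = [(-fom(e), e) for e in initial_edges]
    heapq.heapify(frontier)
    chart = set()
    while frontier:
        _, edge = heapq.heappop(frontier)            # best frontier entry
        if edge in chart:
            continue                                 # already built; skip
        chart.add(edge)
        if edge == (0, n, "S"):
            return chart                             # spanning S found; stop
        for new_edge in extensions(edge, chart):     # chart + grammar extensions
            heapq.heappush(frontier, (-fom(new_edge), new_edge))
    return chart
```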
Parsing Algorithms: Beam-Search
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Parsing Algorithms: Beam-Search
Beam-Search: best of both worlds
CYK-style exhaustive traversal (bottom-up)
At each chart cell:
–Compute the FOM for all possible cell entries
–Rank entries in a (temporary) local priority queue
–Only populate the cell with the n-best entries (beam-width)
Less memory:
–Not storing all cell entries (CYK) nor bad frontier entries (Best-First)
Runs faster:
–Search space is pruned (unlike CYK) and no global priority queue to maintain (unlike Best-First)
Eliminates the problem of globally comparing cell entries
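The per-cell pruning step can be sketched as follows (illustrative only; a real implementation would also track backpointers and midpoints):

```python
import heapq

# candidates: iterable of (non_terminal, inside_score) pairs for one chart
# cell; fom scores each candidate; only the beam_width best survive.

def populate_cell(candidates, fom, beam_width):
    ranked = heapq.nlargest(beam_width, candidates, key=fom)  # local n-best
    return dict(ranked)  # the cell keeps only the surviving entries
```

Here `fom` would combine the rule's inside score with the C&C outside estimate, so the local ranking matches the Best-First ordering without a global queue.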
Exponential Decay Pruning
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Exponential Decay Pruning
What is the optimal beam-width per chart cell?
–Common solutions:
  Relative score difference from the highest-ranking entry
  Global maximum number of candidates
Exponential Decay Pruning:
–Adaptive beam-width conditioned on chart-cell information
–How reliable is our Figure-of-Merit per chart cell?
–Plotted rank of the gold entry against span and sentence size:
  The FOM is more reliable for larger spans
  –Less dependent on the outside estimate
  The FOM is less reliable for short sentences
  –Atypical grammatical structure (in WSJ?)
Exponential Decay Pruning
Confidence in the FOM can be modeled with the exponential decay function:
–N_0 = global beam-width maximum
–n = sentence length
–s = span length (number of words covered)
–λ = tuning parameter
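The slide's equation itself did not survive in this transcript, so the formula below is a sketch only: one decay function consistent with the bullets above (the beam shrinks exponentially with span length s, and shrinks more slowly when sentence length n is small), with the exact combination of s and n an assumption rather than the authors' published form:

```latex
% Assumption: a plausible reconstruction, not necessarily the original formula.
\[
  \mathrm{beam}(n, s) \;=\; N_0 \,\exp\!\Bigl(-\lambda\,\frac{s\,n}{s+n}\Bigr)
\]
```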
Results
Intro to Syntactic Parsing
–Why Parse?
Parsing Algorithms
–CYK
–Best-First
–Beam-Search
Exponential Decay Pruning
Results
Results
Wall Street Journal treebank:
–Train: Sections 2-21 (40k sentences)
–Dev: Section 24 (1.3k sentences)
–Test: Section 23 (2.4k sentences)
Berkeley SM6 latent-variable grammar
Figure-of-Merit from Caraballo and Charniak, 1997 (C&C)
Also applied Cell Closing Constraints (Roark and Hollingshead, 2008)
External comparison with the Berkeley Coarse-to-Fine parser using the same grammar
Results: Dev

Algorithm    FOM     Beam-Width  Cell Closing  Sec/Sent  Chart Entries  F-Score
CYK          —       —           —             —         —              —
Best-First   Inside  —           —             —         —              —
Best-First   C&C     —           —             —         —              —
Beam-Search  Inside  Constant    —             —         —              —
Beam-Search  Inside  Decay       —             —         —              —
Beam-Search  C&C     Constant    —             —         —              —
Beam-Search  C&C     Decay       —             —         —              —
Beam-Search  C&C     Constant    Yes           —         —              —
Beam-Search  C&C     Decay       Yes           —         —              —

The Figure-of-Merit makes a big difference
Fast solution, but significant accuracy degradation
Results: Dev
Using the inside probability for the FOM:
–95% speed reduction with Beam-Search over Best-First
–Exponential Decay adds an additional 47% speed reduction
Results: Dev
Using the C&C FOM:
–Beam-Search is faster (57%) and more accurate than Best-First
–Exponential Decay adds an additional 40% speed reduction
Results: Test

Algorithm     FOM   Beam-Width  Cell Closing  Sec/Sent  F-Score
CYK           —     —           —             —         —
Beam-Search   C&C   Constant    —             —         —
Beam-Search   C&C   Decay       —             —         —
Beam-Search   C&C   Decay       Yes           —         —
Berkeley C2F  —     —           —             —         —

Large relative speed-up from the Decay vs. Constant beam-width
Decay pruning and Cell Closing Constraints are complementary
Same ballpark as Coarse-to-Fine (perhaps a bit faster)
Requires no knowledge of the grammar
Thanks
FOM Details
C&C FOM details:
–FOM(NT) = Outside_left * Inside * Outside_right
–Inside = constituent grammar score for NT
–Outside_left = max { POS forward prob * POS-to-NT transition prob }
–Outside_right = max { NT-to-POS transition prob * POS backward prob }
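In symbols, the bullets above amount to the following sketch (the notation here is assumed: β is the inside score of non-terminal N over words i..j, and fwd/bkwd are POS-level forward/backward probabilities from an HMM-style tagger):

```latex
% Boundary FOM sketch, reconstructed from the verbal description above.
\[
  \mathrm{FOM}(N_{i,j}) \;=\; \mathrm{Out}_L(N,i)\;\beta(N_{i,j})\;\mathrm{Out}_R(N,j)
\]
\[
  \mathrm{Out}_L(N,i) = \max_{t}\,\mathrm{fwd}(t,i)\,P(N \mid t),
  \qquad
  \mathrm{Out}_R(N,j) = \max_{t}\,P(t \mid N)\,\mathrm{bkwd}(t,j{+}1)
\]
```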
Research Goals
–Find good solutions very quickly in this LARGE grammar space (not ML)
–Algorithms should be grammar agnostic
–Consider practical implications (speed, memory)
Current projects towards these goals:
–Better FOM function
  Inside estimate (grammar refinement)
  Outside estimate (participation in a complete parse tree)
–Optimal chart traversal strategy
  Which areas of the search space are most promising?
  Cell Closing Constraints (Roark and Hollingshead, 2008)
–Balance between targeted and exhaustive exploration
  How much “work” should be done exploring the search space around these promising areas?
  Overhead of targeted exploration (best-first) vs. memory and cache benefits of local exploration (CYK)