
1 Improving Neural Parsing by Disentangling Model Combination and Reranking Effects
Daniel Fried*, Mitchell Stern*, and Dan Klein (UC Berkeley)
Currently, the best systems for constituency parsing use neural generative models, in which inference is intractable on their own, to rerank the output of efficient base parsers. But how much of these systems' strong performance is due to the rerankers themselves being strong, and how much is due to subtle combination effects between the two models?

2 Top-down generative models
Example: The man had an idea . → (S (NP The man ) (VP had (NP an idea ) ) . )
The models we'll look at are generative: they jointly produce a sentence and its parse tree using a top-down traversal. The model traverses a linearized representation of the tree and sentence from left to right, choosing each action conditioned on all past actions. We'll look at two recently introduced generative parsing models that share this action space: GLSTM, an LSTM over the action history [Parsing as Language Modeling, Choe and Charniak, 2016], and GRNNG, which builds an explicit compositional representation of the syntax [Recurrent Neural Network Grammars, Dyer et al., 2016]. Inferring a parse for a given sentence is difficult: the model's support is over all trees and sentences, not just trees for the given sentence. We'll see more about this in a little while.
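As a rough illustration (the action names and rendering helper below are our own notation, not either model's actual interface), the linearized action sequence for the example sentence can be thought of as follows:

```python
# Minimal sketch of the top-down action space shared by GRNNG and GLSTM:
# OPEN a nonterminal, GENERATE a word, or CLOSE the most recent open
# constituent. Each action's probability is conditioned on the full history.

actions = [
    ("OPEN", "S"),
    ("OPEN", "NP"), ("GEN", "The"), ("GEN", "man"), ("CLOSE", None),
    ("OPEN", "VP"), ("GEN", "had"),
    ("OPEN", "NP"), ("GEN", "an"), ("GEN", "idea"), ("CLOSE", None),
    ("CLOSE", None),
    ("GEN", "."),
    ("CLOSE", None),
]

def to_linearized_tree(actions):
    """Render the action sequence back into the bracketed string on the slide."""
    parts = []
    for kind, arg in actions:
        if kind == "OPEN":
            parts.append("(" + arg)
        elif kind == "GEN":
            parts.append(arg)
        else:  # CLOSE
            parts.append(")")
    return " ".join(parts)

print(to_linearized_tree(actions))
# (S (NP The man ) (VP had (NP an idea ) ) . )
```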

3 Generative models as rerankers
Since inference in G is difficult, systems that use these generative models have relied on base parsers, for which search is easy, and used the generative models as rerankers. Let B be the base parser and G be the top-down generative neural model. The procedure: B proposes a set of candidate parses for a sentence, y ~ p_B(y | x), and G chooses the candidate with the highest score, argmax_y p_G(x, y).
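A minimal sketch of this reranking pipeline, assuming hypothetical `base_parser.kbest` and `generative_model.log_prob` interfaces (these names are ours, not the systems' actual APIs):

```python
def rerank(sentence, base_parser, generative_model, k=50):
    """B -> G reranking sketch: B proposes candidates, G picks its favorite.

    `base_parser.kbest(sentence, k)` is assumed to return k candidate trees
    drawn from p_B(y | x); `generative_model.log_prob(sentence, tree)` is
    assumed to return log p_G(x, y). Both interfaces are hypothetical.
    """
    candidates = base_parser.kbest(sentence, k)  # y ~ p_B(y | x)
    return max(candidates,
               key=lambda tree: generative_model.log_prob(sentence, tree))
```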

4 Generative models as rerankers
The papers that introduced these generative models saw large gains in F1 on the Penn Treebank when using the generative neural model to choose from the base parser's output, both for the LSTM parser and for RNNG:

Charniak parser: 89.7 F1 → LSTM language model (GLSTM): 92.6 F1 (Choe and Charniak 2016)
RNNG-discriminative: 91.7 F1 → RNNG-generative (GRNNG): 93.3 F1 (Dyer et al. 2016)

The gap here is large enough that people have assumed G provides the power and B is just needed to make things fast. Is this actually what's happening?

5 B: Necessary evil, or secret sauce?
What role is B really playing here? Is it just a necessary evil, needed to make parsing with G efficient? Or is B the secret sauce that combines with G to produce these strong results? Should we try to get rid of B if we could? No: we'll find that it's better to combine the two models more directly than current reranking does, because B actually filters out model errors in G. Making the model combination explicit leads us to new state-of-the-art results on the Penn Treebank: 93.9 F1 on PTB; semi-supervised.

6 Using standard beam search for G
To do away with the base parser, we need a search procedure for G. Actions depend on unbounded context, so there is no dynamic program, but we can apply standard beam search: maintain a beam of partial parses that have all taken the same number of actions, keeping the highest-scoring partial parses at each time step. For the true parse (S (NP The man ..., the beam instead fills with hypotheses that just keep opening constituents, like (S (NP (VP (PP ... . This is a limitation of the search procedure rather than the model: it finds parses with lower model probability than the true parse, which simply falls off the beam too early. The results bear this out; even with a large beam size, F1 is low.

Beam size 100: GRNNG 29.1 F1, GLSTM 27.4 F1
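Roughly, the standard (action-synchronous) beam search looks like the sketch below; `model.next_actions`, `model.score`, and `model.is_complete` are hypothetical stand-ins for whichever generative model is being searched, not the actual implementations:

```python
import heapq

def action_beam_search(model, sentence, beam_size=100, max_steps=200):
    """Standard beam search sketch: every hypothesis takes one action per step.

    Because word-generation actions split probability mass over the whole
    vocabulary, structural actions tend to dominate the beam and the gold
    parse can fall off early, which is the failure mode described above.
    """
    beam = [(0.0, [])]  # (cumulative log-probability, action history)
    for _ in range(max_steps):
        expanded = []
        for logp, history in beam:
            if model.is_complete(history, sentence):
                expanded.append((logp, history))  # keep finished hypotheses
                continue
            for action in model.next_actions(history):
                expanded.append((logp + model.score(history, action),
                                 history + [action]))
        # Keep only the highest-scoring partial parses at this time step.
        beam = heapq.nlargest(beam_size, expanded, key=lambda item: item[0])
        if all(model.is_complete(h, sentence) for _, h in beam):
            break
    return max(beam, key=lambda item: item[0])
```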

7 Standard beam search in G fails
Word generation is lexicalized: since this is a generative model, it must also choose the word being produced, so the probability of a word-generation action is spread over the whole vocabulary. Looking more closely at the log probabilities of individual actions in the example (S (NP The man ) (VP had (NP an idea ) ) . ), the structural actions (opening and closing constituents) have much higher probabilities than the lexicalized word-generation actions. This low performance with standard beam search is likely why these models have been used as rerankers.

8 Word-synchronous beam search
Our approach, motivated by a line of similar work in the literature [Roark 2001; Titov and Henderson 2010; Charniak 2010; Buys and Blunsom 2015], is to modify the beam search so that it is organized around these high-perplexity events: the word-generation actions. We do beam search within each word, moving hypotheses onto a new beam as soon as they generate the next word. When this word beam reaches a certain size, search resumes from it for the following word. This allows equal competition, preventing word-generation actions from competing against the higher-probability structural actions. This is just a sketch; more details are in the paper, and a small code sketch follows below.
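A minimal sketch of the idea, with hypothetical `model.*` methods standing in for G's interface; the actual procedure in the paper differs in its details:

```python
import heapq

def word_synchronous_beam_search(model, sentence, word_beam=10, action_beam=100):
    """Word-synchronous beam search sketch.

    Hypotheses compete only against others that have generated the same number
    of words. Structural actions are explored on a per-word action beam; a
    hypothesis that generates the next word is set aside, and once enough
    word-level hypotheses have been collected, search resumes from them.
    """
    completed_words = [(0.0, [])]  # hypotheses synchronized after word i
    for _ in sentence:             # one synchronization step per word
        frontier = completed_words
        next_words = []
        while frontier and len(next_words) < word_beam:
            expanded = []
            for logp, history in frontier:
                for action in model.next_actions(history):
                    hyp = (logp + model.score(history, action), history + [action])
                    if model.is_word_generation(action):
                        next_words.append(hyp)   # synchronize on word generation
                    else:
                        expanded.append(hyp)     # keep exploring structure
            frontier = heapq.nlargest(action_beam, expanded, key=lambda it: it[0])
        completed_words = heapq.nlargest(word_beam, next_words, key=lambda it: it[0])
    return max(completed_words, key=lambda it: it[0])
```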

9 Word-synchronous beam search
[Figure: F1 versus beam size for GLSTM and GRNNG under word-synchronous search, with the B → GLSTM and B → GRNNG reranking results as baselines.]
This improved beam search gets much better F1 for roughly the same amount of search (e.g., beam size 100), although it is still low. Performance starts to level off as we approach fairly large beam sizes, but it remains below the B → G reranking setup. One possibility: search in the generative model might not be exploring it thoroughly enough. Or it could hint at model combination effects: searching in B and rescoring with G implicitly combines the two models, and perhaps this combination is partly responsible.

10 Finding model combination effects
Add G's search proposals to the candidate list. We construct a test to see whether beneficial model combination is happening in the original B → G reranking setup. We start with the set of candidates from the base parser B, add the candidate parses found by searching directly in the generative model G, and let G choose the parse it assigns highest probability from the combined list. If F1 goes up, it indicates that G is a strong model and we simply weren't doing enough search over it in the B → G setting. If F1 goes down, it shows that errors in G were being hidden by using only candidates from B, which is an indication of model combination. A small sketch of the diagnostic follows below.
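In code, the diagnostic is just a change to the candidate set (the function and variable names below are hypothetical, as before):

```python
def select_with_g(candidates, sentence, generative_model):
    # Let G pick its highest-scoring parse from whatever candidates it is given.
    return max(candidates,
               key=lambda tree: generative_model.log_prob(sentence, tree))

# Standard reranking: candidates come from B only.
#   parse_b = select_with_g(base_parser.kbest(sentence, k), sentence, G)
# Diagnostic: add the parses found by searching directly in G.
#   union = base_parser.kbest(sentence, k) + g_search_candidates
#   parse_union = select_with_g(union, sentence, G)
# If F1 drops for parse_union, B was filtering out parses that G mis-scores.
```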

11 Finding model combination effects
Here are the results of that experiment (F1 on PTB), first for standard B → G reranking, then adding candidates found by searching in the generative model:

RNNG generative model (G = GRNNG): candidates from B: 93.7; candidates from GRNNG ∪ B: 93.5
LSTM generative model (G = GLSTM): candidates from B: 93.5; candidates from GLSTM ∪ B: 92.8

Performance actually drops: when G is allowed a broader range of candidates to choose from, it does worse.

12 B hides model errors in G
Reranking performs implicit model combination: B hides model errors in G. This strongly suggests that the original reranking papers got good results partly because B was filtering out the slightly worse parses that G would otherwise choose. That leads to the next part of the talk: knowing that B and G are complementary, how can we improve further?

13 Making model combination explicit
Can we do better by simply combining the two models' scores? A simple approach is to select from the list of candidate parses using a weighted combination of scores from both models:

score(y) = λ · log p_B(y | x) + (1 − λ) · log p_G(x, y)

We tune λ to maximize F1 on the development set.
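As a sketch (again with hypothetical scoring interfaces for the two models), the combined selection rule is a one-liner once both log probabilities are kept around:

```python
def select_combined(candidates, sentence, base_parser, generative_model, lam):
    """Pick the candidate maximizing lam * log p_B(y|x) + (1 - lam) * log p_G(x, y).

    `lam` is tuned to maximize F1 on the development set; `log_prob` on both
    models is a hypothetical stand-in for their actual scoring code.
    """
    def combined_score(tree):
        return (lam * base_parser.log_prob(sentence, tree)
                + (1.0 - lam) * generative_model.log_prob(sentence, tree))
    return max(candidates, key=combined_score)
```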

14 Making model combination explicit
Starting from the graph we've seen before (selecting candidates using only the generative model's score), we now select using the interpolation of base and generative scores. F1 on PTB:

RNNG generative model (G = GRNNG): candidates from B: 93.9 (was 93.7); candidates from GRNNG ∪ B: 94.2 (was 93.5)
LSTM generative model (G = GLSTM): candidates from B: 93.9 (was 93.5); candidates from GLSTM ∪ B: 94.0 (was 92.8)

We see improvements in all cases. Using scores from both models more than compensates for the drop in performance we saw when adding candidates from the generative model, and the best overall results come from using candidates and scores from both models.

15 Explicit score combination prevents errors
So we've seen that combining the models more explicitly helps. Using the scores of both models to select from the base parser's candidates (B candidates → B + G scores) is fast, requiring almost no more work than the standard B → G setup. If we can also add candidates found by searching in the generative model (B ∪ G candidates → B + G scores), we do even better: this is the best configuration.

16 Comparison to past work
Final results (F1 on PTB), compared to past generative reranking systems:

Choe & Charniak 2016: 92.6; 93.8 with silver data
Dyer et al. 2016: 93.3
Kuncoro et al. 2017: 93.6
Ours (GRNNG ∪ B candidates, GRNNG + B scores): 93.5; 93.9 adding GLSTM; 94.7 adding silver data

Our base configuration combines candidates and scores from the RNNG generative model and the base parser; we improve further on model combination by using the scores from all three models. Choe and Charniak also trained semi-supervised on silver data, i.e., other parsers' output on a large unannotated text corpus; we do the same for all of our models.

17 Conclusion
We developed a search procedure for searching directly in the generative models G (a more effective version is forthcoming: Stern et al., EMNLP 2017). The procedure is still too slow to use as a practical parser, but it is enough to explore the space and see what is possible. We found model combination effects in B → G reranking: while the generative models are strong, they still make errors that the B → G setup helps compensate for. Explicit score combination helps even in the reranking setting, when we only have candidates from the base parser; it requires no more work than is currently being done, just keep the score from the base model and interpolate. Large improvements come from simple, explicit score combination: B ∪ G candidates scored with B + G. Hybrid systems can benefit from combining their parts more explicitly: using scores, and candidates if possible.

18 Thanks!

