Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing


1 Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing
Daniel Fried and Dan Klein
This talk is about training models that make local decisions to be aware of the non-local consequences of those decisions. I'll be talking about two ways of dealing with this for constituency parsing: dynamic oracles, which have been successful recently, and policy gradient, the focus of this paper.

4 Parsing by Local Decisions
[Tree diagram: (S (NP The cat) (VP took (NP a nap)) .)]
We'll be working with transition-based parsers. These produce parse trees incrementally, conditioned on the sentence, using some decomposition of the tree into actions. There are a variety of ways to derive the tree; historically bottom-up shift-reduce methods, but as a running example I'll show a top-down method that has been used successfully recently. It produces a linearized representation of the tree:
(S (NP The cat ) (VP …
In all of the transition-based parsers we'll look at, actions are predicted incrementally, conditioned on the input sentence and all of the past actions. The likelihood for a given sentence is the sum of the log probabilities of these local decisions:
$L(\theta) = \log p(y \mid x; \theta) = \sum_t \log p(y_t \mid y_{1:t-1}, x; \theta)$
Since these parsers make incremental predictions, they're fast, and in recent years they've obtained state-of-the-art results. But these local decisions may have non-local consequences.
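As a rough illustration of how such a locally decomposed likelihood is computed, here is a minimal Python sketch. The `model.score_actions` interface and the data format are hypothetical stand-ins, not the parsers used in the paper:

```python
import math

def sentence_log_likelihood(model, sentence, gold_actions):
    """Sum of log-probabilities of the gold actions, each conditioned on
    the sentence and all previous actions (the local decomposition above)."""
    history = []
    total = 0.0
    for gold in gold_actions:
        # Hypothetical method: returns a dict mapping each possible next
        # action to its probability, given the sentence and action history.
        probs = model.score_actions(sentence, history)
        total += math.log(probs[gold])
        history.append(gold)  # later decisions condition on earlier ones
    return total
```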

5 Non-local Consequences
Loss-evaluation mismatch:
[Two parses of "The cat took a nap.": the true parse $y$ and a prediction $\hat{y}$]
$\Delta(y, \hat{y}) = -\mathrm{F1}(y, \hat{y})$
Exposure bias:
True parse $y$: (S (NP The cat …
Prediction $\hat{y}$: (S (NP (VP ??
What are the consequences of producing structure using local decisions? Past work has identified a few potential problems. First, loss-evaluation mismatch: these parsers are trained to maximize the likelihood of each decomposed action, but that isn't necessarily aligned with the evaluation metrics we care about, which are defined on the entire structure in a way that doesn't decompose along the actions. Second, and more subtle, exposure bias: each prediction conditions on the entire past history, so the parser gets fed its own predictions. What if it makes a mistake, and now has a context unlike any it saw in training? Each mistake poisons future decisions, unless training deals with this in a principled way. [Ranzato et al. 2016; Wiseman and Rush 2016]

6 Dynamic Oracle Training
Explore at training time; supervise each state with an expert policy.
True parse $y$: (S (NP The cat …
Prediction $\hat{y}$ (sampled, or greedy): (S (NP (VP The … (exploring addresses exposure bias)
Oracle action $y_t^*$: the best next action from this state, typically chosen to maximize the achievable F1 (this addresses the loss mismatch).
$L(\theta) = \sum_t \log p(y_t^* \mid \hat{y}_{1:t-1}, x; \theta)$
Dynamic oracles are a way to address both issues: have the model follow its own predictions at training time, and supervise it with an expert, called a dynamic oracle, designed to tell you the best action to take from any parser state you could wind up in. [Goldberg & Nivre 2012; Ballesteros et al. 2016; inter alia]
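A minimal sketch of what dynamic-oracle training might look like. The `model`, `oracle`, and parser-state interfaces here are hypothetical; the paper's transition systems and oracles are more involved than this:

```python
import math
import random

def dynamic_oracle_update(model, oracle, sentence, gold_tree, greedy=False):
    """Follow the model's own (possibly wrong) actions, but supervise each
    visited state with the oracle's best action for that state."""
    state = model.initial_state(sentence)            # hypothetical parser state
    loss = 0.0
    while not state.is_final():
        probs = model.score_actions(state)           # dict: action -> probability
        best = oracle.best_action(state, gold_tree)  # expert supervision y*_t
        loss -= math.log(probs[best])                # log-loss on the oracle action
        # ...but follow the model's own prediction, exposing it to its mistakes
        action = max(probs, key=probs.get) if greedy else sample(probs)
        state = state.apply(action)
    return loss

def sample(probs):
    """Sample an action key proportionally to its probability."""
    r, acc = random.random(), 0.0
    for action, p in probs.items():
        acc += p
        if r <= acc:
            return action
    return action  # fall back to the last action on rounding error
```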

7 Dynamic Oracles Help!
Expert policies / dynamic oracles: Daume III et al., 2009; Ross et al., 2011; Choi and Palmer, 2011; Goldberg and Nivre, 2012; Chang et al., 2015; Ballesteros et al., 2016; Stern et al., 2017 (mostly dependency parsing)
PTB Constituency Parsing F1:
System                                          Static Oracle   Dynamic Oracle
Coavoux and Crabbé, 2016                             88.6            89.0
Cross and Huang, 2016                                91.0            91.3
Fernández-González and Gómez-Rodríguez, 2018         91.5            91.7
A long line of work shows that these dynamic oracles, or more generally expert policies, help improve performance. Much of this work has been for dependency parsing, but in recent years people have successfully applied them to constituency parsing as well, finding consistent improvements.

8 What if we don’t have a dynamic oracle? Use reinforcement learning
Dynamic oracles are great when they apply: they give direct supervision that helps deal with non-local consequences. But they need to be custom-designed for a given parser's transition system. In this work we investigate another, more general option, which you can use if you don't have a dynamic oracle: reinforcement learning, which discovers non-local consequences on its own. Take some actions predicted by your model, observe their effect on the entire output, and update accordingly.

9 Reinforcement Learning Helps! (in other tasks)
Machine translation: Auli and Gao, 2014; Ranzato et al., 2016; Shen et al., 2016; Xu et al., 2016; Wiseman and Rush, 2016; Edunov et al., 2017. Also CCG parsing and several others, including dependency parsing.
RL has a long history; these are some of the nearest neighbors to our work, applying RL to other structured prediction tasks in NLP.

10 Policy Gradient Training
Minimize the expected sequence-level cost (risk):
$R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})$
[Trees for "The man had an idea.": the true parse $y$ and a prediction $\hat{y}$, compared by the cost $\Delta(y, \hat{y})$]
$\nabla R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})\, \nabla \log p(\hat{y} \mid x; \theta)$
This can be applied across tasks with structured costs, so it directly addresses the loss-mismatch problem. Formally, we minimize this risk objective: the expectation under the model of the structured cost. Its gradient comes out to be an expectation of the gradient of the log-probability, scaled by the structured cost. We have all of the parts we need to compute this, and they address both issues: the expectation over $\hat{y}$ is computed by sampling from the model (addresses exposure bias), the cost $\Delta$ is computed as negative F1 (addresses loss mismatch), and the gradient term is computed in the same way as for the true tree. [Williams, 1992]
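A one-line justification of that gradient form, using only the standard log-derivative (REINFORCE) identity, not anything specific to this paper: since $\nabla p(\hat{y} \mid x; \theta) = p(\hat{y} \mid x; \theta)\, \nabla \log p(\hat{y} \mid x; \theta)$, we have
$\nabla R(\theta) = \sum_{\hat{y}} \Delta(y, \hat{y})\, \nabla p(\hat{y} \mid x; \theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})\, \nabla \log p(\hat{y} \mid x; \theta),$
which is an expectation under the model and can therefore be estimated by sampling $\hat{y} \sim p(\cdot \mid x; \theta)$.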

11 Policy Gradient Training
$\nabla R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})\, \nabla \log p(\hat{y} \mid x; \theta)$
Input $x$: The cat took a nap.
$k$ candidates $\hat{y}$: [four candidate parses of the sentence, e.g. differing in their S, S-INV, VP, and ADJP labels, with the true parse among them]
$\Delta(y, \hat{y})$ (negative F1): $-89$, $-80$, $-80$, $-100$
Gradient for each candidate: $\nabla \log p(\hat{y}_1 \mid x; \theta)$, $\nabla \log p(\hat{y}_2 \mid x; \theta)$, $\nabla \log p(\hat{y}_3 \mid x; \theta)$, $\nabla \log p(y \mid x; \theta)$
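A minimal sketch of this update. The `sample_tree`, `accumulate_scaled_log_prob_grad`, and `apply_gradients` methods and the `f1_score` function are hypothetical stand-ins; batching and the exact candidate set used in the paper are abstracted away:

```python
def policy_gradient_update(model, sentence, gold_tree, f1_score, k=4):
    """One risk-minimization step: gather k candidate trees, score each by
    negative F1 against the gold tree, and accumulate gradients of
    cost * log p(candidate | sentence), as on the slide above."""
    candidates = [model.sample_tree(sentence) for _ in range(k - 1)]
    candidates.append(gold_tree)  # keep the true parse among the candidates
    for cand in candidates:
        cost = -f1_score(gold_tree, cand)   # negative F1, e.g. -100 for a perfect match
        # Hypothetical method: accumulates cost * grad log p(cand | sentence).
        model.accumulate_scaled_log_prob_grad(sentence, cand, scale=cost)
    model.apply_gradients()                 # optimizer step
```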

12 Experiments

13 Setup
Parsers: Span-Based [Cross & Huang, 2016]; Top-Down [Stern et al., 2017]; RNNG [Dyer et al., 2016]; In-Order [Liu and Zhang, 2017]
Training: static oracle; dynamic oracle; policy gradient
We train these four parsers, each with a different transition system for decomposing the tree, and compare the three training methods described so far: the static oracle (standard training, no exploration), dynamic-oracle exploration (where an oracle has been defined for the parser's transition system), and policy gradient. We also compare to another way of addressing loss mismatch, based on softmax margin; see the paper for details.

14 English PTB F1
Policy gradient improves on the static oracle in every setting.
We also compare to dynamic oracle training. Two of these parsers have existing dynamic oracles. One problem is that you may not always have one: for the sake of comparison, we created a new dynamic oracle for RNNG in this paper, and we find it is effective at supervising exploration, but you don't have an oracle until you invent it, and In-Order doesn't have a published one yet. The dynamic oracle is sometimes meaningfully better than policy gradient and sometimes worse; both result in improvements.

15 Training Efficiency
PTB learning curves for the Top-Down parser.
Here is a real advantage of the dynamic oracle: it can be a lot faster. This is a learning curve showing development accuracy by training epoch, i.e. the number of passes through the training data.

16 French Treebank F1
Whether the dynamic oracle is better varies, but policy gradient is better than the static oracle in almost all cases.

17 Chinese Penn Treebank v5.1 F1
Gains from both policy gradient and dynamic oracle.

18 Conclusions
Local decisions can have non-local consequences: loss mismatch and exposure bias.
How do we deal with the issues caused by local decisions? Dynamic oracles: efficient, but model-specific. Policy gradient: slower to train, but general purpose.
More than before, we depend on local decisions, and they have non-local consequences: training doesn't directly optimize the cost you may be interested in, and the model may not be prepared to handle its own mistakes. We compared two methods for dealing with these issues. Dynamic oracles are effective when they're available, but they're a knowledge-intensive solution. We find you can often get similar results with a more generic approach, trading off training efficiency for agnosticism about the model: if you don't have a dynamic oracle, you can just use policy gradient; it's simple and typically effective.

19 Thank you!

20 For Comparison: A Novel Oracle for RNNG
Example true parse fragment: (NP The man ) (VP had …
The oracle checks the following, in order, to determine the next action (with example states):
1. Close the current constituent if it is a true constituent, e.g. (S (NP The man ) …, or if it could never become a true constituent, e.g. (S (VP (NP The man ) ).
2. Otherwise, open the outermost unopened true constituent at this position: (S (NP The man ) (VP
3. Otherwise, shift the next word: (S (NP The man ) (VP had
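A rough sketch of these three rules as code. The state and tree helpers (`current_constituent`, `is_true_constituent`, `can_become_true_constituent`, `outermost_unopened_true_constituent`) are hypothetical stand-ins for the checks described above, not the paper's actual implementation:

```python
def rnng_oracle_action(state, gold_tree):
    """Pick the next action for an RNNG-style top-down parser state,
    following the three rules on this slide (sketch only)."""
    current = state.current_constituent()      # the innermost open bracket, or None
    # 1. Close it if it already matches a gold constituent, or if no
    #    continuation could ever turn it into one.
    if current is not None and (
        gold_tree.is_true_constituent(current)
        or not gold_tree.can_become_true_constituent(current)
    ):
        return ("REDUCE",)
    # 2. Otherwise open the outermost gold constituent starting at this
    #    position that has not been opened yet.
    label = gold_tree.outermost_unopened_true_constituent(state)
    if label is not None:
        return ("NT", label)                   # open a nonterminal, e.g. ("NT", "VP")
    # 3. Otherwise shift the next word of the sentence.
    return ("SHIFT",)
```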

