Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing


Policy Gradient as a Proxy for Dynamic Oracles in Constituency Parsing
Daniel Fried and Dan Klein

This talk is about training models that make local decisions to be aware of non-local consequences. I'll be talking about two ways of dealing with this for constituency parsing: dynamic oracles (which have been successful recently) and policy gradient (the focus of this paper).


Parsing by Local Decisions

[Slide figure: parse tree for "The cat took a nap." with S, NP, and VP nodes.]

We'll be working with transition-based parsers. These produce parse trees incrementally, conditioned on the sentence, using some decomposition of the tree. There are a variety of ways to derive the tree, historically bottom-up shift-reduce methods, but as a running example I'll show a top-down method used successfully in recent work. It produces a linearized representation of the tree:

(S (NP The cat ) (VP …

In all of the transition-based parsers we'll look at, actions are predicted incrementally, conditioned on the input sentence and all of the past actions. The likelihood for a given sentence is the sum of the log probabilities of these local decisions:

$L(\theta) = \log p(y \mid x; \theta) = \sum_t \log p(y_t \mid y_{1:t-1}, x; \theta)$

Since these parsers make incremental predictions, they're fast, and in recent years they've obtained state-of-the-art results. But these local decisions may have non-local consequences.
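To make the decomposition concrete, here is a minimal Python sketch of how that likelihood factors into local decisions. The `score_action` function and `action_vocab` argument are hypothetical placeholders for a real parser's scoring model, not part of the paper.

```python
import math

def sequence_log_prob(actions, sentence, score_action, action_vocab):
    """log p(y | x) = sum_t log p(y_t | y_{1:t-1}, x) for a transition-based parser.

    score_action(prefix, sentence, a) is a hypothetical function returning an
    unnormalized score for taking action `a` given the action prefix and input.
    """
    total = 0.0
    for t, action in enumerate(actions):
        prefix = actions[:t]
        scores = {a: score_action(prefix, sentence, a) for a in action_vocab}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[action] - log_z  # local log-probability of y_t
    return total
```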

Non-local Consequences

Loss-evaluation mismatch. [Slide figure: true parse $y$ and predicted parse $\hat{y}$ for "The cat took a nap.", with cost $\Delta(y, \hat{y}) = -\mathrm{F1}(y, \hat{y})$.] These parsers are trained to maximize the likelihood of each decomposed action, but that isn't necessarily aligned with the evaluation metrics we care about, which are defined on the entire structure in a way that doesn't decompose along the actions.

Exposure bias. A second, more subtle issue: each prediction conditions on the entire past history, so the parser gets fed its own predictions. What if it makes a mistake and now has some context that is unlike any it saw in training?

True parse $y$:      (S (NP The cat …
Prediction $\hat{y}$: (S (NP (VP ??

Each mistake poisons future decisions, unless you have a principled way of dealing with this in training. [Ranzato et al. 2016; Wiseman and Rush 2016]

Dynamic Oracle Training

Explore at training time; supervise each state with an expert policy.

True parse $y$:                          (S (NP The cat …
Prediction $\hat{y}$ (sample or greedy): (S (NP (VP The …
Oracle $y^*$:                            the best continuation from the current (possibly erroneous) state

Dynamic oracles are a way to address both of these issues. Have the model follow its own predictions at training time, and supervise with an expert, called a dynamic oracle, designed to tell you the best action to take from any parser state you could wind up in:

$L(\theta) = \sum_t \log p(y^*_t \mid \hat{y}_{1:t-1}, x; \theta)$

Following the model's own predictions addresses exposure bias; choosing $y^*_t$ to maximize the achievable F1 (typically) addresses loss mismatch. [Goldberg & Nivre 2012; Ballesteros et al. 2016; inter alia]
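A minimal sketch of this exploration loop, with hypothetical `model` and `oracle` interfaces standing in for a concrete parser and its dynamic oracle (these names are illustrative, not the paper's code):

```python
import math
import random

def dynamic_oracle_loss(model, oracle, sentence, gold_tree, explore_prob=0.9):
    """One training example: follow the model's own actions, supervise with the oracle.

    Hypothetical interface:
      model.action_probs(prefix, sentence) -> dict {action: probability}
      model.is_finished(prefix, sentence)  -> True once the parse is complete
      oracle(prefix, sentence, gold_tree)  -> best next action from this state
    Returns -sum_t log p(y*_t | y_hat_{1:t-1}, x), the loss on the explored trajectory.
    """
    prefix, loss = [], 0.0
    while not model.is_finished(prefix, sentence):
        probs = model.action_probs(prefix, sentence)
        oracle_action = oracle(prefix, sentence, gold_tree)
        loss -= math.log(probs[oracle_action])  # supervise this state with the oracle
        # Follow the model's own (possibly wrong) choice with some probability, so
        # training-time states resemble test-time states (addresses exposure bias).
        if random.random() < explore_prob:
            next_action = max(probs, key=probs.get)  # or sample from probs
        else:
            next_action = oracle_action
        prefix.append(next_action)
    return loss
```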

Dynamic Oracles Help!

Expert policies / dynamic oracles: Daumé III et al., 2009; Ross et al., 2011; Choi and Palmer, 2011; Goldberg and Nivre, 2012; Chang et al., 2015; Ballesteros et al., 2016; Stern et al., 2017 (mostly dependency parsing).

PTB Constituency Parsing F1:

System                                          Static Oracle   Dynamic Oracle
Coavoux and Crabbé, 2016                        88.6            89.0
Cross and Huang, 2016                           91.0            91.3
Fernández-González and Gómez-Rodríguez, 2018    91.5            91.7

A long line of work shows that these dynamic oracles, or more generally expert policies, help improve performance. Much of this work has been for dependency parsing, but in recent years people have applied them to constituency parsing as well, finding consistent improvements.

What if we don't have a dynamic oracle? Use reinforcement learning.

Dynamic oracles are great if they apply: they give direct supervision that helps deal with non-local consequences. But they need to be custom-designed for a given parser's transition system. In this work we investigate another, more general option, which you can use if you don't have a dynamic oracle. Reinforcement learning lets the model discover non-local consequences: take some actions predicted by your model, observe their effect on the entire output, and update accordingly.

Reinforcement Learning Helps! (in other tasks)

Machine translation: Auli and Gao, 2014; Ranzato et al., 2016; Shen et al., 2016; Xu et al., 2016; Wiseman and Rush, 2016; Edunov et al., 2017. Several other structured tasks, including CCG parsing and dependency parsing.

RL has a long history; these are some of the nearest neighbors to our work, applying RL to other structured prediction tasks in NLP.

Policy Gradient Training

Minimize the expected sequence-level cost (risk):

$R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})$

[Slide figure: true parse $y$ and a predicted parse $\hat{y}$ for "The man had an idea.", with cost $\Delta(y, \hat{y})$.]

This can be applied across tasks with structured costs and directly addresses the loss-mismatch problem. The gradient of the risk objective comes out to be an expectation of the model's gradient, scaled by the structured cost:

$\nabla R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})\, \nabla \log p(\hat{y} \mid x; \theta)$

We have all the parts we need to compute this, and they address both issues: the expectation is computed by sampling from the model (addresses exposure bias), the cost is computed with F1 (addresses loss mismatch), and the gradient term is computed in the same way as for the true tree. [Williams, 1992]
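The gradient form follows from the log-derivative identity $\nabla p = p\, \nabla \log p$; this short derivation is the standard score-function (REINFORCE) argument, not something spelled out on the slide:

$\nabla R(\theta) = \sum_{\hat{y}} \nabla p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y}) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \nabla \log p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})$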

Policy Gradient Training

$\nabla R(\theta) = \sum_{\hat{y}} p(\hat{y} \mid x; \theta)\, \Delta(y, \hat{y})\, \nabla \log p(\hat{y} \mid x; \theta)$

[Slide figure: input $x$ = "The cat took a nap.", $k$ candidate parses $\hat{y}$ (including the true tree $y$), their costs $\Delta(y, \hat{y})$ as negative F1 (−89, −80, −80, −100), and the gradient of each candidate's log-probability, $\nabla \log p(\hat{y} \mid x; \theta)$.]

In practice, the expectation is approximated with $k$ candidate trees sampled from the model; each candidate's log-probability gradient is scaled by its negative F1 against the true tree.
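A minimal sketch of one such update in Python, with hypothetical `model.sample_tree`, `model.log_prob`, and `f1` helpers (placeholders for a real parser and metric, not the paper's code):

```python
def policy_gradient_loss(model, sentence, gold_tree, f1, k=4, include_gold=True):
    """Monte Carlo surrogate for the risk gradient:
    grad R(theta) ~ (1/k) sum_i Delta(y, y_i) grad log p(y_i | x; theta),
    with Delta(y, y_i) = -F1(y, y_i).

    Hypothetical interface:
      model.sample_tree(sentence)     -> a tree sampled from p(. | x; theta)
      model.log_prob(tree, sentence)  -> differentiable log p(tree | x; theta)
      f1(gold_tree, tree)             -> labeled bracketing F1 in [0, 100]
    """
    candidates = [model.sample_tree(sentence) for _ in range(k)]
    if include_gold:
        candidates.append(gold_tree)  # the true tree enters with cost -F1 = -100
    loss = 0.0
    for tree in candidates:
        cost = -f1(gold_tree, tree)                     # Delta(y, y_hat)
        loss += cost * model.log_prob(tree, sentence)   # scale the log-prob gradient
    return loss / len(candidates)
```

Differentiating this surrogate loss with respect to the model parameters gives the sampled estimate of $\nabla R(\theta)$ above.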

Experiments

Setup

Parsers: Span-Based [Cross & Huang, 2016], Top-Down [Stern et al., 2017], RNNG [Dyer et al., 2016], In-Order [Liu and Zhang, 2017].
Training: static oracle, dynamic oracle, policy gradient.

We train these four parsers, each with a different transition system for decomposing the tree, and compare the three training methods described above: the static oracle (standard training, no exploration), dynamic oracle exploration (when an oracle has been defined for the parser's transition system), and policy gradient. We also compare to another way of addressing loss mismatch, based on softmax margin; see the paper for details.

English PTB F1

[Slide figure: English PTB F1 for each parser under each training method.]

Policy gradient improves on the static oracle in every setting. We also compare to dynamic oracle training. Two of these parsers have existing dynamic oracles, but you may not always have one: for the sake of comparison, we constructed a new dynamic oracle for RNNG in this paper, and we find that it's effective at supervising exploration. Still, you don't have an oracle until you invent it, and none has been published for the In-Order parser yet. The dynamic oracle is sometimes meaningfully better than policy gradient and sometimes worse, but both result in improvements.

Training Efficiency

[Slide figure: PTB learning curves for the Top-Down parser.]

Here is the real advantage of the dynamic oracle: it can be much faster to train with. This learning curve shows development accuracy by training epoch, i.e. the number of passes through the training data.

French Treebank F1 Whether the dynamic oracle is better varies, but policy gradient improves on the static oracle in almost all cases.

Chinese Penn Treebank v5.1 F1 There are gains from both policy gradient and the dynamic oracle.

Conclusions

Local decisions can have non-local consequences: loss mismatch and exposure bias. How do we deal with the issues caused by local decisions? Dynamic oracles are efficient but model-specific; policy gradient is slower to train but general purpose.

More than before, we depend on local decisions, and they have non-local consequences: training doesn't directly optimize the cost you may be interested in, and the model may not be prepared to handle its own mistakes. We compared two methods for dealing with these issues. Dynamic oracles are effective when they're available, but they're a knowledge-intensive solution. We find you can often get similar results with a more generic approach, trading off training efficiency for agnosticism about the model: if you don't have a dynamic oracle, you can just use policy gradient, which is simple and typically effective.

Thank you!

For Comparison: A Novel Oracle for RNNG

Gold tree (partial): (NP The man ) (VP had …

These are the conditions the oracle checks, in order, to determine the next action, along with example parser states (a code sketch follows below):

1. Close the current constituent if it's a true constituent, e.g. (S (NP The man ), or if it could never be a true constituent, e.g. (S (VP (NP The man ) ).
2. Otherwise, open the outermost unopened true constituent at this position, e.g. (S (NP The man ) (VP.
3. Otherwise, shift the next word, e.g. (S (NP The man ) (VP had.
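A minimal sketch of that decision rule in Python, with a hypothetical `state`/`gold` interface (the helper names below are illustrative placeholders, not the paper's implementation):

```python
def rnng_oracle_action(state, gold):
    """Pick the next oracle action for a top-down (RNNG-style) parser state.

    Hypothetical interface:
      state.open_label(), state.open_span(): the currently open constituent
      state.next_word_index(): index of the next unshifted word
      gold.has_bracket(label, span): the span is a gold constituent with this label
      gold.reachable(label, span): the open bracket could still become a gold one
      gold.unopened_at(index, state): gold constituents starting at this position
        that are not yet opened, ordered outermost first
    """
    label, span = state.open_label(), state.open_span()

    # 1. Close the current constituent if it is a true constituent,
    #    or if it could never become one.
    if gold.has_bracket(label, span) or not gold.reachable(label, span):
        return "REDUCE"

    # 2. Otherwise, open the outermost unopened true constituent at this position.
    unopened = gold.unopened_at(state.next_word_index(), state)
    if unopened:
        return ("NT", unopened[0])

    # 3. Otherwise, shift the next word.
    return "SHIFT"
```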