Adversarial Learning for Neural Dialogue Generation

Adversarial Learning for Neural Dialogue Generation. 2017.2.17, Zhang Yan. Paper: Li, Jiwei, et al. "Adversarial Learning for Neural Dialogue Generation." arXiv preprint arXiv:1701.06547 (2017).

Goal: to train the generator "to produce sequences that are indistinguishable from human-generated dialogue utterances."

Main Contribution: propose an adversarial training approach for response generation and cast the model in the framework of reinforcement learning.

Adversarial Reinforcement Model

Adversarial Training: a minimax game between the Generator and the Discriminator.
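
For reference, the standard GAN minimax objective that this game instantiates, written for the dialogue setting (x is the dialogue history, y a response); this is the generic objective, not copied from the slide:

```latex
\min_G \max_D \;
\mathbb{E}_{(x,y) \sim p_{\text{human}}}\big[\log D(x, y)\big]
+ \mathbb{E}_{x \sim p_{\text{human}},\; \hat{y} \sim G(\cdot \mid x)}
  \big[\log\big(1 - D(x, \hat{y})\big)\big]
```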

Model Breakdown. The model has two main parts, G and D. Generative Model (G): generates a response y given a dialogue history x consisting of a sequence of dialogue utterances; a standard Seq2Seq model with an attention mechanism. Discriminative Model (D): a binary classifier that takes as input a sequence of dialogue utterances {x, y} and outputs a label indicating whether the input was generated by a human or by the machine; a hierarchical encoder followed by a 2-class softmax, which returns the probability that the input dialogue episode is machine- or human-generated.
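
As a concrete picture of the discriminator described above, a minimal PyTorch sketch with hypothetical layer sizes and tensor layout: an utterance-level LSTM encodes each utterance, a context-level LSTM encodes the sequence of utterance vectors (the hierarchical encoder), and a 2-class softmax produces the human/machine probabilities.

```python
import torch
import torch.nn as nn

class HierarchicalDiscriminator(nn.Module):
    """Sketch of a hierarchical-encoder discriminator (hypothetical sizes)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utterance_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.context_rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # human vs. machine

    def forward(self, dialogue):
        # dialogue: (batch, n_utterances, n_tokens) of token ids
        batch, n_utt, n_tok = dialogue.shape
        tokens = self.embed(dialogue.reshape(batch * n_utt, n_tok))
        _, (utt_h, _) = self.utterance_rnn(tokens)      # encode each utterance
        utt_vecs = utt_h[-1].view(batch, n_utt, -1)     # one vector per utterance
        _, (ctx_h, _) = self.context_rnn(utt_vecs)      # encode the whole episode
        logits = self.classifier(ctx_h[-1])
        return torch.softmax(logits, dim=-1)            # [P(human), P(machine)], by the convention used here
```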

Seq2Seq Models for Response Generation (Sutskever et al., 2014; Jean et al., 2014). Source: input messages. Target: responses.

Seq2Seq Models with Attention Mechanism [Luong et al., 2015]. The attention mechanism predicts the output y using a weighted-average context vector c, not just the last encoder state.
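
Written out, Luong-style attention computes weights over the encoder states given the current decoder state, forms the context vector as their weighted average, and combines it with the decoder state (a standard formulation, shown here for reference; notation not taken from the slide):

```latex
\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
                   {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)},
\qquad
c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s,
\qquad
\tilde{h}_t = \tanh\!\big(W_c\,[c_t; h_t]\big)
```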

Training Methods. Policy gradient methods: the score the discriminator assigns to the current utterance being human-generated is used as a scalar reward for the generator, which is trained to maximize the expected reward of generated utterances using the REINFORCE algorithm. The gradient of this expected reward is approximated by the likelihood-ratio trick; a baseline value is subtracted from the classification score to reduce the variance of the estimate while keeping it unbiased, and the policy (the generator) is updated in the direction of the resulting reward gradient in parameter space.
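
Assembling the labels from the slide, this is the standard likelihood-ratio (REINFORCE) gradient estimate, reconstructed in a form consistent with the paper's notation, where Q+({x, y}) is the discriminator's classification score used as the scalar reward and b({x, y}) is the baseline:

```latex
J(\theta) = \mathbb{E}_{y \sim p(y \mid x)}\!\left[\, Q_{+}(\{x, y\}) \,\right],
\qquad
\nabla_{\theta} J(\theta) \approx
\left[\, Q_{+}(\{x, y\}) - b(\{x, y\}) \,\right]
\nabla_{\theta} \sum_{t} \log p(y_t \mid x, y_{1:t-1})
```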

Training Methods (Cont'd). Problem with REINFORCE: the expectation of the reward is approximated by only one sample, and the reward associated with that sample is used for all actions. Example: Input: What's your name? Human: I am John. Machine: I don't know. The vanilla REINFORCE model assigns the same negative reward to every token of the machine response [I, don't, know], because a single reward is computed for the whole sequence "I don't know". Proper credit assignment in training would give separate rewards: most likely a neutral reward for the token I, and negative rewards for don't and know. The authors call this Reward for Every Generation Step (REGS).

Reward for Every Generation Step (REGS). We need rewards for intermediate steps. Two strategies are introduced: (1) Monte Carlo (MC) search; (2) training the discriminator to reward partially decoded sequences.

Monte Carlo Search. Given a partially decoded sequence s, the model keeps sampling tokens from the distribution until decoding finishes. This is repeated N times (the N generated sequences share the common prefix s). The N complete sequences are fed to the discriminator, and their average score is used as the reward for s. The drawback is that this is time-consuming. The discriminator itself is trained with supervised learning to minimize the cross-entropy loss L(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ), where y is the ground-truth label of the input sequence and ŷ is the probability predicted by the discriminator.
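
A minimal sketch of how the rollout reward could be computed; `generator.sample_continuation` and `discriminator.prob_human` are hypothetical helpers, not the authors' actual code.

```python
def mc_reward(generator, discriminator, x, prefix, n_rollouts=5):
    """Estimate the reward for a partially decoded sequence `prefix` by
    completing it `n_rollouts` times and averaging the discriminator score."""
    scores = []
    for _ in range(n_rollouts):
        full_y = generator.sample_continuation(x, prefix)   # finish the sequence from the shared prefix
        scores.append(discriminator.prob_human(x, full_y))  # score the completed {x, y}
    return sum(scores) / len(scores)
```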

Rewarding Partially Decoded Sequences. Directly train a discriminator that can assign rewards to both fully and partially decoded sequences, by breaking generated sequences into partial sequences. Problem: earlier actions in a sequence are shared among multiple training examples for the discriminator, which results in overfitting. The authors propose a strategy similar to the one used in AlphaGo to mitigate the problem.

Rewarding Partially Decoded Sequences. For each collection of subsequences of Y, randomly sample only one example from the positive (human) examples and one from the negative (machine) examples; these are used to update the discriminator. This is time-efficient but less accurate than the MC model.
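
One way to picture that subsampling step, as a small sketch; `positive_subseqs` and `negative_subseqs` are hypothetical lists of partial sequences (human vs. machine) derived from one dialogue Y.

```python
import random

def sample_regs_pair(positive_subseqs, negative_subseqs):
    """Use a single positive and a single negative partial sequence per dialogue,
    instead of every prefix, to limit overfitting on shared early tokens."""
    pos = random.choice(positive_subseqs)
    neg = random.choice(negative_subseqs)
    return pos, neg
```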

Rewarding Partially Decoded Sequences. The generator is updated with a per-step policy gradient: each partially decoded sequence receives its own classification score from the discriminator minus a baseline value, and the policy is moved along the resulting gradient in parameter space: ∇J(θ) ≈ Σ_t (Q+(x, Y_1:t) − b(x, Y_1:t)) ∇ log p(y_t | x, Y_1:t−1).

Teacher Forcing. The generative model is still unstable, because it can only be indirectly exposed to the gold-standard target sequences through the reward passed back from the discriminator; this reward is only used to promote or discourage the generator's own generated sequences. This is fragile: once the generator accidentally deteriorates in some training batches, and the discriminator consequently does an extremely good job at recognizing sequences from the generator, the generator immediately gets lost. It knows that its generated results are bad, but it does not know what results are good.

Teacher Forcing. The authors propose also feeding human-generated responses to the generator for model updates: the discriminator automatically assigns a reward of 1 to a human response, and the generator uses this reward to update itself on that response. This is analogous to having a teacher intervene and force the generator to produce the true responses.

Pseudocode for the Algorithm
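
The pseudocode itself is not reproduced in the transcript; the following is a rough sketch of the training loop as described on the preceding slides (discriminator updates, REINFORCE updates for the generator, and a teacher-forcing step), using hypothetical generator/discriminator helpers rather than the authors' code.

```python
# Sketch only: `generator`, `discriminator`, and `data` are hypothetical objects whose
# methods (sample, train_step, prob_human, baseline, reinforce_step, mle_step)
# stand in for the actual model code.
def adversarial_training(generator, discriminator, data,
                         n_iters=10000, d_steps=5, g_steps=1):
    for _ in range(n_iters):
        # 1) Update D: distinguish human responses from generated ones.
        for _ in range(d_steps):
            x, y_human = data.sample_dialogue()
            y_fake = generator.sample(x)
            discriminator.train_step(x, y_human, label=1)  # human-generated
            discriminator.train_step(x, y_fake, label=0)   # machine-generated

        # 2) Update G with the REINFORCE reward from D.
        for _ in range(g_steps):
            x, y_human = data.sample_dialogue()
            y_fake = generator.sample(x)
            reward = discriminator.prob_human(x, y_fake)   # scalar reward Q+
            baseline = generator.baseline(x)               # variance reduction
            generator.reinforce_step(x, y_fake, reward - baseline)

            # 3) Teacher forcing: the gold response gets reward 1,
            #    so G is also directly exposed to good targets.
            generator.mle_step(x, y_human)
```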

Result: the adversarially trained system generates higher-quality responses than the previous baselines.

Notes. The approach did not show great performance on the abstractive summarization task, perhaps because the adversarial training strategy is more beneficial to: tasks in which there is a big discrepancy between the distributions of the generated sequences and the reference target sequences; and tasks in which the input sequences do not bear all the information needed to generate the target, in other words, tasks with no single correct target sequence in the semantic space.