Lecture 10 SeqGAN, Chatbot, Reinforcement Learning


Based on the following two papers, and on H.-Y. Lee's lecture notes:
L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017.
J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547v4, 2017.

Maximizing Expected Reward The chatbot (an encoder-generator) produces a response x to an input history h; a human, in place of a discriminator, assigns a reward R(h,x), and we update θ. We wish to maximize the expected reward: θ* = arg maxθ γθ, where γθ = Σh P(h) Σx R(h,x) Pθ(x|h) ≈ (1/N) Σi=1..N R(hi, xi), estimated by sampling (h1,x1), …, (hN,xN). But now, how do we differentiate this with respect to θ?
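The sampling approximation above reads directly as code. A minimal sketch, assuming hypothetical sample_history, sample_response, and reward functions standing in for P(h), Pθ(x|h), and the human (or discriminator) reward R(h,x):

    def estimate_expected_reward(sample_history, sample_response, reward, N=1000):
        # gamma_theta = sum_h P(h) sum_x R(h,x) P_theta(x|h)
        #            ~= (1/N) sum_i R(h_i, x_i), with h_i ~ P(h), x_i ~ P_theta(x|h_i)
        total = 0.0
        for _ in range(N):
            h = sample_history()       # h_i ~ P(h)
            x = sample_response(h)     # x_i ~ P_theta(x|h_i)
            total += reward(h, x)      # R(h_i, x_i)
        return total / N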

Policy gradient
∇γθ = Σh P(h) Σx R(h,x) ∇Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇Pθ(x|h) / Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇log Pθ(x|h)
    ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)   (sampling)
But how do we compute this in practice?
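With automatic differentiation, the sampled gradient corresponds to minimizing the negative reward-weighted log-likelihood. A minimal REINFORCE-style sketch, assuming PyTorch, where log_probs holds log Pθ(xi|hi) on the computation graph and rewards holds the (constant) R(hi, xi):

    import torch

    def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
        # Minimizing -(1/N) sum_i R(h_i,x_i) log P_theta(x_i|h_i) is gradient ascent
        # on gamma_theta; rewards are detached so no gradient flows through R.
        return -(rewards.detach() * log_probs).mean()

    # Usage sketch:
    #   loss = policy_gradient_loss(log_probs, rewards)
    #   loss.backward()    # gives -(1/N) sum_i R(h_i,x_i) grad log P_theta(x_i|h_i)
    #   optimizer.step()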

Policy gradient
Gradient ascent: θnew ← θold + η ∇γθold, with
∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
Note: 1. Without the weight R(h,x), this is just maximum likelihood. 2. Without R(h,x), we already know how to do this. 3. To approximate the weighted update, we can: if R(hi,xi) = k, repeat (hi,xi) k times in the training data; if R(hi,xi) = −k, repeat (hi,xi) k times with learning rate −η.
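Note 3 can be implemented by replicating each sampled pair according to its (integer) reward and then training with ordinary maximum likelihood. A small sketch, with hypothetical data structures:

    def build_weighted_dataset(samples):
        # samples: list of (h, x, reward) triples with integer rewards.
        positive, negative = [], []
        for h, x, r in samples:
            bucket = positive if r >= 0 else negative
            bucket.extend([(h, x)] * abs(int(r)))   # repeat (h_i, x_i) |r| times
        # Train with maximum likelihood on `positive` (learning rate +eta)
        # and on `negative` with learning rate -eta.
        return positive, negative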

If R(hi,xi) is always positive: because Pθ(x|h) is a probability distribution (it sums to 1 over x), every sampled response has its probability pushed up. In the ideal case, where all responses (h,x1), (h,x2), (h,x3) are covered, this is fine: responses with larger reward are pushed up more. But due to sampling, a response that is never sampled, e.g. (h,x1), has its probability pushed down even if it is a good response.
[Figure: bar charts of Pθ(x|h) over (h,x1), (h,x2), (h,x3), ideal case vs. the case where (h,x1) is not sampled.]

Solution: subtract a baseline If R(hi,xi) is always positive, we subtract a baseline b:
(1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)  →  (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
Now only responses with reward above the baseline have their probability pushed up, so responses that were not sampled are no longer unfairly suppressed.
[Figure: Pθ(x|h) over (h,x1), (h,x2), (h,x3), before and after subtracting the baseline.]
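The earlier loss sketch extends directly. Here the baseline b is taken to be the batch-average reward, one common choice (the papers may use a different baseline); assumed PyTorch:

    import torch

    def policy_gradient_loss_with_baseline(log_probs: torch.Tensor,
                                           rewards: torch.Tensor) -> torch.Tensor:
        b = rewards.mean()                       # baseline b
        advantages = (rewards - b).detach()      # R(h_i, x_i) - b
        return -(advantages * log_probs).mean()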

Chatbot by SeqGAN Let's replace the human by a discriminator with reward function
R(h,x) = λ1 r1(h,x) + λ2 r2(h,x) + λ3 r3(h,x),
where r1 encourages continuation, r2 encourages saying something new, and r3 encourages semantic coherence.

Chatbot by conditional GAN
[Image: http://www.nipic.com/show/3/83/3936650kd7476069.html]
Generator: the chatbot, an encoder-decoder (En-De), takes the input sentence/history h and produces a response sentence x.
Discriminator: takes the input sentence/history h together with a response sentence x and judges real or fake, trained against human dialogues.

Can we do backpropagation? The chatbot (encoder-decoder) generates its response word by word, starting from <BOS> (e.g. "A", "A", "B"), and the discriminator's encoder maps (h, x) to a scalar used to update the chatbot. The problem: the output words are obtained by sampling discrete tokens, so tuning the generator a little bit will not change the output, and gradients cannot flow back through this step. One alternative that ignores the sampling process is an improved WGAN on the word distributions.
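A tiny illustration of where the gradient stops, assuming PyTorch: the sampled token is a discrete index and carries no gradient back to the word scores.

    import torch

    logits = torch.randn(1, 5, requires_grad=True)     # scores over a toy 5-word vocabulary
    probs = torch.softmax(logits, dim=-1)               # differentiable word distribution
    token = torch.multinomial(probs, num_samples=1)     # discrete sample, e.g. tensor([[3]])
    print(token.requires_grad)                          # False: no gradient through the sample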

SeqGAN solution, using RL Use the output of the discriminator as the reward, and update the generator to increase the discriminator score, i.e. to get maximum reward:
∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi), where R(hi, xi) is the discriminator's score.
Different from typical RL: the discriminator is itself updated during training.

Alternating g-steps and d-steps
g-step: sample (h1,x1), …, (hN,xN) with the current generator and obtain rewards R(h1,x1), …, R(hN,xN) from the discriminator. New objective:
(1/N) Σi=1..N R(hi, xi) log Pθ(xi|hi)
Update the generator by gradient ascent: θt+1 ← θt + η ∇γθt, with ∇γθt ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
d-step: train the discriminator to tell real (human) dialogues from fake (generated) ones.
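The whole alternating loop can be sketched as follows, assuming PyTorch and hypothetical generator/discriminator interfaces (sample_with_log_prob, score) and data samplers; this is a sketch of the scheme above, not the authors' implementation:

    import torch

    def train_step(generator, discriminator, sample_histories, sample_human_dialogues,
                   g_opt, d_opt, N=64):
        # ---- g-step: discriminator scores are used as rewards R(h_i, x_i) ----
        h = sample_histories(N)
        x, log_probs = generator.sample_with_log_prob(h)            # x_i ~ P_theta(x|h_i)
        rewards = discriminator.score(h, x).detach()                # R(h_i, x_i)
        g_loss = -((rewards - rewards.mean()) * log_probs).mean()   # baseline-subtracted objective
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # ---- d-step: real (human) dialogues vs. fake (generated) dialogues ----
        real_h, real_x = sample_human_dialogues(N)
        fake_x, _ = generator.sample_with_log_prob(real_h)
        real_score = discriminator.score(real_h, real_x)
        fake_score = discriminator.score(real_h, fake_x)
        d_loss = -(torch.log(real_score + 1e-8).mean()
                   + torch.log(1.0 - fake_score + 1e-8).mean())
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()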

Rewarding a sentence vs. a word Consider the example hi = "what is your name", xi = "I don't know". Then
log Pθ(xi|hi) = log Pθ(x1i|hi) + log Pθ(x2i|hi, x1i) + log Pθ(x3i|hi, x1:2i)
A sentence-level reward pushes down every word of a bad response, including the word "I"; but for a good response such as x = "I am Ming Li", the probability of "I" should go up. With many sentences to balance each other this is usually fine, but when there are not enough samples, we can instead give rewards at the word level.

Rewarding at the word level The reward at the sentence level was:
∇γθ ≈ (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
Changing to the word level:
∇γθ ≈ (1/N) Σi=1..N Σt=1..T (Q(hi, x1:ti) − b) ∇log Pθ(xti|hi, x1:t-1i)
How do we estimate Q? Monte Carlo.
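A sketch of the word-level objective, assuming PyTorch: log_probs[i, t] is log Pθ(xti|hi, x1:t-1i), q_values[i, t] is the estimated Q(hi, x1:ti) (e.g. from the Monte Carlo roll-outs on the next slide), mask[i, t] is 1 for real tokens and 0 for padding, and the constant baseline is a placeholder; all names are hypothetical.

    import torch

    def word_level_pg_loss(log_probs: torch.Tensor,
                           q_values: torch.Tensor,
                           mask: torch.Tensor,
                           baseline: float = 0.5) -> torch.Tensor:
        advantages = (q_values - baseline).detach()   # Q(h_i, x_{1:t}^i) - b
        weighted = advantages * log_probs * mask
        # Average over the N sampled (h_i, x_i) pairs and their words.
        return -weighted.sum() / mask.sum()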

Monte Carlo estimation of Q How do we estimate Q(hi, x1:ti)? E.g. Q("what is your name?", "I"): sample sentences starting with "I" using the current generator, and evaluate them with the discriminator:
xA = "I am Ming Li"    D(hi, xA) = 1.0
xB = "I am happy"      D(hi, xB) = 0.1
xC = "I don't know"    D(hi, xC) = 0.1
xD = "I am superman"   D(hi, xD) = 0.8
Q(hi, "I") = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
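A sketch of the roll-out estimate, with hypothetical generator/discriminator interfaces (rollout, score): the generator completes the partial response M times, and Q is the average discriminator score of the completed sentences.

    def monte_carlo_q(generator, discriminator, h, prefix, M=4):
        scores = []
        for _ in range(M):
            full_response = generator.rollout(h, prefix)    # complete prefix with current policy
            scores.append(discriminator.score(h, full_response))
        return sum(scores) / M                              # estimated Q(h, prefix)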

Chatbot experiments REINFORCE = SeqGAN with reinforcement learning at the sentence level. REGS Monte Carlo = SeqGAN with RL at the word level (Reward for Every Generation Step, estimated by Monte Carlo roll-outs).

Example Results (Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)