Lecture 10 SeqGAN, Chatbot, Reinforcement Learning


Based on the following two papers, and on H.-Y. Lee's lecture notes:
L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017.
J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv:1701.06547v4, 2017.

Maximizing Expected Reward The chatbot (an encoder-generator) produces a response x to an input history h; a human, in place of a discriminator, assigns a reward R(h,x), and we update θ. We wish to maximize the expected reward: θ* = arg maxθ γθ, where γθ = Σh P(h) Σx R(h,x) Pθ(x|h) ≈ (1/N) Σi=1..N R(hi, xi), estimated by sampling (h1,x1), …, (hN,xN). But now, how do we differentiate this with respect to θ?
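The sampling approximation above reads directly as code. A minimal sketch, assuming hypothetical sample_history, sample_response, and reward functions standing in for P(h), Pθ(x|h), and the human (or discriminator) reward R(h,x):

    def estimate_expected_reward(sample_history, sample_response, reward, N=1000):
        # gamma_theta = sum_h P(h) sum_x R(h,x) P_theta(x|h)
        #            ~= (1/N) sum_i R(h_i, x_i), with h_i ~ P(h), x_i ~ P_theta(x|h_i)
        total = 0.0
        for _ in range(N):
            h = sample_history()       # h_i ~ P(h)
            x = sample_response(h)     # x_i ~ P_theta(x|h_i)
            total += reward(h, x)      # R(h_i, x_i)
        return total / N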

Policy gradient
∇γθ = Σh P(h) Σx R(h,x) ∇Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇Pθ(x|h) / Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇log Pθ(x|h)
    ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)   (sampling)
But how do we compute this in practice?
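With automatic differentiation, the sampled gradient corresponds to minimizing the negative reward-weighted log-likelihood. A minimal REINFORCE-style sketch, assuming PyTorch, where log_probs holds log Pθ(xi|hi) on the computation graph and rewards holds the (constant) R(hi, xi):

    import torch

    def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
        # Minimizing -(1/N) sum_i R(h_i,x_i) log P_theta(x_i|h_i) is gradient ascent
        # on gamma_theta; rewards are detached so no gradient flows through R.
        return -(rewards.detach() * log_probs).mean()

    # Usage sketch:
    #   loss = policy_gradient_loss(log_probs, rewards)
    #   loss.backward()    # gives -(1/N) sum_i R(h_i,x_i) grad log P_theta(x_i|h_i)
    #   optimizer.step()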

Policy gradient
Gradient ascent: θnew ← θold + η ∇γθold, with
∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
Note: 1. Without the weight R(h,x), this is just maximum likelihood. 2. Without R(h,x), we already know how to do this. 3. To approximate the weighted update, we can: if R(hi,xi) = k, repeat (hi,xi) k times in the training data; if R(hi,xi) = −k, repeat (hi,xi) k times with learning rate −η.
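Note 3 can be implemented by replicating each sampled pair according to its (integer) reward and then training with ordinary maximum likelihood. A small sketch, with hypothetical data structures:

    def build_weighted_dataset(samples):
        # samples: list of (h, x, reward) triples with integer rewards.
        positive, negative = [], []
        for h, x, r in samples:
            bucket = positive if r >= 0 else negative
            bucket.extend([(h, x)] * abs(int(r)))   # repeat (h_i, x_i) |r| times
        # Train with maximum likelihood on `positive` (learning rate +eta)
        # and on `negative` with learning rate -eta.
        return positive, negative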

If R(hi,xi) is always positive: because Pθ(x|h) is a probability distribution (it sums to 1 over x), every sampled response has its probability pushed up. In the ideal case, where all responses (h,x1), (h,x2), (h,x3) are covered, this is fine: responses with larger reward are pushed up more. But due to sampling, a response that is never sampled, e.g. (h,x1), has its probability pushed down even if it is a good response.
[Figure: bar charts of Pθ(x|h) over (h,x1), (h,x2), (h,x3), ideal case vs. the case where (h,x1) is not sampled.]

Solution: subtract a baseline If R(hi,xi) is always positive, we subtract a baseline b:
(1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)  →  (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
Now only responses with reward above the baseline have their probability pushed up, so responses that were not sampled are no longer unfairly suppressed.
[Figure: Pθ(x|h) over (h,x1), (h,x2), (h,x3), before and after subtracting the baseline.]
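The earlier loss sketch extends directly. Here the baseline b is taken to be the batch-average reward, one common choice (the papers may use a different baseline); assumed PyTorch:

    import torch

    def policy_gradient_loss_with_baseline(log_probs: torch.Tensor,
                                           rewards: torch.Tensor) -> torch.Tensor:
        b = rewards.mean()                       # baseline b
        advantages = (rewards - b).detach()      # R(h_i, x_i) - b
        return -(advantages * log_probs).mean()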

Chatbot by SeqGAN Let's replace the human by a discriminator with reward function
R(h,x) = λ1 r1(h,x) + λ2 r2(h,x) + λ3 r3(h,x),
where r1 encourages continuation, r2 encourages saying something new, and r3 encourages semantic coherence.

Chatbot by conditional GAN
[Image: http://www.nipic.com/show/3/83/3936650kd7476069.html]
Generator: the chatbot, an encoder-decoder (En-De), takes the input sentence/history h and produces a response sentence x.
Discriminator: takes the input sentence/history h together with a response sentence x and judges real or fake, trained against human dialogues.

Can we do backpropagation? The chatbot (encoder-decoder) generates its response word by word, starting from <BOS> (e.g. "A", "A", "B"), and the discriminator's encoder maps (h, x) to a scalar used to update the chatbot. The problem: the output words are obtained by sampling discrete tokens, so tuning the generator a little bit will not change the output, and gradients cannot flow back through this step. One alternative that ignores the sampling process is an improved WGAN on the word distributions.
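A tiny illustration of where the gradient stops, assuming PyTorch: the sampled token is a discrete index and carries no gradient back to the word scores.

    import torch

    logits = torch.randn(1, 5, requires_grad=True)     # scores over a toy 5-word vocabulary
    probs = torch.softmax(logits, dim=-1)               # differentiable word distribution
    token = torch.multinomial(probs, num_samples=1)     # discrete sample, e.g. tensor([[3]])
    print(token.requires_grad)                          # False: no gradient through the sample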

SeqGAN solution, using RL Use the output of the discriminator as the reward, and update the generator to increase the discriminator score, i.e. to get maximum reward:
∇γθ ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi), where R(hi, xi) is the discriminator's score.
Different from typical RL: the discriminator is itself updated during training.

Alternating g-steps and d-steps
g-step: sample (h1,x1), …, (hN,xN) with the current generator and obtain rewards R(h1,x1), …, R(hN,xN) from the discriminator. New objective:
(1/N) Σi=1..N R(hi, xi) log Pθ(xi|hi)
Update the generator by gradient ascent: θt+1 ← θt + η ∇γθt, with ∇γθt ≈ (1/N) Σi=1..N R(hi, xi) ∇log Pθ(xi|hi)
d-step: train the discriminator to tell real (human) dialogues from fake (generated) ones.
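The whole alternating loop can be sketched as follows, assuming PyTorch and hypothetical generator/discriminator interfaces (sample_with_log_prob, score) and data samplers; this is a sketch of the scheme above, not the authors' implementation:

    import torch

    def train_step(generator, discriminator, sample_histories, sample_human_dialogues,
                   g_opt, d_opt, N=64):
        # ---- g-step: discriminator scores are used as rewards R(h_i, x_i) ----
        h = sample_histories(N)
        x, log_probs = generator.sample_with_log_prob(h)            # x_i ~ P_theta(x|h_i)
        rewards = discriminator.score(h, x).detach()                # R(h_i, x_i)
        g_loss = -((rewards - rewards.mean()) * log_probs).mean()   # baseline-subtracted objective
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

        # ---- d-step: real (human) dialogues vs. fake (generated) dialogues ----
        real_h, real_x = sample_human_dialogues(N)
        fake_x, _ = generator.sample_with_log_prob(real_h)
        real_score = discriminator.score(real_h, real_x)
        fake_score = discriminator.score(real_h, fake_x)
        d_loss = -(torch.log(real_score + 1e-8).mean()
                   + torch.log(1.0 - fake_score + 1e-8).mean())
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()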

Rewarding a sentence vs. a word Consider the example hi = "what is your name", xi = "I don't know". Then
log Pθ(xi|hi) = log Pθ(x1i|hi) + log Pθ(x2i|hi, x1i) + log Pθ(x3i|hi, x1:2i)
A sentence-level reward pushes down every word of a bad response, including the word "I"; but for a good response such as x = "I am Ming Li", the probability of "I" should go up. With many sentences to balance each other this is usually fine, but when there are not enough samples, we can instead give rewards at the word level.

Rewarding at the word level The reward at the sentence level was:
∇γθ ≈ (1/N) Σi=1..N (R(hi, xi) − b) ∇log Pθ(xi|hi)
Changing to the word level:
∇γθ ≈ (1/N) Σi=1..N Σt=1..T (Q(hi, x1:ti) − b) ∇log Pθ(xti|hi, x1:t-1i)
How do we estimate Q? Monte Carlo.
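A sketch of the word-level objective, assuming PyTorch: log_probs[i, t] is log Pθ(xti|hi, x1:t-1i), q_values[i, t] is the estimated Q(hi, x1:ti) (e.g. from the Monte Carlo roll-outs on the next slide), mask[i, t] is 1 for real tokens and 0 for padding, and the constant baseline is a placeholder; all names are hypothetical.

    import torch

    def word_level_pg_loss(log_probs: torch.Tensor,
                           q_values: torch.Tensor,
                           mask: torch.Tensor,
                           baseline: float = 0.5) -> torch.Tensor:
        advantages = (q_values - baseline).detach()   # Q(h_i, x_{1:t}^i) - b
        weighted = advantages * log_probs * mask
        # Average over the N sampled (h_i, x_i) pairs and their words.
        return -weighted.sum() / mask.sum()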

Monte Carlo estimation of Q How do we estimate Q(hi, x1:ti)? E.g. Q("what is your name?", "I"): sample sentences starting with "I" using the current generator, and evaluate them with the discriminator:
xA = "I am Ming Li"    D(hi, xA) = 1.0
xB = "I am happy"      D(hi, xB) = 0.1
xC = "I don't know"    D(hi, xC) = 0.1
xD = "I am superman"   D(hi, xD) = 0.8
Q(hi, "I") = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
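A sketch of the roll-out estimate, with hypothetical generator/discriminator interfaces (rollout, score): the generator completes the partial response M times, and Q is the average discriminator score of the completed sentences.

    def monte_carlo_q(generator, discriminator, h, prefix, M=4):
        scores = []
        for _ in range(M):
            full_response = generator.rollout(h, prefix)    # complete prefix with current policy
            scores.append(discriminator.score(h, full_response))
        return sum(scores) / M                              # estimated Q(h, prefix)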

Chatbot experiments REINFORCE = SeqGAN with reinforcement learning at the sentence level. REGS Monte Carlo = SeqGAN with RL at the word level (Reward for Every Generation Step, estimated by Monte Carlo roll-outs).

Example Results (Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)