
1 Lecture 10 SeqGAN, Chatbot, Reinforcement Learning

2 Based on the following two papers
L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017.
J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv preprint, v4, 2017.
Also based on H.-Y. Lee's lecture notes.

3 Maximizing Expected Reward
Setup: an encoder/generator (the chatbot) produces a response x to a history h, and a human, later replaced by a discriminator, assigns a reward R(h, x); we update the generator parameters θ.
We wish to maximize the expected reward:
θ* = arg maxθ γθ, where γθ = Σh P(h) Σx R(h,x) Pθ(x|h)
≈ (1/N) Σi=1..N R(hi, xi), sampling (h1, x1), …, (hN, xN).
But now how do we differentiate this sampled estimate with respect to θ?
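A minimal NumPy sketch of this sampling estimate (toy history/response spaces and a made-up reward table; all names are hypothetical, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 possible histories h, 4 possible responses x.
P_h = np.array([0.5, 0.3, 0.2])           # P(h), fixed by the data
theta = rng.normal(size=(3, 4))           # generator parameters
R = rng.uniform(size=(3, 4))              # reward table R(h, x)

def P_x_given_h(theta):
    """Softmax over responses for each history: P_theta(x|h)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P = P_x_given_h(theta)

# Exact expected reward: sum_h P(h) sum_x R(h,x) P_theta(x|h)
exact = (P_h[:, None] * R * P).sum()

# Sampling estimate: draw (h_i, x_i) pairs and average R(h_i, x_i)
N = 20_000
hs = rng.choice(3, size=N, p=P_h)
xs = np.array([rng.choice(4, p=P[h]) for h in hs])
estimate = R[hs, xs].mean()

print(exact, estimate)    # the two values should be close
```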

4 Policy gradient
Δγθ = Σh P(h) Σx R(h,x) ΔPθ(x|h)
= Σh P(h) Σx R(h,x) Pθ(x|h) ΔPθ(x|h) / Pθ(x|h)
= Σh P(h) Σx R(h,x) Pθ(x|h) Δlog Pθ(x|h)
≈ (1/N) Σi=1..N R(hi, xi) Δlog Pθ(xi|hi)   (by sampling)
But how do we compute this?
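A small numerical check of this log-derivative trick on a toy unconditional categorical generator (everything here is made up for illustration): the policy-gradient form Σx R(x) Pθ(x) Δlog Pθ(x) matches a finite-difference gradient of the expected reward.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=5)      # parameters of a categorical P_theta(x), 5 outcomes
R = rng.uniform(size=5)         # reward R(x) for each outcome

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Policy-gradient form: sum_x R(x) P_theta(x) grad log P_theta(x)
p = softmax(theta)
grad_log = np.eye(5) - p                    # row x holds grad_theta log P_theta(x)
policy_grad = (R * p) @ grad_log

# Finite-difference gradient of the expected reward sum_x R(x) P_theta(x)
eps = 1e-6
fd = np.zeros(5)
for k in range(5):
    d = np.zeros(5)
    d[k] = eps
    fd[k] = ((R * softmax(theta + d)).sum() - (R * softmax(theta - d)).sum()) / (2 * eps)

print(np.allclose(policy_grad, fd, atol=1e-6))   # True: the two gradients agree
```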

5 Policy gradient
Gradient ascent: θnew ← θold + η Δγθ_old, with
Δγθ ≈ (1/N) Σi=1..N R(hi, xi) Δlog Pθ(xi|hi)
Notes:
1. Without R(h,x), this is maximum likelihood.
2. Without R(h,x), we know how to do this.
3. To approximate this, we can: if R(hi,xi) = k, repeat (hi,xi) k times; if R(hi,xi) = -k, repeat (hi,xi) k times with learning rate -η.
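A PyTorch-style sketch of one such gradient-ascent step (the generator interface returning per-sample log Pθ(xi|hi) is an assumption, not from the slides); setting all rewards to 1 recovers ordinary maximum-likelihood training:

```python
import torch

def policy_gradient_step(generator, optimizer, histories, responses, rewards):
    """One ascent step on (1/N) sum_i R(h_i, x_i) log P_theta(x_i | h_i).

    Assumes `generator(histories, responses)` returns a tensor of shape (N,)
    with log P_theta(x_i | h_i) for each sampled pair (hypothetical interface).
    """
    log_probs = generator(histories, responses)
    # Maximizing the reward-weighted log-likelihood = minimizing its negative.
    loss = -(rewards * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```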

6 If R(hi,xi) is always positive:
Because Pθ(x|h) is a probability, it is normalized. [Figure: in the ideal case the probability of every (h, xj) rises in proportion to its reward; but due to sampling, a pair such as (h, x1) that happens not to be sampled has its probability pushed down even if it is good.]

7 Solution: subtract a baseline
If R(hi,xi) is always positive, we subtract a baseline b:
(1/N) Σi=1..N R(hi, xi) Δlog Pθ(xi|hi)  →  (1/N) Σi=1..N (R(hi, xi) - b) Δlog Pθ(xi|hi)
[Figure: after subtracting the baseline, only responses with above-baseline reward have Pθ(x|h) increased, so responses that were not sampled are no longer unfairly suppressed.]
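Continuing the sketch above, the baseline only shifts the weights on the log-probabilities; using the batch mean of the rewards as b is one simple choice (an assumption here, the slide does not prescribe how to pick b):

```python
# Inside policy_gradient_step, before forming the loss:
baseline = rewards.mean()           # simple batch-mean baseline b (one possible choice)
advantages = rewards - baseline     # R(h_i, x_i) - b
loss = -(advantages * log_probs).mean()
```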

8 Chatbot by SeqGAN
Let's replace the human with a discriminator, using the reward function
R(h,x) = λ1 r1(h,x) + λ2 r2(h,x) + λ3 r3(h,x),
where r1 encourages continuation, r2 rewards saying something new, and r3 rewards semantic coherence.
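A sketch of the combined reward; the individual reward functions and weights are placeholders supplied by the caller, not the papers' exact definitions:

```python
def combined_reward(h, x, r1, r2, r3, lambda1, lambda2, lambda3):
    """R(h, x) = lambda1*r1(h,x) + lambda2*r2(h,x) + lambda3*r3(h,x).

    r1: encourage continuation, r2: say something new, r3: semantic coherence.
    Each r_k is a callable (h, x) -> float; the lambdas weight the three terms.
    """
    return (lambda1 * r1(h, x)
            + lambda2 * r2(h, x)
            + lambda3 * r3(h, x))
```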

9 Chat-bot by conditional GAN
[Figure: the chatbot is an encoder-decoder (En-De) that maps an input sentence/history h to a response sentence x; the discriminator receives (h, x) and judges it real or fake, having been trained on real human dialogues.]

10 Can we do backpropagation?
[Figure: the chatbot (En-De) generates discrete tokens A, B, … starting from <BOS>; the discriminator returns a scalar score used to update the chatbot.]
The response is produced by sampling discrete words, so tuning the generator a little bit will not change the output, and the discriminator's scalar cannot be backpropagated through the discrete tokens.
Alternative: improved WGAN on the output word distributions (ignoring the sampling process).

11 SeqGAN solution, using RL
Use the output of the discriminator as the reward: update the generator to increase the discriminator score, i.e. to get maximum reward,
Δγθ ≈ (1/N) Σi=1..N R(hi, xi) Δlog Pθ(xi|hi), with R(hi, xi) = the discriminator score.
Different from typical RL: the discriminator itself is also updated.

12 g-step and d-step
g-step: sample (h1, x1), …, (hN, xN) from the generator, score them with the discriminator to get R(h1, x1), …, R(hN, xN), and maximize the new objective
(1/N) Σi=1..N R(hi, xi) log Pθ(xi|hi)
by θt+1 ← θt + η Δγθ_t, where Δγθ ≈ (1/N) Σi=1..N R(hi, xi) Δlog Pθ(xi|hi).
d-step: train the discriminator to separate real (human) responses from fake (generated) ones.
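A schematic PyTorch-style sketch of alternating the d-step and g-step (every helper and interface here is hypothetical):

```python
def train_seqgan(generator, discriminator, g_opt, d_opt,
                 human_dialogues, steps, sample_batch, d_loss_fn):
    """Alternate discriminator and generator updates (schematic sketch)."""
    for _ in range(steps):
        # d-step: real (h, x) pairs vs. pairs sampled from the generator
        h_real, x_real = sample_batch(human_dialogues)
        x_fake = generator.sample(h_real)
        d_loss = d_loss_fn(discriminator(h_real, x_real),   # should score high (real)
                           discriminator(h_real, x_fake))   # should score low (fake)
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # g-step: discriminator score is the reward for the policy gradient
        h, _ = sample_batch(human_dialogues)
        x = generator.sample(h)
        rewards = discriminator(h, x).detach()     # R(h_i, x_i) = discriminator score
        log_probs = generator.log_prob(h, x)       # log P_theta(x_i | h_i)
        g_loss = -(rewards * log_probs).mean()
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
```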

13 Rewarding a sentence vs. a word
Consider the example hi = "what is your name", xi = "I do not know". Then
log Pθ(xi|hi) = log Pθ(x1i|hi) + log Pθ(x2i|hi, x1i) + log Pθ(x3i|hi, x1:2i) + …
A sentence-level reward pushes every word of "I do not know" down equally, yet if x were "I am Ming Li" the probability of the word "I" should go up. With a lot of sentences to balance each other this is usually OK, but when there are not enough samples we can assign the reward at the word level instead, as in the toy example below.
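A toy illustration of the decomposition (the per-word probabilities are made up): the whole sentence receives a single reward weight, which every word term then shares.

```python
import math

# Hypothetical per-word probabilities for x = "I do not know" given h
p_words = {"I": 0.4, "do": 0.5, "not": 0.6, "know": 0.7}

# log P(x|h) = log P(x1|h) + log P(x2|h, x1) + ... : one term per word
log_p_sentence = sum(math.log(p) for p in p_words.values())
print(log_p_sentence)   # with a sentence-level reward, all four words share this weight
```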

14 Rewarding at word level
The reward at the sentence level was:
Δγθ ≈ (1/N) Σi=1..N (R(hi, xi) - b) Δlog Pθ(xi|hi)
Change it to the word level:
Δγθ ≈ (1/N) Σi=1..N Σt=1..T (Q(hi, x1:ti) - b) Δlog Pθ(xti | hi, x1:t-1i)
How do we estimate Q? Monte Carlo.
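A sketch of the word-level objective, assuming tensors log_probs[i, t] = log Pθ(xti | hi, x1:t-1i) and Q[i, t] = Q(hi, x1:ti) have already been computed (the shapes and the absence of padding are assumptions):

```python
import torch

def word_level_loss(log_probs, Q, baseline):
    """Negative of (1/N) sum_i sum_t (Q(h_i, x_{1:t}^i) - b) log P_theta(x_t^i | h_i, x_{1:t-1}^i).

    log_probs, Q: tensors of shape (N, T); baseline: the scalar b.
    """
    return -((Q - baseline) * log_probs).sum(dim=1).mean()
```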

15 Monte Carlo estimation of Q
How do we estimate Q(hi, x1:ti)? E.g. Q("what is your name?", "I"): sample sentences starting with "I" using the current generator, and use the discriminator to evaluate them:
xA = "I am Ming Li"    D(hi, xA) = 1.0
xB = "I am happy"      D(hi, xB) = 0.1
xC = "I don't know"    D(hi, xC) = 0.1
xD = "I am superman"   D(hi, xD) = 0.8
Q(hi, "I") = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
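A sketch of the Monte Carlo estimate using the slide's numbers: roll out complete sentences from the prefix with the current generator, score each with the discriminator, and average (the rollout and discriminator interfaces are assumptions):

```python
def mc_estimate_Q(h, prefix, generator, discriminator, n_rollouts=4):
    """Estimate Q(h, prefix) by averaging discriminator scores over sampled completions."""
    scores = []
    for _ in range(n_rollouts):
        x = generator.rollout(h, prefix)      # complete the sentence from the prefix
        scores.append(discriminator(h, x))    # hypothetical: returns a score in [0, 1]
    return sum(scores) / len(scores)

# Slide example: four rollouts from the prefix "I" scored 1.0, 0.1, 0.1 and 0.8,
# so Q("what is your name?", "I") ≈ (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
```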

16 Experiments of Chatbot
Reinforce = SeqGAN with reinforcement learning at the sentence level. REGS Monte Carlo = SeqGAN with RL at the word level.

17 Example results (Li et al. 2016)
(Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)

