1
Lecture 10: SeqGAN, Chatbot, Reinforcement Learning
2
Based on the following two papers:
L. Yu, W. Zhang, J. Wang, Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI, 2017.
J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky. Adversarial Learning for Neural Dialogue Generation. arXiv preprint, v4, 2017.
Also based on H.-Y. Lee's lecture notes.
3
Maximizing Expected Reward
Setup: an encoder–generator (a seq2seq chatbot) produces a response, and a human scores it, in place of a discriminator; the score is used to update θ. We wish to maximize the expected reward:
θ* = arg maxθ γθ, where
γθ = Σh P(h) Σx R(h,x) Pθ(x|h) ≈ (1/N) Σi=1..N R(hi,xi), by sampling (h1,x1), …, (hN,xN).
But now, how do we differentiate this sampled estimate?
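A minimal sketch of the sampling approximation above; `sample_pairs` and `R` are hypothetical stand-ins for drawing (hi, xi) from the current model and for the human (or, later, discriminator) reward:

```python
# Hypothetical names: `sample_pairs(N)` draws pairs (h_i, x_i) with
# x_i ~ P_theta(x | h_i); `R(h, x)` returns the scalar reward.
def expected_reward(sample_pairs, R, N=100):
    pairs = sample_pairs(N)
    return sum(R(h, x) for h, x in pairs) / N   # (1/N) * sum_i R(h_i, x_i)
```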
4
Policy gradient
∇γθ = Σh P(h) Σx R(h,x) ∇Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇Pθ(x|h) / Pθ(x|h)
    = Σh P(h) Σx R(h,x) Pθ(x|h) ∇log Pθ(x|h)
    ≈ (1/N) Σi=1..N R(hi,xi) ∇log Pθ(xi|hi)   (by sampling)
The key step is the log-derivative trick: since ∇log Pθ = ∇Pθ / Pθ, the gradient of the expected reward becomes an expectation we can estimate by sampling. But how do we turn this gradient into an update?
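As a sanity check, here is a tiny, purely illustrative numerical verification of the log-derivative identity ∇θP = P · ∇θ log P, for a two-way softmax policy (assumes PyTorch):

```python
import torch

theta = torch.tensor([0.3, -0.7], requires_grad=True)
p = torch.softmax(theta, dim=0)[0]                       # P_theta(x = 0)
(grad_p,) = torch.autograd.grad(p, theta, retain_graph=True)
(grad_logp,) = torch.autograd.grad(torch.log(p), theta)
print(torch.allclose(grad_p, p.detach() * grad_logp))    # True
```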
5
Policy gradient
Gradient ascent: θnew ← θold + η ∇γθ evaluated at θold
∇γθ ≈ (1/N) Σi=1..N R(hi,xi) ∇log Pθ(xi|hi)
Notes:
1. Without R(h,x), this is just maximum likelihood.
2. Without R(h,x), we already know how to do this.
3. To approximate this, we can: if R(hi,xi) = k, repeat (hi,xi) k times in the training data; if R(hi,xi) = −k, repeat (hi,xi) k times with learning rate −η.
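In code, this update is ordinary maximum likelihood with each sampled pair weighted by its reward. A minimal sketch, assuming a PyTorch-style seq2seq model with a hypothetical `sample_with_log_prob(h)` that returns a sampled response and its log Pθ(x|h):

```python
import torch

def reinforce_step(model, optimizer, histories, reward_fn, N=16):
    """One policy-gradient step: maximize (1/N) sum_i R(h_i,x_i) log P(x_i|h_i)."""
    optimizer.zero_grad()
    loss = 0.0
    for h in histories[:N]:
        x, log_prob = model.sample_with_log_prob(h)  # x ~ P_theta(x|h)
        R = reward_fn(h, x)                          # scalar reward R(h, x)
        loss = loss - R * log_prob                   # negate: optimizer minimizes
    (loss / N).backward()
    optimizer.step()                                 # theta <- theta + eta * grad
```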
6
If R(hi,xi) is always positive:
Because Pθ(·|h) is a probability distribution summing to 1 over x, this would still be fine in the ideal case: each response (h,x1), (h,x2), (h,x3) is pushed up in proportion to its reward, and normalization lowers the relative probability of the low-reward ones. Due to sampling, however, a response that happens not to be sampled, e.g. (h,x1), has its probability pushed down even if it is good, because every sampled response is pushed up.
7
Solution: subtract a baseline
If R(hi,xi) is always positive, we subtract a baseline b:
(1/N) Σi=1..N R(hi,xi) ∇log Pθ(xi|hi)  →  (1/N) Σi=1..N (R(hi,xi) − b) ∇log Pθ(xi|hi)
Now only responses with above-baseline reward have their probability Pθ(x|h) increased; sampled below-baseline responses are pushed down, so a good but unsampled response is no longer disadvantaged by every update.
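The same sketch with a moving-average baseline subtracted from the reward (hypothetical names as before; the momentum constant is illustrative):

```python
def reinforce_step_with_baseline(model, optimizer, histories, reward_fn,
                                 b=0.0, momentum=0.9, N=16):
    """Policy-gradient step weighting each sample by (R - b); returns new b."""
    optimizer.zero_grad()
    loss = 0.0
    for h in histories[:N]:
        x, log_prob = model.sample_with_log_prob(h)
        R = reward_fn(h, x)
        loss = loss - (R - b) * log_prob         # (R - b) * grad log P
        b = momentum * b + (1 - momentum) * R    # moving-average baseline
    (loss / N).backward()
    optimizer.step()
    return b
```

Each call returns the updated baseline, to be passed back in on the next step.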
8
Chatbot by SeqGAN
Let's replace the human with a discriminator, using a reward function of the form
R(h,x) = λ1 r1(h,x) + λ2 r2(h,x) + λ3 r3(h,x)
where r1 encourages continuation, r2 rewards saying something new, and r3 rewards semantic coherence.
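A sketch of the composite reward; the three scoring functions are placeholders named after the slide's descriptions, and the λ weights are illustrative:

```python
def total_reward(h, x, ease_of_answering, information_flow,
                 semantic_coherence, lambdas=(0.25, 0.25, 0.5)):
    l1, l2, l3 = lambdas
    return (l1 * ease_of_answering(h, x)      # r1: encourage continuation
            + l2 * information_flow(h, x)     # r2: say something new
            + l3 * semantic_coherence(h, x))  # r3: semantic coherence
```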
9
Chat-bot by conditional GAN
[Diagram] Generator: the chatbot, an encoder–decoder (En → De), maps the input sentence/history h to a response sentence x. Discriminator: takes the pair (h, x) and outputs real or fake, trained with human dialogues as the real examples.
10
Can we do backpropagation?
[Diagram] The chatbot (En → De) generates a response starting from <BOS>; the discriminator reads it and outputs a scalar score used to update the chatbot. The problem: the decoder emits discrete tokens (A, B, …) by sampling, so tuning the generator a little bit will not change the sampled output, and the scalar score cannot be backpropagated through the discrete tokens. Alternative: improved WGAN on the output distributions, ignoring the sampling process.
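A tiny PyTorch illustration of why sampling blocks the gradient:

```python
import torch

logits = torch.tensor([1.0, 2.0], requires_grad=True)
probs = torch.softmax(logits, dim=0)
token = torch.multinomial(probs, num_samples=1)  # discrete token index
# `token` is an integer tensor with no grad_fn: a discriminator score
# computed from the token cannot backpropagate into `logits`.
print(token.requires_grad)  # False
```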
11
SeqGAN solution, using RL
Use the output of the discriminator as the reward: update the generator to increase the discriminator's score, i.e. to get maximum reward:
∇γθ ≈ (1/N) Σi=1..N R(hi,xi) ∇log Pθ(xi|hi), with R(hi,xi) given by the discriminator score D(hi,xi).
Different from typical RL: the discriminator is also updated as training proceeds.
12
New objective: alternate a g-step and a d-step
g-step: sample (h1,x1), …, (hN,xN) from the generator and score each with the discriminator to get R(h1,x1), …, R(hN,xN); maximize the objective (1/N) Σi=1..N R(hi,xi) log Pθ(xi|hi) via the update θt+1 ← θt + η ∇γθt, with ∇γθt ≈ (1/N) Σi=1..N R(hi,xi) ∇log Pθ(xi|hi).
d-step: retrain the discriminator, with generated responses labeled fake and human responses labeled real.
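Putting the two steps together, a minimal sketch of the alternating loop, reusing `reinforce_step` from above (`discriminator.score`, `generator.sample_with_log_prob`, and the data structures are assumed interfaces, not from the slides):

```python
import random
import torch

def train_seqgan(generator, discriminator, g_opt, d_opt,
                 human_pairs, histories, steps=1000, N=16):
    for _ in range(steps):
        # g-step: policy gradient with the discriminator score as reward
        reinforce_step(generator, g_opt, histories,
                       reward_fn=lambda h, x: discriminator.score(h, x).item(),
                       N=N)
        # d-step: real = human dialogue pair, fake = generator sample
        d_opt.zero_grad()
        h, x_real = random.choice(human_pairs)
        x_fake, _ = generator.sample_with_log_prob(h)
        d_loss = -(torch.log(discriminator.score(h, x_real))
                   + torch.log(1 - discriminator.score(h, x_fake)))
        d_loss.backward()
        d_opt.step()
```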
13
Rewarding a sentence vs. rewarding a word
Consider the example: hi = "What is your name?", xi = "I don't know".
Then log Pθ(xi|hi) = log Pθ(x1i|hi) + log Pθ(x2i|hi,x1i) + log Pθ(x3i|hi,x1:2i).
A sentence-level reward penalizes every word of "I don't know", including "I"; but if x = "I am Ming Li", the probability of the word "I" should go up. With a lot of sentences this usually balances out, but when there are not enough samples, we can assign the reward at the word level instead.
14
Rewarding at the word level
The reward at the sentence level was:
∇γθ ≈ (1/N) Σi=1..N (R(hi,xi) − b) ∇log Pθ(xi|hi)
Change it to the word level:
∇γθ ≈ (1/N) Σi=1..N Σt=1..T (Q(hi,x1:ti) − b) ∇log Pθ(xti|hi,x1:t−1i)
How do we estimate Q? By Monte Carlo.
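A word-level sketch (hypothetical interfaces again; `estimate_q` is constructed in the Monte Carlo section below):

```python
def word_level_step(model, optimizer, histories, estimate_q, b=0.0, N=8):
    """Weight each token's log-probability by its own Q estimate minus b."""
    optimizer.zero_grad()
    loss = 0.0
    for h in histories[:N]:
        x, token_log_probs = model.sample_with_token_log_probs(h)
        for t, log_p in enumerate(token_log_probs, start=1):
            loss = loss - (estimate_q(h, x[:t]) - b) * log_p  # (Q - b) weight
    (loss / N).backward()
    optimizer.step()
```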
15
Monte Carlo estimation of Q
How do we estimate Q(hi,x1:ti), e.g. Q("What is your name?", "I")? Sample sentences starting with "I" using the current generator, and use the discriminator to evaluate them:
xA = "I am Ming Li"    D(hi,xA) = 1.0
xB = "I am happy"      D(hi,xB) = 0.1
xC = "I don't know"    D(hi,xC) = 0.1
xD = "I am superman"   D(hi,xD) = 0.8
Q(hi,"I") = (1.0 + 0.1 + 0.1 + 0.8) / 4 = 0.5
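A rollout-based estimator matching this example; `generator.rollout` (completes a given prefix) and `discriminator.score` are assumed interfaces:

```python
def make_q_estimator(generator, discriminator, M=4):
    """Build estimate_q(h, prefix) for the word-level step above."""
    def estimate_q(h, prefix):
        scores = []
        for _ in range(M):
            x = generator.rollout(h, prefix)          # finish the sentence
            scores.append(discriminator.score(h, x))
        return sum(scores) / M                        # e.g. Q(h, "I") = 0.5
    return estimate_q
```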
16
Chatbot experiments
REINFORCE = SeqGAN with sentence-level reinforcement learning. REGS Monte Carlo = SeqGAN with RL at the word level (REGS: reward for every generation step).
17
Example results from Li et al. 2016
(Li, Monroe, Ritter, Galley, Gao, Jurafsky, EMNLP 2016)