
1 Reinforcement Learning
Human Language Technologies Giuseppe Attardi

2 What is RL? An approach to learning through feedback from actions taken in an unknown environment

3 Reward Hypothesis All goals can be described by the maximization of the expected cumulative reward. The cumulative reward from time step t can be written as:
Gt = Rt+1 + Rt+2 + …
Rewards are discounted by a rate γ, to express how much the agent cares about short-term versus long-term rewards:
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + …
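The discounted return can be computed by folding the rewards backwards from the last step; a minimal sketch (function name and reward values are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # fold from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.9**2 * 2 = 2.62
```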

4 RL Agent Major components:
Policy: the agent's behavior function
Value function: how good each state and/or action is
Model: the agent's prediction/representation of the environment

5 Policy A function that maps states to actions:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(a|s)
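A tiny illustration of the two kinds of policy (3 states, 2 actions; all numbers are made up for the example):

```python
import numpy as np

# Deterministic policy as a lookup table a = pi(s), and a stochastic policy
# pi(a|s) = P(a|s) given as one probability row per state.
rng = np.random.default_rng(0)
policy_table = np.array([0, 1, 1])          # deterministic: one action per state
policy_probs = np.array([[0.9, 0.1],        # stochastic: row s gives P(a|s)
                         [0.5, 0.5],
                         [0.2, 0.8]])

def deterministic_policy(s):
    return int(policy_table[s])

def stochastic_policy(s):
    return int(rng.choice(2, p=policy_probs[s]))

print(deterministic_policy(0), stochastic_policy(2))
```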

6 Value Function The Q-value function gives the expected future total reward
from state s and action a under policy π with discount factor γ ∈ [0,1]:
Qπ(s, a) = E[Rt+1 + γ Rt+2 + γ² Rt+3 + … | s, a]
This can be written as a Bellman equation:
Qπ(s, a) = E[Rt+1 + γ Qπ(st+1, a′) | s, a]
where a′ is the next action chosen by π. A Bellman equation is a recursion for expected rewards; the optimal value function Q* satisfies the Bellman optimality equation:
Q*(s, a) = E[Rt+1 + γ maxa′ Q*(st+1, a′) | s, a]
which describes the reward for always taking the action with the highest expected return.
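When the model is known, the Bellman optimality equation can be applied as an iterative backup; a toy sketch on an assumed 2-state, 2-action MDP (all transition probabilities and rewards are made up for illustration):

```python
import numpy as np

# Repeated Bellman optimality backups (value iteration on Q):
# Q*(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q*(s', a')
gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                 # R[s, a] expected immediate rewards
              [1.0, 0.0]])
Q = np.zeros((2, 2))

for _ in range(100):                      # repeated backups converge to Q*
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)
```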

7 Three Approaches to RL
Value Based: the goal is to optimize the value function V(s).
Policy Based: the goal is to directly optimize the policy function π(s), without using a value function.
Model Based: we model the environment, i.e. we build a model of how the environment behaves.

8 Q-learning Q-Learning is a value-based reinforcement learning algorithm. It is model-free and only requires access to a set of samples, collected either online or offline. It uses dynamic programming, building a table (the Q-table) that stores accumulated estimates of costs/rewards.

9 Q-learning Algorithm Initialize Q(s, a) to arbitrary values
Repeat until done:
Choose an action a in the current state s based on the current estimates Q(s, ·)
Perform action a and observe the next state s′ and reward r
Update Q by Q(s, a) ← Q(s, a) + α [r + γ maxa′ Q(s′, a′) − Q(s, a)]
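A minimal tabular sketch of this loop. The environment interface is an assumption (not from the slides): `env.reset()` returns a state index and `env.step(a)` returns (next_state, reward, done); FrozenLake from Gym roughly fits this shape, but the exact API varies by version.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))        # the Q-table
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy choice of action a from the current estimates Q(s, .)
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```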

10 Q-table A table in which we accumulate the maximum expected future reward for each action at each state. Each entry is an estimate of the maximum expected future reward the agent will get by taking that action in that state and then following the best policy. No explicit policy is represented: we simply improve the Q-table and always choose the action with the highest value.

11 Example Q* learning with Frozen Lake See notebook:
Q Learning with FrozenLake.ipynb in the JupyterHub folder HLT/Deep_reinforcement_learning_Course/Q learning/

12 Deep RL Use a deep neural network, optimized by SGD, to approximate the:
Policy
Value function
Model

13 Deep Q-Learning Represent the value function by a Q-network with weights w:
Q(s, a, w) ≈ Q*(s, a)
Optimal Q-values obey the Bellman equation:
Q*(s, a) = E[R + γ maxa′ Q*(s′, a′) | s, a]
Treat the right-hand side as a target (in DQN it is computed by a separate target network); given a transition (s, a, r, s′), optimize the MSE loss via SGD:
loss = (r + γ maxa′ Q(s′, a′, w) − Q(s, a, w))²
Naive training with neural networks diverges due to:
Correlations between samples
Non-stationary targets
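A sketch of this loss in PyTorch (network shapes and names are assumptions): `q_net` and `target_net` are two copies of the same network; the target network's outputs are treated as constants and its weights are only copied from `q_net` periodically, which addresses the non-stationary-target problem.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch          # tensors: states, actions, rewards, next states, done flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a, w)
    with torch.no_grad():                  # targets are held fixed during the update
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```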

14 Which action to choose? A greedy policy is:
π′(s) = argmaxa Q*(s, a)
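With a learned Q-table this is one line; a minimal sketch (Q is assumed to be the table returned by the q_learning sketch above). During training an ε-greedy variant is commonly used to keep exploring.

```python
import numpy as np

def greedy_policy(Q, s):
    # pick the action with the highest estimated value in state s
    return int(np.argmax(Q[s]))
```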

15 RL in NLP
Article summarization
Question answering
Dialogue generation
Dialogue systems
Knowledge-based QA
Machine translation
Text generation

16 RL for Dialog Generation
Despite the success of Seq2Seq models in dialogue generation, two problems emerge:
Seq2Seq models are trained by predicting the next dialogue turn in a given conversational context using MLE. MLE is not well suited to teaching a machine to converse with humans while producing interesting, diverse, and informative responses that keep users engaged.
The system may get stuck in an infinite loop of repetitive responses.
See Li et al., Deep Reinforcement Learning for Dialogue Generation, EMNLP 2016.

17 Approach Simulation of two agents taking turns, exploring the state-action space and learning a policy:
Supervised learning for Seq2Seq models
Mutual Information for pretraining the policy model
Dialogue simulation between the two agents

18 Pointwise Mutual Information
Pointwise MI between the previous sequence S and the response T:
PMI(S, T) = log [ p(S, T) / (p(S) p(T)) ]
Objective: find
T̂ = argmaxT { log p(T|S) − λ log p(T) }
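In practice this objective can be used to rerank candidate responses; a minimal sketch (the two scoring functions are assumed placeholders for a trained Seq2Seq model and a language model):

```python
def mmi_rerank(source, candidates, log_p_t_given_s, log_p_t, lam=0.5):
    # pick the candidate T that maximizes log p(T|S) - lambda * log p(T)
    scores = [log_p_t_given_s(source, t) - lam * log_p_t(t) for t in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]
```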

19 Problems Two failure modes of the baseline mutual information model (Li et al. 2015):

A: Where are you going? (1)
B: I'm going to the restroom. (2)
A: See you later. (3)
B: See you later. (4)
A: See you later. (5)
B: See you later. (6)

A: How old are you? (1)
B: I'm 16. (2)
A: 16? (3)
B: I don't know what you are talking about. (4)
A: You don't know what you are saying. (5)
B: I don't know what you are talking about. (6)
A: You don't know what you are saying. (7)

To solve these, the model needs to:
integrate developer-defined rewards that better mimic the true goal of chatbot development
model the long-term influence of a generated response in an ongoing dialogue

20 Deep RL for Dialog Generation
Definitions:
Action: the space of actions is infinite, since arbitrary-length sequences can be generated.
State: a state is denoted by the previous two dialogue turns [pi, qi].
Reward: ease of answering, information flow and semantic coherence.

21 Ease of Answering Avoid producing an utterance that is likely to be followed by a dull response:
r1 = − (1/NS) Σs∈S (1/Ns) log pseq2seq(s | a)
where S is a list of dull responses such as "I don't know what you are talking about", "I have no idea", etc., NS is the number of dull responses and Ns is the number of tokens of s.
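A sketch of this reward (the scoring function is an assumed placeholder returning the Seq2Seq log-probability of generating a dull response s as a reply to the candidate response a):

```python
import numpy as np

DULL_RESPONSES = ["i don't know what you are talking about", "i have no idea"]

def ease_of_answering(a, log_p_seq2seq, dull=DULL_RESPONSES):
    # length-normalized log-probability of each dull follow-up, averaged over S
    scores = [log_p_seq2seq(s, a) / len(s.split()) for s in dull]
    return -float(np.mean(scores))   # unlikely dull follow-ups => higher reward
```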

22 Information Flow Penalize semantic similarity between consecutive turns from the same agent:
r2 = − log cos(hpi, hpi+1) = − log [ hpi · hpi+1 / (‖hpi‖ ‖hpi+1‖) ]
where hpi and hpi+1 denote the representations obtained from the encoder for two consecutive turns pi and pi+1.
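A sketch of this reward (h_prev and h_next are assumed to be encoder vectors of two consecutive turns by the same agent; the clamp is an added safeguard against taking the log of a non-positive cosine):

```python
import numpy as np

def information_flow(h_prev, h_next, eps=1e-8):
    cos = h_prev @ h_next / (np.linalg.norm(h_prev) * np.linalg.norm(h_next) + eps)
    return -float(np.log(max(cos, eps)))   # near-identical turns give reward close to 0
```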

23 Semantic Coherence Avoid situations in which the generated replies are highly rewarded but ungrammatical or incoherent:
r3 = (1/Na) log pseq2seq(a | pi, qi) + (1/Nqi) log pbackward seq2seq(qi | a)
i.e. the probability of generating response a given the previous dialogue utterances, plus the backward probability of generating the previous dialogue utterance qi given the response a.
The final reward for action a is a weighted sum of the rewards:
r(a, [pi, qi]) = λ1 r1 + λ2 r2 + λ3 r3
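A sketch of the combined reward using the component sketches above (log_p_fwd and log_p_bwd are assumed placeholders for the forward and backward Seq2Seq scorers; the λ weights are illustrative defaults, not values from the slides):

```python
def total_reward(a, p_i, q_i, h_prev, h_next, log_p_fwd, log_p_bwd,
                 lambdas=(0.25, 0.25, 0.5)):
    r1 = ease_of_answering(a, log_p_fwd)
    r2 = information_flow(h_prev, h_next)
    # length-normalized forward and backward log-probabilities
    r3 = (log_p_fwd(a, (p_i, q_i)) / len(a.split())
          + log_p_bwd(q_i, a) / len(q_i.split()))
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3
```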

24 Dialog Simulation between two agents
Using the simulated turns and reward, maximize the expected future reward.

25 Training Set 800k sequences from the OpenSubtitles dataset with the lowest likelihood of generating the response "I don't know what you are talking about"

26 Experimental Results

Average number of simulated turns:
Seq2Seq: 2.68
Mutual information: 3.40
RL: 4.48

Pairwise human judgments (RL wins / MI wins / Tie):
Single-turn general quality: 40 / 36 / 24
Single-turn ease of answer: 52 / 23 / 25
Multi-turn general quality: 72 / 12 / 16

Sample responses (Input → Mutual Information / RL):
How old are you? → I am 16. / I'm 16, why are you asking?
What's your full name? → I have no idea. / What's yours?
I don't want to go home tonight. → Really? / Why?
Do you have any feelings for me? → I don't know what you are talking about. / I don't want to hurt your feelings.
How much time do you have here? → Not long enough. Sorry, sir. / Ten seconds.
Shall we get started? → Of course! / Yes. We've got a lot of work to do here.
Do you play football? → No, i don't / Yes. I love football!
We'd have to talk to him. → I mean, he's a good guy / About what?
How come you never say it? → Because I don't want to hurt you. / I don't think it's a good idea to say it.

27 Conclusions Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. RL is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. RL addresses the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals. RL can be applied to NLP tasks, helping to find more diverse and coherent solutions driven by suitable rewards.

