Reinforcement Learning


Reinforcement Learning Human Language Technologies Giuseppe Attardi

What is RL? An approach to learning how to act through the feedback obtained from actions taken in an unknown environment.

Reward Hypothesis All goals can be described as the maximization of the expected cumulative reward. The cumulative reward (return) at time step t can be written as: G_t = R_{t+1} + R_{t+2} + … Rewards are discounted by a rate γ, which expresses how much the agent cares about long-term versus short-term rewards: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
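For concreteness, a minimal Python sketch (not from the slides) of the discounted return:

```python
# Minimal sketch (not from the slides): discounted return
# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # fold in the future, discounting at each step
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9**3 * 10 = 8.29
```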

RL Agent Major components: Policy: the agent’s behaviour function. Value function: how good each state and/or action is. Model: the agent’s representation/prediction of the environment.

Policy A function that maps states to actions: Deterministic policy: a = π(s) Stochastic policy: π(a|s) = P(a|s)
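As a small illustration (a sketch with made-up states and actions, not from the course material), a deterministic policy can be a plain lookup table, while a stochastic policy stores a distribution over actions for each state:

```python
import random

# Sketch with made-up states/actions: a deterministic policy is a lookup table,
# a stochastic policy is a distribution over actions for each state.
deterministic_pi = {"s0": "left", "s1": "right"}        # a = pi(s)

stochastic_pi = {                                        # pi(a|s) = P(a|s)
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_pi["s0"], sample_action("s1"))
```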

Value Function The Q-value function gives the expected total future reward from state s and action a under policy π, with discount factor γ ∈ [0,1]:
Q^π(s, a) = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | s, a]
which can be written as a Bellman equation:
Q^π(s, a) = E[R_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s, a]
A Bellman equation is a recursion over expected rewards; the optimal value function Q* satisfies the Bellman optimality equation:
Q*(s, a) = E[R_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s, a]
which describes the reward obtained by always taking the action with the highest expected return.
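As a small numeric illustration (a sketch on a made-up 2-state, 2-action MDP, not from the slides), the Bellman optimality equation can be treated as a fixed point and iterated until Q converges:

```python
import numpy as np

# Tiny illustration: iterate the Bellman optimality equation on a toy MDP.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                 # R[s, a] expected immediate reward
              [0.0, 2.0]])

Q = np.zeros((n_states, n_actions))
for _ in range(200):
    # Q*(s,a) = E[ R + gamma * max_a' Q*(s', a') ]
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)
```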

Three Approaches to RL
Value Based: the goal is to optimize the value function V(s).
Policy Based: the goal is to directly optimize the policy function π(s), without using a value function.
Model Based: the goal is to model the environment, i.e. to build a model of how the environment behaves.

Q-learning Q-Learning is a value-based reinforcement learning algorithm. It is model-free: it only requires access to a set of samples, collected either online or offline. It uses dynamic programming, building a table (the Q-table) that stores accumulated estimates of the expected rewards.

Q-learning Algorithm
Initialize Q(s, a) to arbitrary values
Repeat until done:
  Choose an action a in the current state s based on the current estimates Q(s, ·)
  Perform action a and observe the outcome state s' and reward r
  Update Q by: Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
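A minimal tabular sketch of this update in Python/NumPy (the ε-greedy action choice is a common heuristic assumed here, not something the slide specifies):

```python
import numpy as np

# Minimal tabular Q-learning sketch; states and actions are assumed discrete.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, epsilon=0.1):
    # explore with probability epsilon, otherwise act greedily w.r.t. Q
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```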

Q-table A table in which we store the estimated maximum expected future reward for each action at each state. Each Q-table entry is the maximum expected future reward obtained by taking that action in that state and then following the best policy. No explicit policy is learned: we simply keep improving the Q-table and always choose the action with the highest value.

Example Q* learning with Frozen Lake See notebook: Q Learning with FrozenLake.ipynb In Jupyterhub folder HLT/ Deep_reinforcement_learning_Course/Q learning/
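A rough training loop in the spirit of that notebook might look as follows (a sketch reusing the helpers above and assuming the gymnasium package with its FrozenLake-v1 environment; the notebook’s actual code may differ):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon=0.1)
        s_next, r, terminated, truncated, _ = env.step(a)
        q_learning_update(Q, s, a, r, s_next)
        s = s_next
        done = terminated or truncated
```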

Deep RL Use a deep neural network to approximate: the policy, the value function, the model. Optimized by SGD.

Deep Q-Learning Represent the value function by a Q-network with weights w: Q(s, a, w) ≈ Q*(s, a). The optimal Q-values obey the Bellman equation: Q*(s, a) = E[r + γ max_{a'} Q*(s', a') | s, a]. Treat the right-hand side as a target, computed with a separate target network with weights w⁻; given a transition (s, a, r, s'), optimize the MSE loss via SGD:
loss = (r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w))²
Naively, this diverges when using neural networks, due to:
Correlations between samples
Non-stationary targets
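A minimal sketch of this loss in PyTorch (q_net and target_net are assumed to be networks mapping a batch of states to per-action Q-values; this is an illustration, not the reference DQN implementation):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a, w): value of the action actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # fixed target network gives r + gamma * max_a' Q(s', a', w^-)
        # dones is a 0/1 float mask: no bootstrapping from terminal states
        target = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_sa, target)
```

In practice the two issues above are mitigated by sampling transitions from an experience replay buffer (reducing correlations between samples) and by refreshing target_net only periodically (keeping the targets stable for a while).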

Which action to choose? A greedy policy is: π'(s) = argmax_a Q*(s, a)

RL in NLP
Article summarization
Question answering
Dialogue generation
Dialogue systems
Knowledge-based QA
Machine translation
Text generation

RL for Dialog Generation Despite the success of Seq2Seq models in dialogue generation, two problems emerge: Seq2Seq models are trained to predict the next dialogue turn in a given conversational context using MLE, but MLE is not a good proxy for teaching a machine to converse with humans while providing interesting, diverse, and informative responses that keep users engaged; moreover, the system may get stuck in an infinite loop of repetitive responses. See Li et al. 2016. Deep Reinforcement Learning for Dialogue Generation. EMNLP 2016. https://aclweb.org/anthology/D16-1127

Approach Simulate two agents taking turns, so as to explore the state-action space and learn a policy. The steps are:
Supervised learning for the Seq2Seq models
Mutual Information for pretraining the policy model
Dialogue simulation between the two agents

Pointwise Mutual Information Pointwise MI between the previous sequence S and the response T:
PMI(S, T) = log [ p(S, T) / (p(S) p(T)) ]
Objective: find
T̂ = argmax_T { log p(T|S) − λ log p(T) }
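A sketch of how this objective could be used to rerank candidate responses (log_p_t_given_s and log_p_t are hypothetical placeholders for the forward Seq2Seq score and a language-model score):

```python
# Hypothetical sketch: rerank candidate responses by the MMI objective
# argmax_T { log p(T|S) - lambda * log p(T) }.
# log_p_t_given_s and log_p_t are placeholder scoring functions (assumptions).
def mmi_rerank(source, candidates, log_p_t_given_s, log_p_t, lam=0.5):
    return max(candidates,
               key=lambda t: log_p_t_given_s(t, source) - lam * log_p_t(t))
```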

Problems Examples from the baseline mutual information model (Li et al. 2015):
A: Where are you going? (1)
B: I’m going to the restroom. (2)
A: See you later. (3)
B: See you later. (4)
A: See you later. (5)
B: See you later. (6)

A: How old are you? (1)
B: I’m 16. (2)
A: 16? (3)
B: I don’t know what you are talking about. (4)
A: You don’t know what you are saying. (5)
B: I don’t know what you are talking about. (6)
A: You don’t know what you are saying. (7)
…
To solve these problems, the model needs to:
integrate developer-defined rewards that better mimic the true goal of chatbot development
model the long-term influence of a generated response on the ongoing dialogue

Deep RL for Dialog Generation Definitions: Action: a generated response; the action space is infinite, since arbitrary-length sequences can be generated. State: the previous two dialogue turns [p_i, q_i]. Reward: Ease of Answering, Information Flow and Semantic Coherence.

Ease of Answering Avoid utterances that are likely to be followed by a dull response:
r1 = − (1/N_S) Σ_{s∈S} (1/N_s) log p_seq2seq(s | a)
where S is a list of dull responses such as “I don’t know what you are talking about”, “I have no idea”, etc., N_S is its size, and N_s is the number of tokens of s.

Information Flow Penalize semantic similarity between consecutive turns from the same agent:
r2 = − log cos(h_{p_i}, h_{p_{i+1}}) = − log [ (h_{p_i} · h_{p_{i+1}}) / (‖h_{p_i}‖ ‖h_{p_{i+1}}‖) ]
where h_{p_i} and h_{p_{i+1}} denote the encoder representations of two consecutive turns p_i and p_{i+1}.

Semantic Coherence Avoid situations in which the generated replies are highly rewarded but ungrammatical or incoherent:
r3 = (1/N_a) log p_seq2seq(a | p_i, q_i) + (1/N_{q_i}) log p_seq2seq^backward(q_i | a)
where the first term is the probability of generating response a given the previous dialogue utterances, and the second is the backward probability of generating the previous utterance q_i given the response a.
The final reward for action a is a weighted sum of the rewards:
r(a, [p_i, q_i]) = λ1 r1 + λ2 r2 + λ3 r3
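Putting the three rewards together, a sketch of the combined reward (seq2seq_logprob, backward_logprob and the encoder vectors h_prev, h_curr are hypothetical stand-ins for the trained models; the weights are illustrative defaults):

```python
import numpy as np

# Sketch of the combined reward. seq2seq_logprob(y, given=x) and
# backward_logprob(y, given=x) are hypothetical stand-ins for the forward and
# backward Seq2Seq log-probabilities; h_prev and h_curr are encoder vectors.
DULL_RESPONSES = ["I don't know what you are talking about.", "I have no idea."]

def r1_ease_of_answering(action, seq2seq_logprob):
    # penalize actions that are likely to be followed by a dull response
    return -np.mean([seq2seq_logprob(s, given=action) / len(s.split())
                     for s in DULL_RESPONSES])

def r2_information_flow(h_prev, h_curr):
    # penalize semantic similarity between consecutive turns of the same agent
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    return -np.log(cos)

def r3_semantic_coherence(action, p_i, q_i, seq2seq_logprob, backward_logprob):
    forward = seq2seq_logprob(action, given=(p_i, q_i)) / len(action.split())
    backward = backward_logprob(q_i, given=action) / len(q_i.split())
    return forward + backward

def reward(action, p_i, q_i, h_prev, h_curr,
           seq2seq_logprob, backward_logprob, lambdas=(0.25, 0.25, 0.5)):
    # r(a, [p_i, q_i]) = lambda1*r1 + lambda2*r2 + lambda3*r3
    return (lambdas[0] * r1_ease_of_answering(action, seq2seq_logprob)
            + lambdas[1] * r2_information_flow(h_prev, h_curr)
            + lambdas[2] * r3_semantic_coherence(action, p_i, q_i,
                                                 seq2seq_logprob, backward_logprob))
```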

Dialog Simulation between two agents Using the simulated turns and the rewards above, train the policy to maximize the expected future reward.

Training Set 800k sequences from the OpenSubtitles dataset with the lowest likelihood of generating the dull response “I don’t know what you are talking about”.

Experimental Results

Average number of simulated turns:
Model | Avg. # simulated turns
Seq2Seq | 2.68
Mutual information | 3.40
RL | 4.48

Pairwise human judgments:
Setting | RL wins | MI wins | Tie
Single-turn general quality | 40 | 36 | 24
Single-turn ease of answering | 52 | 23 | 25
Multi-turn general quality | 72 | 12 | 16

Sample responses:
Input Message | Mutual Information | RL
How old are you? | I am 16. | I’m 16, why are you asking?
What’s your full name? | I have no idea. | What’s yours?
I don’t want to go home tonight. | Really? | Why?
Do you have any feelings for me? | I don’t know what you are talking about. | I don’t want to hurt your feelings.
How much time do you have here? | Not long enough. Sorry, sir. | Ten seconds.
Shall we get started? | Of course! | Yes. We’ve got a lot of work to do here.
Do you play football? | No, i don’t | Yes. I love football!
We’d have to talk to him. | I mean, he’s a good guy | About what?
How come you never say it? | Because I don’t want to hurt you. | I don’t think it’s a good idea to say it.

Conclusions Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. RL is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. RL addresses the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals. RL can be applied to NLP tasks, helping to find more diverse and coherent solutions, driven by suitable rewards.