Eick: Reinforcement Learning. Topic 18: Reinforcement Learning


Topic 18: Reinforcement Learning
1. Introduction
2. Bellman Update
3. Temporal Difference Learning
4. Discussion of Project 1
5. Active Reinforcement Learning
6. Applications
7. Summary
Road map, coverage in 2014: this slideshow (Topic18b'.pptx); Sutton video (maybe: Introduction to RL); Q&A: Kaelbling paper; maybe a video: Helicopter Control talk at ICML.

Introduction
Supervised Learning: Example → Class
Reinforcement Learning: Situation → Reward …

Examples
Playing chess: the reward comes at the end of the game.
Ping-pong: a reward on each point scored.
Animals: hunger and pain are negative rewards; food intake is a positive reward.

Framework: Agent in State Space
Example: XYZ-World
[Figure: a state-space diagram with numbered states, rewards such as R=+5 and R=+4 as well as negative rewards, deterministic operators n, s, e, w, ne, nw, sw, and a stochastic operator x with outcome probabilities 0.7/0.3; the layout is not recoverable from the transcript.]
Problem: What actions should an agent choose to maximize its rewards?
Remark: there are no terminal states.

XYZ-World: Discussion Problem 12
[Figure: the XYZ-World diagram annotated with (Bellman, TD) utility-value pairs for policy P, e.g. (3.3, 0.5), (3.2, -0.5), (0.6, -0.2); the full layout is not recoverable from the transcript.]
Explanation of discrepancies between TD and Bellman for P:
The most significant discrepancies are in states 3 and 8; there is a minor one in state 10.
P chooses the worst successor of 8; it should apply operator x instead.
P should apply w in state 6, but does so only in 2/3 of the cases, which affects the utility of state 3.
The low utility value of state 8 in TD seems to lower the utility value of state 10 → only a minor discrepancy.
I tried hard, but: any better explanations?

XYZ-World: Discussion Problem 12, Bellman Update with γ=0.2
[Figure: the XYZ-World diagram annotated with the utility values computed by the Bellman update; the values are not recoverable from the transcript.]
Discussion on using the Bellman update for Problem 12:
No convergence for γ=1.0; the utility values seem to run away!
State 3 has utility 0.58 although it gives a reward of +5, due to the immediate penalty that follows; we were able to detect that.
Did anybody run the algorithm for other γ values, e.g. 0.4 or 0.6? If yes, did it converge to the same values?
The speed of convergence seems to depend on the value of γ.

XYZ-World: Discussion Problem 12, TD and TD with inversed rewards
[Figure: the XYZ-World diagram annotated with (TD, TD with inversed R) utility-value pairs for policy P, e.g. (0.57, -0.65), (-0.50, 0.47), (-0.18, -0.12), (2.98, -2.99); the full layout is not recoverable from the transcript.]
Other observations:
The Bellman update did not converge for γ=1.
The Bellman update converged very fast for γ=0.2.
Did anybody try other values for γ (e.g. 0.6)?
The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is … optimal?
TD reversed the utility values quite neatly when the rewards were inversed; x became −x+ε with ε ∈ [−0.08, 0.08].

XYZ-World: Other Considerations
R(s) might be known in advance or might have to be learnt.
R(s) might be probabilistic or not.
R(s) might change over time; the agent has to adapt.
The results of actions might be known in advance or might have to be learnt; the results of actions can be fixed, or may change over time.
One extreme: everything is known → Bellman update; the other extreme: nothing is known except that states are observable and the available actions are known → TD-learning/Q-learning.

Basic Notations
T(s,a,s') denotes the probability of reaching s' when using action a in state s; it describes the transition model.
A policy π specifies what action to take for every possible state s ∈ S.
R(s) denotes the reward an agent receives in state s.
Utility-based agents learn a utility function over states and use it to select actions that maximize the expected utility of the outcome.
Q-learning, on the other hand, learns the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s,a)).
Finally, reflex agents learn a policy that maps directly from states to actions.
RL glossary: Stanford Blackboard video on this topic:
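As a concrete illustration (not from the slides), here is one possible way to represent the notation above in code; the states, actions, probabilities, rewards, and the discount value are all invented for illustration.

```python
# Minimal sketch (assumed toy example, not the course code): representing
# T(s,a,s'), R(s), a policy pi, and the expected utility of taking an action.

T = {  # T[s][a] maps successor state s' -> probability of reaching it
    "s1": {"e": {"s2": 1.0}},
    "s2": {"w": {"s1": 1.0}, "x": {"s1": 0.3, "s2": 0.7}},
}
R = {"s1": 0.0, "s2": 4.0}          # reward R(s) received in state s
pi = {"s1": "e", "s2": "w"}         # a deterministic policy: state -> action

def expected_utility(s, a, U, gamma=0.9):
    """R(s) + gamma * sum over s' of T(s,a,s') * U(s')."""
    return R[s] + gamma * sum(p * U[sp] for sp, p in T[s][a].items())

U = {"s1": 0.0, "s2": 0.0}              # some current utility estimates
print(expected_utility("s2", "x", U))   # -> 4.0 when all utilities are zero
```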

Reinforcement Learning
1. Introduction
2. Bellman Update
3. Temporal Difference Learning
4. Project 1
5. Active Reinforcement Learning
6. Applications
7. Summary
"You use your brain or a computer." "You learn about the world by performing actions in it."

2. Bellman Equation
Utility values obey the following equation:
U(s) = R(s) + γ · max_a Σ_{s'} T(s,a,s') · U(s')
It can be solved using dynamic programming. It assumes knowledge of the transition model T and the reward function R; the result is policy independent!
Assume γ=1 for this lecture!
"Measure utility in the future, after applying action a."
Video on "foundations" of RL:

Bellman Update
If we apply the Bellman update indefinitely often, we obtain the utility values that are the solution of the Bellman equation!
Bellman update: U_{i+1}(s) = R(s) + γ · max_a ( Σ_{s'} T(s,a,s') · U_i(s') )
Some equations for the XYZ-World:
U_{i+1}(1) = 0 + γ·U_i(2)
U_{i+1}(5) = 3 + γ·max(U_i(7), U_i(8))
U_{i+1}(8) = R(8) + γ·max(U_i(6), 0.3·U_i(7) + 0.7·U_i(9))
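A possible way to carry out this iteration in code is sketched below; the tiny three-state MDP is invented for illustration and is not the XYZ-World, and γ=0.2 is chosen only to mirror the discussion above.

```python
# Hedged sketch: repeated Bellman updates (value iteration) on an invented
# three-state MDP until the utility values stop changing.
GAMMA = 0.2

T = {  # T[s][a][s'] = transition probability (made-up toy world)
    "A": {"go": {"B": 1.0}},
    "B": {"go": {"C": 1.0}, "risky": {"A": 0.3, "C": 0.7}},
    "C": {"go": {"A": 1.0}},
}
R = {"A": 0.0, "B": 3.0, "C": -1.0}

U = {s: 0.0 for s in T}
for _ in range(1000):
    U_new = {
        s: R[s] + GAMMA * max(
            sum(p * U[sp] for sp, p in T[s][a].items()) for a in T[s]
        )
        for s in T
    }
    if max(abs(U_new[s] - U[s]) for s in T) < 1e-9:   # converged
        U = U_new
        break
    U = U_new

print(U)   # fixed point of the Bellman equation for this toy MDP
```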

Reinforcement Learning
1. Introduction
2. Passive Reinforcement Learning
3. Temporal Difference Learning
4. Project 1
5. Active Reinforcement Learning
6. Applications
7. Summary

3. Temporal Difference Learning
Idea: use observed transitions to adjust the values of observed states so that they comply with the constraint equation, using the following update rule:
U^π(s) ← U^π(s) + α [ R(s) + γ·U^π(s') − U^π(s) ]
α is the learning rate; γ is the discount rate.
Temporal difference equation.
No model assumption: T and R do not have to be known in advance.

Updating Estimations Based on Observations:
New_Estimation = Old_Estimation·(1−α) + Observed_Value·α
New_Estimation = Old_Estimation + Observed_Difference·α
Example: measure the utility of a state s whose current value is 2; the observed values are 3 and 3, and the learning rate is α=0.2:
Initial utility value: 2
Utility value after observing 3: 2·0.8 + 3·0.2 = 2.2
Utility value after observing 3, 3: 2.2·0.8 + 3·0.2 = 2.36
Remark: this is the simplest form of TD-learning.
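The worked example above can be replayed in a couple of lines; this sketch reproduces exactly its arithmetic (α = 0.2, initial estimate 2, observations 3 and 3).

```python
# Sketch reproducing the running-average update from the example above.
alpha = 0.2
estimate = 2.0
for observed in [3.0, 3.0]:
    estimate = estimate * (1 - alpha) + observed * alpha
    print(round(estimate, 2))   # prints 2.2, then 2.36
```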

Temporal Difference Learning
Idea: use observed transitions to adjust the values of observed states so that they comply with the constraint equation, using the following (equivalent) update rule:
U^π(s) ← (1−α)·U^π(s) + α·[ R(s) + γ·U^π(s') ]
α is the learning rate; γ is the discount rate.
Temporal difference equation.
No model assumption: T and R do not have to be known in advance.
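Below is a sketch of how this update could be applied along an observed trajectory; the states, rewards, the trajectory itself, and the values of α and γ are invented for illustration.

```python
# Hedged sketch of TD learning of state utilities along observed transitions.
alpha, gamma = 0.2, 0.9

U = {"s1": 0.0, "s2": 0.0, "s3": 0.0}    # current utility estimates
R = {"s1": -0.1, "s2": -0.1, "s3": 1.0}  # assumed rewards

trajectory = [("s1", "s2"), ("s2", "s3"), ("s3", "s1")]  # observed (s, s') pairs
for s, s_next in trajectory:
    # Equivalent to: U[s] = U[s] + alpha * (R[s] + gamma * U[s_next] - U[s])
    U[s] = (1 - alpha) * U[s] + alpha * (R[s] + gamma * U[s_next])

print(U)
```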

TD-Q-Learning
Goal: measure the utility of using action a in state s, denoted by Q(a,s); the following update formula is used every time the agent reaches state s' from s using action a:
Q(a,s) ← Q(a,s) + α [ R(s) + γ·max_{a'} Q(a',s') − Q(a,s) ]
α is the learning rate; γ is the discount factor.
Variation of TD-learning.
Not necessary to know the transition model T or the rewards R!
History and summary:
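Below is a minimal sketch of this update rule; the action set, the experience tuples, the rewards, and the values of α and γ are all made up for illustration.

```python
# Hedged sketch of the TD-Q-learning update:
# Q(a,s) <- Q(a,s) + alpha * (R(s) + gamma * max_a' Q(a',s') - Q(a,s)).
from collections import defaultdict

alpha, gamma = 0.2, 0.9
ACTIONS = ["n", "s", "e", "w"]
Q = defaultdict(float)             # Q[(a, s)], initialized to 0

def q_update(s, a, r, s_next):
    best_next = max(Q[(a2, s_next)] for a2 in ACTIONS)
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])

# Invented experience tuples (state, action, reward, next state):
for s, a, r, s_next in [(1, "e", -1, 2), (2, "w", -1, 1), (1, "e", -1, 2)]:
    q_update(s, a, r, s_next)

print(dict(Q))
```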

Example: Simplified PD World
[Figure: a small grid world over states {1, 2, 3, 4} with a pickup location P and a dropoff location D, connected by operators n, s, e, w; the exact layout is not recoverable from the transcript.]
Ungraded homework:
State space: {1, 2, 3, 4}
Policy: the agent starts in state 1 and applies the operators e-p-w-e-s-d-w-e-w-n-e-p-s-d.
Construct the Q-table assuming α=0.4 and γ=0.2!
The solution will be discussed in the Jan. 27 lecture!

TD-Learning Itself
sri lakshmi chilluveru's solution. After each step, this is what I have obtained:
Q(e,1) = -0.4
Q(p,2) = 4.8
Q(w,2) = Q(e,1) = Q(s,2) = -0.4
Q(d,3) = 4.8
Q(w,3) = -0.4
Q(e,4) = Q(w,3) = Q(n,4) = Q(e,1) = -0.4 = -0.64·…·(-1 + 0.2×4.8)
Q(p,2) = Q(s,2) = … = -0.4·…·(-1 + 0.2×4.8)
Q(d,3) = …
In the next loop, 4.8 will become 7.6, making Q(e,1) and Q(s,2) much larger!

TD-Learning Itself
In general, TD-learning is a prediction technique. In Project 3 in 2013, we used it to predict state utilities, where an agent obtains feedback about a world by exploring the states of the world through applying actions.

Reinforcement Learning
1. Introduction
2. Passive Reinforcement Learning
3. Temporal Difference Learning
4. Project 1
5. Active Reinforcement Learning
6. Applications
7. Summary

5. Active Reinforcement Learning
Now we must decide what actions to take.
Optimal policy: choose the action with the highest utility value.
Is that the right thing to do?

Active Reinforcement Learning
No! Sometimes we may get stuck in suboptimal solutions.
Exploration vs. Exploitation Tradeoff
Why is this important? The learned model is not the same as the true environment.

Explore vs. Exploit
Exploitation: maximize the current reward, vs. Exploration: maximize long-term well-being.

Simple Solution to the Exploitation/Exploration Problem
Choose a random action once in k times.
Otherwise, choose the action with the highest expected utility (k−1 out of k times).
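A sketch of this 1-in-k rule (a form of epsilon-greedy selection with epsilon = 1/k) is shown below; the value of k, the action set, and the Q-table passed in are placeholders, not part of the slides.

```python
# Hedged sketch: pick a random action once in k times, otherwise the greedy one.
import random

K = 10                                   # explore roughly once every K steps
ACTIONS = ["n", "s", "e", "w"]

def choose_action(s, Q):
    if random.random() < 1.0 / K:        # explore
        return random.choice(ACTIONS)
    # exploit: action with the highest current Q-value for state s
    return max(ACTIONS, key=lambda a: Q.get((a, s), 0.0))

print(choose_action(1, {}))   # with an all-zero Q-table the greedy branch returns the first action
```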

Reinforcement Learning
1. Introduction
2. Passive Reinforcement Learning
3. Temporal Difference Learning
4. Project 1
5. Active Reinforcement Learning
6. Applications
7. Summary

6. Applications
Robot soccer
Game playing: checker-playing program by Arthur Samuel (IBM)
Update rule: change the weights by the difference between the current state and the backed-up value, generating a full look-ahead tree
Backgammon (5:37):

Applications (2)
Elevator control
Helicopter control
Demo:
ICML 2008 talk:

7. Summary
The goal is to learn utility values of states and an optimal mapping from states to actions.
If the world is completely known and does not change, we can determine the utilities by solving the Bellman equations.
Otherwise, temporal difference learning has to be used, which updates values to match those of successor states.
Active reinforcement learning learns the optimal mapping from states to actions.

Sutton ICML 2009 Video on 4 Key Ideas of RL
Link: We will show the video 10:34-45:37.
Discussion: What is unique about RL? Our discussion found:
Rewards instead of giving the correct answer
Importance of exploration and sampling
Adaptation: dealing with changing worlds
Other things not mentioned on 4/15/2014: being worried about future/lifelong well-being; finding good policies, …