Reinforcement Learning


Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Introduction. Supervised learning is given (Example, Class) pairs; reinforcement learning is instead given a sequence of (Situation, Reward), (Situation, Reward), … pairs.

Examples. Playing chess: the reward comes at the end of the game. Ping-pong: a reward on each point scored. Animals: hunger and pain are negative rewards; food intake is a positive reward.

Framework: Agent in State Space. Example: the XYZ-World, a graph of ten states with no terminal states, connected by compass-direction operators (n, s, e, w, ne, nw, sw) plus a stochastic operator x whose two outcomes occur with probabilities 0.7 and 0.3. Rewards: R(3)=+5, R(5)=+3, R(6)=-9, R(8)=+4, R(9)=-6; the remaining states carry no reward label. [State-space diagram omitted.] Problem: what actions should an agent choose to maximize its rewards?

Motivation: the PageRank Algorithm (http://en.wikipedia.org/wiki/PageRank). PR(A) = (1-d) + d*(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where PR(A) is the PageRank of page A, PR(Ti) is the PageRank of a page Ti that links to page A, C(Ti) is the number of outbound links on page Ti, and d is a damping factor which can be set between 0 and 1.
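As a small aside, the update above can be run directly; the following is a minimal Python sketch of the damped PageRank iteration, where the toy link graph, the damping factor 0.85, and the fixed iteration count are illustrative choices, not values taken from the slide.

# Toy implementation of the damped PageRank iteration from the formula above.
# The link graph, damping factor default, and iteration count are illustrative.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}               # initial PageRank for every page
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # sum of PR(Ti)/C(Ti) over all pages Ti that link to page a
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))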

XYZ-World: Discussion of Problem 12, TD for policy P vs. Bellman. [State-space diagram with per-state value pairs, e.g. (3.3, 0.5), (3.2, -0.5), (0.6, -0.2), omitted.] I tried hard, but does anyone have better explanations? Explanation of the discrepancies between TD for P and Bellman: the most significant discrepancies are in states 3 and 8, with a minor one in state 10. P chooses the worst successor of state 8; it should apply operator x instead. P should apply w in state 6, but does so in only 2/3 of the cases, which affects the utility of state 3. The low utility value of state 8 under TD seems to lower the utility value of state 10, giving only a minor discrepancy there. Policy P's state sequence: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.

XYZ-World: Discussion of Problem 12, Bellman Update with γ=0.2. Computed utilities: U(1)=0.145, U(2)=0.72, U(3)=0.58, U(4)=0.03, U(5)=3.63, U(6)=-8.27, U(7)=0.001, U(8)=3.17, U(9)=-5.98, U(10)=0.63. Discussion of using the Bellman update for Problem 12: There is no convergence for γ=1.0; the utility values seem to run away! State 3 has utility 0.58 although it gives a reward of +5, due to the immediate penalty that follows; we were able to detect that. Did anybody run the algorithm for other values of γ, e.g. 0.4 or 0.6, and if yes, did it converge to the same values? The speed of convergence seems to depend on the value of γ.

XYZ-World: Discussion of Problem 12, TD with inverted R vs. TD. [State-space diagram with per-state value pairs, e.g. (0.57, -0.65), (2.98, -2.99), (-0.50, 0.47), (-0.18, -0.12), omitted.] Other observations: The Bellman update did not converge for γ=1. The Bellman update converged very fast for γ=0.2. Did anybody try other values of γ (e.g. 0.6)? The Bellman update suggests a utility value of 3.6 for state 5; what does this tell us about the optimal policy? E.g., is 1-2-5-7-4-1 optimal? TD reversed the utility values quite neatly when the rewards were inverted: x became -x+u with u ∈ [-0.08, 0.08]. Policy P's state sequence: 1-2-3-6-5-8-6-9-10-8-6-5-7-4-1-2-5-7-4-1.

XYZ-World: Other Considerations. R(s) might be known in advance or might have to be learnt. R(s) might be probabilistic or deterministic. R(s) might change over time, so the agent has to adapt. The results of actions might be known in advance or might have to be learnt; the results of actions can be fixed, or may change over time.

Example: The ABC-World (to be used in Assignment 2). Like the XYZ-World, it is a graph of ten states with no terminal states, connected by compass-direction operators plus a stochastic operator x whose two outcomes occur with probabilities 0.9 and 0.1. Rewards: R(3)=+5, R(4)=-1, R(5)=-4, R(6)=-9, R(8)=-3, R(9)=+8, R(10)=+9. [State-space diagram omitted.] Problem: what actions should an agent choose to maximize its rewards?

Basic Notations. T(s,a,s') denotes the probability of reaching s' when using action a in state s; it describes the transition model. A policy π specifies what action to take in every possible state s ∈ S. R(s) denotes the reward an agent receives in state s. Utility-based agents learn a utility function over states and use it to select actions that maximize the expected utility of the outcome. Q-learning agents, on the other hand, learn the expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s,a)). Finally, reflex agents learn a policy that maps directly from states to actions.
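To make the notation concrete, here is a hedged sketch of how T(s,a,s'), R(s), a policy, and a Q-value could be held in plain Python; the states, action names, probabilities, and rewards are made up, and only the shapes of the structures follow the slide.

# A possible way to hold the notation from this slide in plain Python.
# The states, action names, probabilities, and rewards below are made up.
T = {
    ("s1", "go"):   [("s2", 0.7), ("s3", 0.3)],   # T(s, a, s') as (s', prob) pairs
    ("s1", "stay"): [("s1", 1.0)],
}
R = {"s1": 0.0, "s2": +5.0, "s3": -9.0}           # R(s)
policy = {"s1": "go"}                             # a policy: state -> action

def q_value(s, a, U, gamma=1.0):
    """Expected utility of doing action a in state s, given state utilities U."""
    return R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, a)])

U = {"s1": 0.0, "s2": 5.0, "s3": -9.0}
print(q_value("s1", "go", U))                     # ≈ 0.8 (= 0 + 0.7*5 + 0.3*(-9))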

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Passive Learning. We assume the policy Π is fixed: in state s we always execute action Π(s). Rewards are given.

Expected Utility. UΠ(s) = E[ Σt=0..∞ γ^t R(st) | Π, s0 = s ]: the expected sum of discounted rewards when policy Π is followed, starting in state s.
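A minimal sketch of the quantity inside this expectation, evaluated for a single observed run (the reward sequence and γ are made-up example values; UΠ(s) is the average of this quantity over many runs):

# Discounted sum of rewards along one observed run; rewards and gamma are
# illustrative values only.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, +5, -9, +3], gamma=0.9))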

Direct Utility Estimation. Convert the problem into a supervised learning problem: (1,1) → U = 0.72, (2,1) → U = 0.68, … and learn to map states to utilities. Problem: utilities are not independent of each other!
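A minimal sketch of direct utility estimation under these assumptions: each visited state's observed reward-to-go is treated as one training sample and the samples are averaged. The single trial used below, with grid-world states and made-up rewards, is purely illustrative.

from collections import defaultdict

# Direct utility estimation: average the observed reward-to-go of every state.
def direct_utility(trials, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:                       # trial = [(state, reward), ...]
        for i in range(len(trial)):
            # reward-to-go from position i onward, one sample for state trial[i][0]
            g = sum((gamma ** k) * r for k, (_, r) in enumerate(trial[i:]))
            s = trial[i][0]
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

trial = [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), +1.0)]
print(direct_utility([trial]))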

Bellman Equation. Utility values obey the following equation: U(s) = R(s) + γ * maxa Σs' T(s,a,s') * U(s'). Assume γ = 1 for this lecture! The equation can be solved using dynamic programming. It assumes knowledge of the transition model T and the reward R; the result is policy independent!

Example. U(1,3) = 0.84 and U(2,3) = 0.92. After observing the transition (1,3) → (2,3), we hope to see that U(1,3) matches its successor's estimate; the target value is 0.88. The current value 0.84 is a bit low, so we must increase it.

Bellman Update (Section 17.2 of the textbook). If we apply the Bellman update indefinitely often, we obtain the utility values that are the solution of the Bellman equation!
Ui+1(s) = R(s) + γ * maxa Σs' T(s,a,s') * Ui(s')
Some equations for the XYZ-World:
Ui+1(1) = 0 + γ*Ui(2)
Ui+1(5) = 3 + γ*max(Ui(7), Ui(8))
Ui+1(8) = 4 + γ*max(Ui(6), 0.3*Ui(7) + 0.7*Ui(9))
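A sketch of applying the Bellman update repeatedly (value iteration). The three-state, two-action MDP below is made up for illustration and is deliberately not the XYZ-World, whose full transition model is not reproduced on this slide.

# Value iteration on a made-up three-state MDP.
GAMMA = 0.2
R = {"a": 0.0, "b": +5.0, "c": -9.0}
T = {                                         # T[(s, action)] = [(s', prob), ...]
    ("a", "go"):   [("b", 1.0)],
    ("a", "stay"): [("a", 1.0)],
    ("b", "go"):   [("c", 0.7), ("a", 0.3)],
    ("b", "stay"): [("b", 1.0)],
    ("c", "go"):   [("a", 1.0)],
    ("c", "stay"): [("c", 1.0)],
}

def bellman_update(U):
    return {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in T[(s, a)])
                                  for a in ("go", "stay"))
            for s in R}

U = {s: 0.0 for s in R}
for _ in range(50):                           # plenty of iterations for gamma = 0.2
    U = bellman_update(U)
print(U)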

Updating Estimations Based on Observations:
New_Estimation = Old_Estimation*(1-α) + Observed_Value*α
New_Estimation = Old_Estimation + Observed_Difference*α
Example: measure the utility of a state s whose current value is 2, when the observed values are 3 and 3 and the learning rate α is 0.2. Initial utility value: 2. Utility value after observing 3: 2*0.8 + 3*0.2 = 2.2. Utility value after observing 3, 3: 2.2*0.8 + 3*0.2 = 2.36.
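The same running-average update in Python, reproducing the worked example (initial value 2, observations 3 and 3, α = 0.2):

# Running-average update from this slide.
def update(old_estimate, observed_value, alpha=0.2):
    return old_estimate * (1 - alpha) + observed_value * alpha

u = 2.0
for observation in (3, 3):
    u = update(u, observation)
    print(u)      # approximately 2.2 after the first update, 2.36 after the second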

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Temporal Difference Learning. Idea: use observed transitions to adjust the values of observed states so that they comply with the constraint equation, using the following update rule:
UΠ(s) ← UΠ(s) + α [ R(s) + γ UΠ(s') - UΠ(s) ]
α is the learning rate; γ is the discount factor. This is the temporal difference equation. No model assumption: T and R do not have to be known.
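A hedged sketch of this update rule: after each observed transition s → s', U(s) is nudged toward the target R(s) + γ*U(s'). The states, reward, and step size in the example call are made up.

# One TD update per observed transition s -> s'.
def td_update(U, s, s_prime, reward, alpha=0.1, gamma=1.0):
    U[s] += alpha * (reward + gamma * U[s_prime] - U[s])

U = {"s": 0.0, "s_prime": 2.0}
td_update(U, "s", "s_prime", reward=+3)
print(U["s"])     # moved 10% of the way toward 3 + U(s') = 5, i.e. now 0.5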

TD-Q-Learning. Goal: measure the utility of using action a in state s, denoted by Q(a,s); the following update formula is used every time the agent reaches state s' from s using action a:
Q(a,s) ← Q(a,s) + α [ R(s) + γ*maxa' Q(a',s') - Q(a,s) ]
α is the learning rate; γ is the discount factor. This is a variation of TD-learning: it is not necessary to know the transition model T!
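A hedged sketch of the TD-Q-learning rule, applied once per observed step (s, a, R(s), s'); the state numbers and action names are made up for illustration.

from collections import defaultdict

Q = defaultdict(float)                        # Q[(a, s)], implicitly initialized to 0

def q_update(s, a, reward, s_prime, actions, alpha=0.1, gamma=0.9):
    # target uses the best estimated Q-value available in the successor state
    best_next = max(Q[(a2, s_prime)] for a2 in actions)
    Q[(a, s)] += alpha * (reward + gamma * best_next - Q[(a, s)])

q_update(s=5, a="n", reward=+3, s_prime=8, actions=["n", "s", "x"])
print(Q[("n", 5)])                            # ≈ 0.3 (= 0.1 * (3 + 0.9*0 - 0))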

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Active Reinforcement Learning Now we must decide what actions to take. Optimal policy: Choose action with highest utility value. Is that the right thing to do?

Active Reinforcement Learning. No! Sometimes we may get stuck in suboptimal solutions. This is the exploration vs. exploitation tradeoff. Why is this important? Because the learned model is not the same as the true environment.

Explore vs. Exploit. Exploitation: maximize the agent's immediate reward based on its current estimates. Exploration: maximize its long-term well-being by improving those estimates.

Simple Solution to the Exploitation/Exploration Problem: Choose a random action once in k times; otherwise (k-1 out of k times), choose the action with the highest expected utility.
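A minimal sketch of this rule (the name utility_estimate is a placeholder for whatever estimate the agent currently maintains, not a fixed API):

import random

# Pick a random action roughly once in k times; otherwise act greedily.
def choose_action(state, actions, utility_estimate, k=10):
    if random.randint(1, k) == 1:             # explore: on average 1 out of k times
        return random.choice(actions)
    return max(actions, key=lambda a: utility_estimate(state, a))

print(choose_action("s1", ["n", "s", "e", "w"], lambda s, a: {"n": 1.0}.get(a, 0.0)))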

Another Solution: Combining Exploration and Exploitation.
U+(s) ← R(s) + γ*maxa f(u,n), where u = Σs' T(s,a,s')*U+(s') and n = N(a,s)
U+(s): optimistic estimate of the utility of state s. N(a,s): number of times action a has been tried in state s. f(u,n): exploration function (idea: it returns the value u if n is large, and values larger than u as n decreases). Example: f(u,n) := if n > navg then u else max(n/navg*u, uavg), with navg being the average number of operator applications. Idea behind f: the utility of states/actions that have not been explored much is increased artificially.
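A sketch of the exploration function f defined above; navg and uavg are assumed to be tracked elsewhere and are simply passed in as plain arguments here.

# Exploration function f(u, n) from this slide.
def f(u, n, n_avg, u_avg):
    if n > n_avg:
        return u                              # tried often enough: use the plain estimate
    return max(n / n_avg * u, u_avg)          # rarely tried: optimistically boosted value

print(f(u=2.0, n=1, n_avg=4, u_avg=1.5))      # max(0.5, 1.5) = 1.5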

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Applications: Game Playing. The checkers-playing program by Arthur Samuel (IBM). Update rule: change the weights based on the difference between the evaluation of the current state and the backed-up value obtained from generating a full look-ahead tree.

Reinforcement Learning: Introduction, Passive Reinforcement Learning, Temporal Difference Learning, Active Reinforcement Learning, Applications, Summary

Summary. The goal is to learn the utility values of states and an optimal mapping from states to actions. If the world is completely known and does not change, we can determine the utilities by solving the Bellman equations. Otherwise, temporal difference learning has to be used, which updates values to match those of successor states. Active reinforcement learning learns the optimal mapping from states to actions.