580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 2: Temporal difference learning.


Review

A reinforcement learning problem is characterized by a collection of states and actions. These are connected by edges that indicate which actions are available from each state and what the transition probabilities from each action to the new states are. [Figure: state-action graph, with edges labeled by the available actions and transition probabilities.] The goal of all reinforcement learning is to find the policy with the highest expected return (the sum of temporally discounted rewards). To find it, we can generally use policy iteration: alternate between policy evaluation (estimating the value of the current policy) and policy improvement (updating the policy based on that estimate).
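
As a reminder of the notation (a standard summary following Sutton & Barto, not copied from the slide images; gamma is the discount factor, r the reward, pi the policy):

    Return:       G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
    State value:  V^{\pi}(s) = E_{\pi}[ G_t | s_t = s ]
    Goal:         find the policy \pi^* that maximizes V^{\pi}(s) for every state s.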

Policy improvement theorem

Improving the policy locally for some states s, such that the newly chosen action is at least as good as the old one under the current value function, and leaving the policy for all other states the same, ensures that the new policy is at least as good as the old one in every state. Proof: Assume for a moment that the graph does not have any loops (tree structure). Then we can make an improved policy by changing the policy at the first state and following the old policy thereafter. For the first state, the value can only increase, since the new action was chosen to be at least as good under the old value function; for all other states, the value is unchanged, since the old policy is followed from there on. If the graph has loops, then the values of later states will be changed as well, but for the states where we follow the old policy the value can only increase, because only downstream values have increased. We can extend this proof by induction to changes at multiple states.
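
In symbols (a standard statement of the theorem in terms of Q and V, written here because the slide's equations were images):

    If    Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)   for all states s,
    then  V^{\pi'}(s) \ge V^{\pi}(s)           for all states s.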

In the homework you used a Monte-Carlo method (50 steps) for policy evaluation and a greedy, sub-greedy, or softmax method for policy improvement. Greedy mostly gets stuck in a policy of going to the right. All the others have a chance to learn the correct policy, but may not exploit that policy optimally in the end. [Figure: expected reward and P(left on first step) over the course of learning for the different methods.]
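
For reference, a minimal sketch of first-visit Monte-Carlo policy evaluation of the kind used in the homework. The environment interface (env.reset(), env.step(a) returning (state, reward, done)) and the evaluation parameters are illustrative assumptions, not the actual homework code.

    from collections import defaultdict

    def mc_evaluate(env, policy, n_episodes=50, gamma=1.0):
        """First-visit Monte-Carlo estimate of V^pi (hypothetical env/policy interface)."""
        returns = defaultdict(list)                   # state -> list of sampled returns
        for _ in range(n_episodes):
            # Generate one episode under the current policy.
            episode, done = [], False
            s = env.reset()
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                episode.append((s, r))
                s = s_next
            # Walk backwards, accumulating discounted returns; keep the first visit of each state.
            G, first_visit = 0.0, {}
            for s_t, r_t in reversed(episode):
                G = r_t + gamma * G
                first_visit[s_t] = G                  # last overwrite corresponds to the earliest visit
            for s_t, G_s in first_visit.items():
                returns[s_t].append(G_s)
        # Value estimate: average of the sampled returns per state.
        return {s: sum(g) / len(g) for s, g in returns.items()}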

A good strategy (gray line in the figure) is to start with high exploration in the beginning and end with high exploitation. In this example this is done by using a softmax method of policy improvement and decreasing the temperature parameter over time. [Figure: expected reward and P(left on first step) over the course of learning for the annealed softmax strategy.]
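
A small sketch of what annealed softmax action selection could look like in code; the exponential decay schedule and its parameters are illustrative choices, not the ones used to produce the figure.

    import math
    import random

    def softmax_action(q_values, temperature):
        """Sample an action with probability proportional to exp(Q / temperature)."""
        m = max(q_values)                                    # subtract max for numerical stability
        prefs = [math.exp((q - m) / temperature) for q in q_values]
        z = sum(prefs)
        return random.choices(range(len(q_values)), weights=[p / z for p in prefs], k=1)[0]

    def annealed_temperature(step, t_start=5.0, t_end=0.05, decay=0.01):
        """High temperature early (exploration), low temperature late (exploitation)."""
        return t_end + (t_start - t_end) * math.exp(-decay * step)

High temperature makes the action probabilities nearly uniform; as the temperature falls, the choice concentrates on the action with the highest estimated value.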

Batch vs. online learning

When we learn a value function with a batch (Monte-Carlo) algorithm, we need to wait until N steps are done before we can update. Temporal difference learning is an iterative way of learning a value function, such that you can change the value function at every step. Let's start as in LMS and see what gradient the batch algorithm follows. Remember that the value function is the expected return for a state, so we can find it by minimizing the difference between the value function and the measured return (by MC). In temporal difference learning we replace the return at every step with the expected return given the current observation and the current value function (we are bootstrapping). This defines TD(0), the simplest form of temporal difference learning.
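
The equations on this slide were images; the standard forms of the two updates (learning rate \alpha, discount \gamma, Monte-Carlo return G_t) are:

    Batch / Monte-Carlo:  minimize  \tfrac{1}{2} \sum_t ( G_t - V(s_t) )^2,
                          which gives the update  V(s_t) \leftarrow V(s_t) + \alpha ( G_t - V(s_t) ).

    TD(0): replace G_t by the bootstrapped target r_{t+1} + \gamma V(s_{t+1}):
                          V(s_t) \leftarrow V(s_t) + \alpha ( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) ).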

How can TD(0) do better than batch?

Plotted is the squared error of the estimated value function (compared to the true value function) for the batch algorithm and for temporal difference learning. Given the same amount of data, TD(0) actually does better than batch. How can this be? [Figure: Markov process with states A and B. A leads to B with r=0; from B the episode ends with r=1 (P=0.75) or r=0 (P=0.25).] Assume the Markov process shown in the figure, and say you observe the state-reward episodes B,1; B,1; B,1; B,0; A,0,B,0. Batch learning would assign A a value of 0, because the empirical return from A was always 0. Given the data, that is the maximum-likelihood estimate. TD will instead converge to V(A) = V(B), the value estimated for B from all of its visits. This estimate is different because it uses the knowledge that A leads to B and that our problem is Markovian. It is sometimes called the certainty-equivalence estimate, because it assumes certainty about the model structure.
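
In symbols (a sketch of the certainty-equivalence argument, not an equation taken from the slide): batch TD(0) effectively fits the Markov model to the data first and then computes the value implied by that model,

    \hat{V}(A) = \hat{r}_{A \to B} + \gamma \hat{V}(B) = 0 + \gamma \hat{V}(B),

where \hat{V}(B) is the empirical value of B over all of its visits. This is greater than zero as soon as B has ever yielded reward, whereas batch Monte-Carlo only averages the returns that were actually observed starting from A.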

Sarsa – on-policy evaluation

Currently we are alternating between policy evaluation and policy optimization every N steps. But can we also do policy improvement step by step? The first step is to do TD learning not on the state-value function but on the state-action value function Q. Thus, even though we change the policy, we are not throwing away the old value function (as in MC), but use it as a starting point for the new one.

Sarsa: initialize Q and the policy, and choose a_1. For t = 1:T: observe r_{t+1} and s_{t+1}; choose a_{t+1}; update Q; update the policy; end.
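
A compact sketch of the Sarsa loop in code. Epsilon-greedy action selection stands in for the "update the policy" step (the slides may have used softmax instead), and the environment interface is the same assumed one as above.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
        """Mostly pick the greedy action under Q, sometimes explore."""
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    def sarsa(env, n_actions, n_steps=10000, alpha=0.1, gamma=0.95, epsilon=0.1):
        """On-policy TD control: move Q(s,a) toward r + gamma * Q(s',a') for the action actually taken."""
        Q = defaultdict(float)                             # Q[(state, action)], initialized to 0
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        for _ in range(n_steps):
            s_next, r, done = env.step(a)                  # observe r_{t+1}, s_{t+1}
            if done:
                Q[(s, a)] += alpha * (r - Q[(s, a)])       # no bootstrapping past a terminal state
                s = env.reset()
                a = epsilon_greedy(Q, s, n_actions, epsilon)
                continue
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon)                 # choose a_{t+1}
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])     # update Q
            s, a = s_next, a_next                          # the improved policy is implicit in Q
        return Q

Because the target uses Q(s', a') for the action the policy actually takes next, Sarsa evaluates and improves the policy it is currently following, one step at a time.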

Addiction as a Computational Process Gone Awry (A. David Redish, Science, 2004)

Under natural circumstances the temporal difference signal is the usual reward-prediction error. The idea is that the drug (especially dopaminergic drugs like cocaine) may additionally induce a small temporal difference signal directly (D). In the beginning the temporal difference signal is high because of the high reward value of the drug (rational addiction theory). With longer use the reward value might sink, and the negative consequences would normally reduce the non-adaptive behavior. But because the temporal difference signal is always at least D, the behavior cannot be unlearned.
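
Written out, the model takes roughly the following form (after Redish, 2004; the slide's own notation is not preserved in the transcript):

    Normal TD error:                     \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
    With the drug's direct effect D > 0: \delta_t = \max( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) + D, \; D )

Since \delta_t \ge D > 0 no matter how large V(s_t) grows, the value of drug-predicting states keeps increasing and the behavior cannot be unlearned by ordinary negative prediction errors.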

Increased wanting (not more liking)

The model predicts that with continued use, drug-seeking behavior becomes increasingly insensitive to contrasting reward.

Decreased elasticity

Elasticity is a term from economics: it measures how much the tendency to buy a product decreases as its price increases. Because drug-seeking cannot easily be unlearned, the behavior becomes less and less elastic with prolonged drug use.

TD-nStep

So far we have only backed up the temporal difference error by one step. That means we have to visit the rewarding state again before the state BEFORE it can increase its value. However, we can equip our learner with a bigger memory, so that the back-up can be done over n steps. The 1-step, 2-step, and n-step TD learning rules differ only in how many future rewards enter the target (see the sketch below). This means that states up to n steps back are eligible for an update. We will investigate this more in the homework and in the next lecture.
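
The standard form of these rules (following Sutton & Barto; the slide's own equations were images):

    1-step target:  G_t^{(1)} = r_{t+1} + \gamma V(s_{t+1})
    2-step target:  G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})
    n-step target:  G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})

    and in each case the update is  V(s_t) \leftarrow V(s_t) + \alpha ( G_t^{(n)} - V(s_t) ).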