Chapter 6: Temporal Difference Learning


Chapter 6: Temporal Difference Learning
Objectives of this chapter:
- Introduce Temporal Difference (TD) learning
- Focus first on policy evaluation, or prediction, methods
- Then extend to control methods
(Slides adapted from R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.)

TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
Recall the constant-α Monte Carlo update, whose target is the actual return after time t, so we must wait until the end of the episode; the simplest TD update instead uses a target that is an estimate of the return.
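For reference, the two updates referred to above (equation images on the original slide) can be written in the book's notation as:

constant-α MC: V(s_t) \leftarrow V(s_t) + \alpha\bigl[ R_t - V(s_t) \bigr], where the target R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots is the actual return;
TD(0): V(s_t) \leftarrow V(s_t) + \alpha\bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr], where the target r_{t+1} + \gamma V(s_{t+1}) is an estimate of the return.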

Simple Monte Carlo
[Backup diagram: the constant-α MC update backs up each state's value from an entire sample trajectory, all the way to the terminal state T.]

Simplest TD Method
[Backup diagram: the TD(0) update backs up a state's value from a single sample transition to the next state, one step ahead.]

cf. Dynamic Programming
[Backup diagram: the DP update backs up a state's value from all possible one-step successors, using the model to take an expectation rather than sampling a single transition.]
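For comparison, the full (expected) backup used by DP policy evaluation, which this diagram depicts, can be reconstructed in the first edition's notation as:

V(s_t) \leftarrow E_{\pi}\bigl[ r_{t+1} + \gamma V(s_{t+1}) \bigr] = \sum_{a} \pi(s_t, a) \sum_{s'} P^{a}_{s_t s'} \bigl[ R^{a}_{s_t s'} + \gamma V(s') \bigr]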

TD Bootstraps and Samples
Bootstrapping: the update involves an existing estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: the update uses a sample outcome rather than a full expected value
- MC samples
- DP does not sample
- TD samples

Example: Driving Home
[Table from the book: each state along the drive home (leaving office, reach car, exiting highway, ...), with the elapsed time, the predicted time to go, and the predicted total time at that state.]

Driving Home
Changes recommended by Monte Carlo methods (α = 1) versus changes recommended by TD methods (α = 1).

Advantages of TD Learning
- TD methods do not require a model of the environment, only experience
- TD, but not MC, methods can be fully incremental
  - You can learn before knowing the final outcome: less memory, less peak computation
  - You can learn without the final outcome, from incomplete sequences
- Both MC and TD converge (under assumptions to be detailed later), but which is faster?

Random Walk Example
A small random walk with five non-terminal states (A through E); episodes terminate at either end, with reward 0 on the left and reward 1 on the right. The figure shows the values learned by TD(0) after various numbers of episodes.

TD and MC on the Random Walk
Learning curves (error in the value estimates) for TD(0) and constant-α MC at several step sizes, with data averaged over 100 sequences of episodes.
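To make the prediction setting concrete, here is a minimal MATLAB sketch of tabular TD(0) on this random walk (our own illustration, not part of the original slides; the step size, start state, and variable names are our choices):

% TD(0) prediction on the 5-state random walk (Example 6.2).
% States 2..6 are A..E; states 1 and 7 are the terminal states.
% Policy: move left or right with equal probability.
alpha = 0.1;                     % step size
gamma = 1.0;                     % undiscounted task
num_episodes = 100;
V = [0 0.5*ones(1,5) 0];         % initial estimates; terminal values stay 0
true_V = (1:5)/6;                % true values of A..E, for comparison
for episode = 1:num_episodes
    s = 4;                       % every episode starts in the centre state C
    while s ~= 1 && s ~= 7
        if rand < 0.5, s_pr = s - 1; else, s_pr = s + 1; end
        r = double(s_pr == 7);   % reward 1 only on entering the right terminal
        V(s) = V(s) + alpha*(r + gamma*V(s_pr) - V(s));   % TD(0) update
        s = s_pr;
    end
end
rms_error = sqrt(mean((V(2:6) - true_V).^2))   % compare with the book's figure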

Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute the updates according to TD(0), but only change the estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer: it converges to the sample averages of the actual returns at each visited state.

Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch and the algorithm was trained until convergence; the whole experiment was repeated 100 times. TD appears to do better than MC, yet batch MC gives the estimate that minimizes the mean-squared error with respect to the actual returns. So how can TD do better than this "optimal" method? In what sense is it better?

You are the Predictor
Suppose you observe the following 8 episodes (each line is one episode, listed as state, reward, ...):
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What would you predict for V(A) and V(B)?

You are the Predictor
[Diagram: a two-state Markov process fitted to the data above; A moves to B with reward 0 (100% of the time), and from B the episode terminates with reward 1 (75%) or reward 0 (25%).]

You are the Predictor
- The prediction that best matches the training data is V(A) = 0: this minimizes the mean-squared error on the training set, and it is what a batch Monte Carlo method gets.
- If we take the sequential (Markov) structure of the problem into account, we would instead set V(A) = 0.75. This is the prediction of the maximum-likelihood Markov model of the data: fit the best Markov model, assume it is exactly correct, and compute what it predicts (how? see the worked calculation below).
- This is called the certainty-equivalence estimate, and it is what batch TD(0) converges to.
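As a worked reconstruction of the certainty-equivalence calculation hinted at above (not spelled out on the original slide): the fitted Markov model says A always moves to B with reward 0, and from B the episode ends with reward 1 on 6 of the 8 observed visits, so

V(B) = \frac{6}{8} = 0.75, \qquad V(A) = 0 + \gamma V(B) = 0.75 \quad (\gamma = 1).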

Learning an Action-Value Function
For control we estimate the action-value function Q^π(s,a) rather than V^π(s); the same TD update is applied to the state-action pairs visited along the trajectory (the update appears on the next slide).

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current action-value estimate. The name Sarsa comes from the quintuple of events that drives the update: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}.
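The Sarsa update itself (an equation image on the original slide) is, in the book's notation:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]

with Q(s,a) defined as zero when s is a terminal state.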

Windy Gridworld
Undiscounted, episodic task; reward = –1 on every step until the goal is reached. [Gridworld figure: a crosswind in the middle columns pushes the agent upward by the number of cells indicated below each column.]

Results of Sarsa on the Windy Gridworld
[Figure: cumulative number of completed episodes versus time steps; the increasing slope shows that episodes are completed more and more quickly as the greedy policy improves.]


Results of Sarsa on the Windy Gridworld (MATLAB code)
% Example 6.5
% states are numbered from 1 to 70 according to row and column index
% matrix is scanned columnwise
clear all; close all;
% rewards
r=0*(1:70)-1; r(53)=0;
% next state for action - going north
S_prime(1,1)=1; S_prime(1,2:70)=(1:69); S_prime(1,1:7:64)=(1:7:64);
Spr=reshape(S_prime,7,10);
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(1,:)=reshape(Spr,1,70);
% next state for action - going east
Spr=(8:70); Spr=reshape(Spr,7,9); Spr=[Spr Spr(:,9)];
S_prime(2,:)=reshape(Spr,1,70);
% next state for action - going south
Spr=(1:70); Spr=reshape(Spr,7,10); Spr=[Spr(2:7,:); Spr(7,:)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(1,4:9)=(22:7:57);
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(3,:)=reshape(Spr,1,70);
% next state for action - going west
Spr=[Spr(:,1) Spr(:,1:9)];
S_prime(4,:)=reshape(Spr,1,70);
% initialize variables
Qsa=zeros(4,70);
alpha=0.9; epsi=0.1; gamma=0.9;
num_episodes=170;
t=1;

Results of Sarsa on the Windy Gridworld (MATLAB code, continued)
% repeat for each episode
for episode=1:num_episodes
    epsi=1/t;
    % initialize state
    s=4;
    % choose action a from s using the epsilon-greedy policy
    chose_policy=rand>epsi;
    if chose_policy
        [val,a]=max(Qsa(:,s));
    else
        a=ceil(rand*(1-eps)*4);
    end;
    % repeat for each step of the episode
    episode_size(episode)=0;
    path=[ ];
    random=[ ];
    while s~=53
        epsi=1/t;
        % take action a, observe r, s_pr
        s_pr=S_prime(a,s);
        % choose action a_pr from s_pr using the epsilon-greedy policy
        chose_policy_pr=rand>epsi;
        if chose_policy_pr
            [val,a_pr]=max(Qsa(:,s_pr));
            random=[random 0];
        else
            a_pr=ceil(rand*(1-eps)*4);
            random=[random 1];
        end;
        % Sarsa update
        Qsa(a,s)=Qsa(a,s)+alpha*(-1+gamma*Qsa(a_pr,s_pr)-Qsa(a,s));
        % the following selection works better
        % Qsa(a,s)=Qsa(a_pr,s_pr)-1;
        s=s_pr;
        a=a_pr;
        episode_size(episode)=episode_size(episode)+1;
        path=[path s];
        t=t+1;
    end; % while s
end; % for episode

Results of Sarsa on the Windy Gridworld (MATLAB code, continued)
plot([0 cumsum(episode_size)],(0:num_episodes));
xlabel('Time step');
ylabel('Episode index');
title('Windy gridworld learning rate with eps=1/t, alpha=0.9');
figure(2)
plot(episode_size(1:num_episodes))
xlabel('Episode index');
ylabel('Episode length in steps');
% determine the optimum (greedy) policy
for i=1:70
    Optimum_policy(:,i)=(Qsa(:,i)==max(Qsa(:,i)));
end;
% normalise so that tied greedy actions share the probability equally
norm=sum(Optimum_policy);
for i=1:70
    Optimum_policy(:,i)=Optimum_policy(:,i)./norm(i);
end
% Optimum path
% path =
%   Columns 1 through 13
%     11 18 25 31 37 43 50 57 64 65 66 67 68
%   Columns 14 through 15
%     61 53

Q-Learning: Off-Policy TD Control
Unlike on-policy TD (Sarsa), the learned action-value function directly approximates the optimal action-value function, independently of the policy being followed: the update target uses the best next action rather than the action actually taken.
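The one-step Q-learning update (an equation image on the original slide) can be reconstructed in the book's notation as:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]

As an illustration (ours, not part of the original slides), the corresponding change to the windy-gridworld MATLAB code above would be to replace the Sarsa update line with one whose target uses the best action at s_pr, regardless of the action a_pr actually taken by the epsilon-greedy behaviour policy:

% Q-learning variant of the Sarsa update line
Qsa(a,s)=Qsa(a,s)+alpha*(-1+gamma*max(Qsa(:,s_pr))-Qsa(a,s));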

Cliff Walking
Comparison of Sarsa and Q-learning. Reward is –1 for all moves and –100 for falling off the cliff; action selection is ε-greedy with ε = 0.1. Q-learning learns values for the optimal (cliff-edge) path but occasionally falls off because of exploration, while Sarsa takes exploration into account and learns the longer but safer path, giving better online performance.

Actor-Critic Methods
- Explicit representation of the policy as well as the value function
- Minimal computation needed to select actions
- Can learn an explicitly stochastic policy
- Can put constraints on policies
- Appealing as psychological and neural models

Actor-Critic Details
The critic computes the TD error δ; with a positive δ the preference for the action just taken is increased, and with a negative δ it is decreased.
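The equations behind this slide appeared as images; reconstructed from the first edition of the book, with p(s,a) denoting the actor's action preferences, they are:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t

where β is a positive step-size parameter, so a positive δ_t strengthens the tendency to select a_t in s_t again.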

Dopamine Neurons and TD Error
[Figure: brain regions implicated in reward processing — amygdala, nucleus accumbens, hippocampus, striatum (caudate nucleus), VTA & substantia nigra (not shown). Adapted from W. Schultz et al., Université de Fribourg.]

The CAR Task
In the conditioned avoidance response (CAR) task, a rat learns that a conditioned stimulus (CS), such as a light or sound, precedes an unconditioned stimulus (US) such as an electric shock. Under normal conditions the rat learns to do two things: first, it learns to escape the shock by moving to the other side of the cage (for example); second, it eventually learns that the CS predicts the shock, and it then avoids the shock entirely by taking evasive action in response to the CS.

A Computational Model
Following formal models of reinforcement learning (Sutton and Barto, 1998), we translate the CAR task into a Markov decision process (MDP). [Diagram: each circle represents a state the simulated rat can be in, and the arrows denote the outcome of each of the two possible actions.] The shock of the US elicits a punishment of -1. The rat must learn to take, in each state, the action that minimises the overall punishment.

R-Learning
R-learning is an off-policy control method for the advanced RL setting in which one neither discounts nor divides experience into distinct episodes; instead, the aim is to maximize the average reward per time step.

Average Reward per Time Step in R-Learning
The average reward ρ^π is the same for every state if the process is ergodic, i.e., there is a nonzero probability of reaching any state from any other under the policy. Values are then defined relative to ρ^π: the relative value of a state measures a transient that declines to zero as the rewards from all states settle down to the average reward ρ^π.
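The equations on this slide appeared as images; the standard definitions and updates from the book's R-learning section are, as a reconstruction (notation may differ slightly from the original slide):

\rho^{\pi} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_{\pi}[ r_t ], \qquad \widetilde{V}^{\pi}(s) = \sum_{k=1}^{\infty} E_{\pi}\bigl[ r_{t+k} - \rho^{\pi} \mid s_t = s \bigr]

After taking action a in state s and observing r and s', R-learning updates its relative action values R(s,a) and its average-reward estimate ρ:

R(s,a) \leftarrow R(s,a) + \alpha\bigl[ r - \rho + \max_{a'} R(s',a') - R(s,a) \bigr]
\rho \leftarrow \rho + \beta\bigl[ r - \rho + \max_{a'} R(s',a') - \max_{a} R(s,a) \bigr] \quad \text{(only when a greedy action was taken)}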

Access-Control Queuing Task
Apply R-learning.
- n servers
- Customers have four different priorities, which pay a reward of 1, 2, 3, or 4 if served
- At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue
- The proportion of (randomly distributed) high-priority customers in the queue is h
- A busy server becomes free with probability p on each time step
- Statistics of arrivals and departures are unknown
- Parameters: n = 10, h = 0.5, p = 0.06

Afterstates
Usually a state-value function evaluates states in which the agent is about to take an action, but sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful?
- We do not know the opponent's response, so determining a conventional state value is difficult.
- Action values also do not work well, since different state-action pairs may lead to the same board position and should share a single value.
What is this in general?

Summary
- TD prediction: introduced one-step, tabular, model-free TD methods
- Extend prediction to control by employing some form of generalized policy iteration (GPI)
  - On-policy control: Sarsa
  - Off-policy control: Q-learning and R-learning
- These methods bootstrap and sample, combining aspects of DP and MC methods