Published by Jason Fitzgerald. Modified over 9 years ago.
1
CMSC 471, Fall 2009
Temporal Difference Learning
Prof. Marie desJardins
Class #25 – Tuesday, 11/24
Thanks to Rich Sutton and Andy Barto for the use of their slides!
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
2
Chapter 6: Temporal Difference Learning
Objectives of this chapter:
• Introduce Temporal Difference (TD) learning
• Focus first on policy evaluation, or prediction, methods
• Then extend to control methods
3
Simple Monte Carlo
[Figure: backup diagram; T marks a terminal state]
4
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π
Recall:
  Simple Monte Carlo method, target: the actual return after time t
  Simplest TD method, target: an estimate of the return
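The two targets named on this slide correspond to the update rules below. This is a sketch in the notation of Sutton and Barto's text, since the slide's own equations did not survive extraction. The symbols are assumptions: α is a step size, γ a discount rate, R_t the return following time t, and r_{t+1} the next reward.

```latex
\begin{align*}
\text{Simple every-visit MC:}\quad
  & V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[\,R_t - V(s_t)\,\bigr] \\
\text{TD(0):}\quad
  & V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,\bigr]
\end{align*}
```

The only difference is the target inside the brackets: MC waits for the full return R_t, while TD(0) substitutes the one-step estimate r_{t+1} + γV(s_{t+1}).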
5
Simplest TD Method
[Figure: backup diagram; T marks a terminal state]
6
cf. Dynamic Programming
[Figure: backup diagram; T marks a terminal state]
7
TD Bootstraps and Samples
• Bootstrapping: update involves an estimate
    MC does not bootstrap
    DP bootstraps
    TD bootstraps
• Sampling: update does not involve an expected value
    MC samples
    DP does not sample
    TD samples
8
Example: Driving Home
9
Driving Home
[Figure panels: changes recommended by Monte Carlo methods (α=1); changes recommended by TD methods (α=1)]
10
Advantages of TD Learning
• TD methods do not require a model of the environment, only experience
• TD, but not MC, methods can be fully incremental
    You can learn before knowing the final outcome
      – Less memory
      – Less peak computation
    You can learn without the final outcome
      – From incomplete sequences
• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
11
Random Walk Example
[Figure: values learned by TD(0) after various numbers of episodes]
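A minimal sketch of tabular TD(0) on a random walk follows. The task details are assumptions taken from the usual textbook setup rather than from this slide: five non-terminal states in a row, a start in the centre state, equal chance of stepping left or right, reward 1 only on exiting to the right, no discounting, and initial values of 0.5. The function name and parameters are illustrative.

```python
import random

# Assumed setup: five non-terminal states (indices 1..5) between two
# terminal states (indices 0 and 6). Reward is 1 on exiting right, else 0.
N_STATES = 5
# True values under the random policy for this assumed task: 1/6 .. 5/6.
TRUE_VALUES = [i / (N_STATES + 1) for i in range(1, N_STATES + 1)]


def td0_random_walk(episodes, alpha=0.05, gamma=1.0, seed=0):
    """Estimate state values with tabular TD(0) under the random policy."""
    rng = random.Random(seed)
    # V[0] and V[6] are terminal and stay at 0; the rest start at 0.5.
    V = [0.0] + [0.5] * N_STATES + [0.0]
    for _ in range(episodes):
        s = (N_STATES + 1) // 2  # start in the centre state
        while 0 < s < N_STATES + 1:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == N_STATES + 1 else 0.0
            # TD(0) update: move V(s) toward r + gamma * V(s_next).
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:-1]


if __name__ == "__main__":
    for n in (0, 1, 10, 100, 1000):
        print(n, [round(v, 2) for v in td0_random_walk(n)])
```

With a constant step size the estimates keep fluctuating around the true values rather than settling exactly, so a smaller alpha trades speed for a steadier final estimate.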
12
TD and MC on the Random Walk
[Figure: data averaged over 100 sequences of episodes]
13
Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
14
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.
15
You are the Predictor
Suppose you observe the following 8 episodes:
  A, 0, B, 0
  B, 1
  B, 0
16
You are the Predictor
17
You are the Predictor
• The prediction that best matches the training data is V(A)=0
    This minimizes the mean-square-error on the training set
    This is what a batch Monte Carlo method gets
• If we consider the sequentiality of the problem, then we would set V(A)=.75
    This is correct for the maximum likelihood estimate of a Markov model generating the data
    i.e., if we do a best-fit Markov model, and assume it is exactly correct, and then compute what it predicts (how?)
    This is called the certainty-equivalence estimate
    This is what TD(0) gets
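The two answers for V(A) can be reproduced with a short sketch. It assumes the eight episodes are the usual textbook data set (one episode A,0,B,0; six episodes B,1; one episode B,0), since the transcript lists only the distinct episode types. It also computes the certainty-equivalence estimate directly from a fitted model instead of running batch TD(0), which is the estimate batch TD(0) converges to. The function names are illustrative.

```python
from collections import defaultdict

# Assumed data: each episode is a list of (state, reward that follows it).
EPISODES = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]


def batch_mc(episodes):
    """Average the undiscounted return observed after each state visit."""
    returns = defaultdict(list)
    for ep in episodes:
        for i, (state, _) in enumerate(ep):
            returns[state].append(sum(r for _, r in ep[i:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}


def certainty_equivalence(episodes, sweeps=100):
    """Fit a maximum-likelihood Markov model, then evaluate it by sweeps."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[s][next state]
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for ep in episodes:
        for i, (state, reward) in enumerate(ep):
            nxt = ep[i + 1][0] if i + 1 < len(ep) else None  # None = terminal
            counts[state][nxt] += 1
            reward_sum[state] += reward
            visits[state] += 1
    V = {s: 0.0 for s in visits}
    for _ in range(sweeps):  # iterative policy evaluation on the fitted model
        for s in visits:
            expected_next = sum(
                n / visits[s] * (V[nxt] if nxt is not None else 0.0)
                for nxt, n in counts[s].items()
            )
            V[s] = reward_sum[s] / visits[s] + expected_next
    return V


if __name__ == "__main__":
    print("batch MC:", batch_mc(EPISODES))
    print("certainty equivalence:", certainty_equivalence(EPISODES))
```

Both methods agree on V(B), because B is always followed by termination. They differ on V(A): MC averages the single return seen from A, while the fitted model says A always leads to B and so inherits B's value.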
18
Learning An Action-Value Function
19
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
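A sketch of the Sarsa update, assuming a tabular Q stored in a dictionary keyed by (state, action). The helper names, default step size, and the use of ε-greedy action selection (a common way to keep exploring while staying nearly greedy) are assumptions, not details recovered from the slide.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """On-policy update: the target uses the action actually chosen next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs default to 0
```

In an episode loop, the agent would pick a_next with `epsilon_greedy` before calling `sarsa_update`, then carry a_next forward as the action it really takes. That is what makes the method on-policy.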
20
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
21
Results of Sarsa on the Windy Gridworld
22
Q-Learning: Off-Policy TD Control
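A sketch of the one-step Q-learning update, assuming a tabular Q stored in a dictionary keyed by (state, action). The function name, defaults, and the `terminal` flag are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0,
                      terminal=False):
    """Off-policy update: the target uses the best next action, whichever
    action the behaviour policy goes on to take."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs default to 0
```

The max over next actions is the one change from a Sarsa-style update: the learned values describe the greedy policy even while the agent explores with a different one.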
23
Cliffwalking
ε-greedy, ε = 0.1
24
Actor-Critic Methods
• Explicit representation of policy as well as value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models
25
Actor-Critic Details
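The slide's equations did not survive extraction, so the following is only a sketch of one common tabular formulation, not a reconstruction of the slide. It assumes the critic keeps state values V, the actor keeps action preferences p(s, a) turned into probabilities by a softmax, and both learn from the same TD error. All names and step sizes are illustrative.

```python
import math
from collections import defaultdict


def softmax_policy(prefs, state, actions):
    """Action probabilities from the actor's preferences p(s, a)."""
    exps = [math.exp(prefs[(state, a)]) for a in actions]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(V, prefs, s, a, r, s_next, alpha=0.1, beta=0.1,
                      gamma=0.9):
    """One update: the critic computes the TD error, both parts learn from it."""
    delta = r + gamma * V[s_next] - V[s]  # TD error from the critic
    V[s] += alpha * delta                 # critic: state-value update
    prefs[(s, a)] += beta * delta         # actor: strengthen or weaken a in s
    return delta


V = defaultdict(float)
prefs = defaultdict(float)
```

A positive TD error means the action turned out better than expected, so its preference rises; a negative one lowers it.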
26
Dopamine Neurons and TD Error
W. Schultz et al., Universite de Fribourg
27
Average Reward Per Time Step
the same for each state if ergodic
28
R-Learning
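A sketch of one R-learning step as it is usually presented for undiscounted continuing tasks, assuming a tabular Q keyed by (state, action) and a running estimate rho of the average reward per time step. The function name and step sizes are illustrative assumptions; the slide's own algorithm box was lost in extraction.

```python
from collections import defaultdict


def r_learning_update(Q, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One R-learning step; returns the updated average-reward estimate rho."""
    best_next = max(Q[(s_next, b)] for b in actions)
    # Values are relative to the average reward, so rho is subtracted from r.
    Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
    best_here = max(Q[(s, b)] for b in actions)
    if Q[(s, a)] == best_here:  # only adjust rho after greedy actions
        rho += beta * (r - rho + best_next - best_here)
    return rho


Q = defaultdict(float)
```

Restricting the rho update to greedy actions keeps exploratory moves from biasing the average-reward estimate.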
29
Access-Control Queuing Task
• n servers
• Customers have four different priorities, which pay reward of 1, 2, 3, or 4, if served
• At each time step, customer at head of queue is accepted (assigned to a server) or removed from the queue
• Proportion of randomly distributed high priority customers in queue is h
• Busy server becomes free with probability p on each time step
• Statistics of arrivals and departures are unknown
n=10, h=.5, p=.06
Apply R-learning
30
Afterstates
• Usually, a state-value function evaluates states in which the agent can take an action.
• But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
• Why is this useful?
• What is this in general?
31
Summary
• TD prediction
• Introduced one-step tabular model-free TD methods
• Extend prediction to control by employing some form of GPI
    On-policy control: Sarsa
    Off-policy control: Q-learning and R-learning
• These methods bootstrap and sample, combining aspects of DP and MC methods