Lecture 18: Temporal-Difference Learning

Lecture 18: Temporal-Difference Learning
- TD prediction
- Relation of TD with Monte Carlo and Dynamic Programming
- Learning action values (the control problem)
- The exploration/exploitation dilemma

TD Prediction
- Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
- Recall the constant-α (every-visit) Monte Carlo update: V(S_t) ← V(S_t) + α[G_t − V(S_t)], whose target G_t is the actual return after time t.
- The simplest TD method, TD(0), uses V(S_t) ← V(S_t) + α[R_{t+1} + γ V(S_{t+1}) − V(S_t)], whose target R_{t+1} + γ V(S_{t+1}) is an estimate of the return.
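Below is a minimal sketch of tabular TD(0) prediction in Python. The environment interface (`reset()` / `step(action)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not part of the lecture.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Estimate V^pi with tabular TD(0), updating after every single step."""
    V = collections.defaultdict(float)                 # state -> value estimate
    for _ in range(num_episodes):
        state = env.reset()                            # assumed env interface
        done = False
        while not done:
            action = policy(state)                     # follow the fixed policy pi
            next_state, reward, done = env.step(action)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])    # TD(0) update toward bootstrapped target
            state = next_state
    return V
```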

Simple Monte Carlo [backup diagram: the update backs up the entire sampled episode, down to the terminal state]

Simplest TD Method [backup diagram: a single sampled transition, bootstrapping from the estimated value of the next state]

Dynamic Programming [backup diagram: a full-width one-step backup over all possible next states, using the model's expectation]

TD Bootstraps and Samples
- Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
- Sampling: the update does not involve an expected value. MC samples; DP does not sample; TD samples.
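A compact way to see the contrast (written in standard notation; the MC and TD(0) updates are the ones given above):

```latex
\begin{aligned}
\text{MC (sample, no bootstrap):}\quad & \text{target} = G_t \\
\text{TD(0) (sample and bootstrap):}\quad & \text{target} = R_{t+1} + \gamma V(S_{t+1}) \\
\text{DP (bootstrap, no sample):}\quad & V(S_t) \leftarrow \mathbb{E}_\pi\big[\, R_{t+1} + \gamma V(S_{t+1}) \mid S_t \,\big]
\end{aligned}
```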

Example: Driving Home

Driving Home [figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

Advantages of TD Learning
- TD methods do not require a model of the environment, only experience.
- TD, but not MC, methods can be fully incremental:
  - You can learn before knowing the final outcome (less memory, less peak computation).
  - You can learn without the final outcome, from incomplete sequences.
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. The whole experiment was repeated 100 times.

Optimality of TD(0)
- Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
- Compute updates according to TD(0), but only update estimates after each complete pass through the data.
- For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
- Constant-α MC also converges under these conditions, but to a different answer!
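A sketch of batch TD(0) as described above. The data layout is an assumption for illustration: `episodes` is a list of episodes, each a list of `(state, reward, next_state, done)` transitions. Increments are accumulated with TD(0) but applied only after each complete pass, and passes repeat until the value function stops changing.

```python
import collections

def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep a fixed data set repeatedly until convergence."""
    V = collections.defaultdict(float)
    while True:
        increments = collections.defaultdict(float)
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + (0.0 if done else gamma * V[next_state])
                increments[state] += alpha * (target - V[state])
        for state, delta in increments.items():        # apply only after the full pass
            V[state] += delta
        if not increments or max(abs(d) for d in increments.values()) < tol:
            return V
```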

You are the Predictor
Suppose you observe the following 8 episodes (state, reward pairs):
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

You are the Predictor [figure: the Markov model implied by the data — A transitions to B with reward 0; from B, reward 1 with probability 0.75 and reward 0 with probability 0.25]

You are the Predictor
- The prediction that best matches the training data is V(A) = 0.
  - This minimizes the mean squared error on the training set.
  - This is what a batch Monte Carlo method gets.
- If we take the sequential (Markov) structure of the problem into account, we would instead set V(A) = 0.75.
  - This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we fit the best Markov model, assume it is exactly correct, and then compute what it predicts (how?).
  - This is called the certainty-equivalence estimate.
  - This is what TD(0) gets.
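Worked out for the eight episodes listed above, the two answers are:

```latex
\begin{aligned}
V(B) &= \tfrac{6}{8} = 0.75 && \text{(6 of the 8 visits to B were followed by reward 1)} \\
V(A) &= 0 && \text{(batch MC: the only episode starting in A returned 0)} \\
V(A) &= 0 + V(B) = 0.75 && \text{(certainty-equivalence / batch TD(0): A always moves to B with reward 0)}
\end{aligned}
```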

Learning An Action-Value Function
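The slide's update equation did not survive transcription; the standard TD(0) update for action values (as in Sutton & Barto) is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\big]
```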

Control Methods
- TD can be used in generalized policy iteration.
- Main idea: make the policy greedy after every trial.
- Control methods aim at making the policy more greedy after every step.
- They are usually built around computing value functions.

The Exploration/Exploitation Dilemma
- Suppose you form action-value estimates Q_t(a).
- The greedy action at t is A_t* = argmax_a Q_t(a).
- Choosing the greedy action is exploitation; choosing a non-greedy action is exploration.
- You can't exploit all the time; you can't explore all the time.
- You can never stop exploring, but you should always reduce exploring.

ε-Greedy Action Selection
- Greedy selection: always pick A_t = argmax_a Q_t(a).
- ε-greedy: with probability 1 − ε pick the greedy action, and with probability ε pick an action at random.
- This is the simplest way to try to balance exploration and exploitation.
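A minimal ε-greedy selection sketch; `Q` is assumed to be a mapping from `(state, action)` pairs to estimated values (an illustrative choice, not from the slides).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise pick the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit (ties broken arbitrarily)
```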

Softmax Action Selection
- Softmax action selection methods grade action probabilities by estimated values.
- The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a at time t with probability e^{Q_t(a)/τ} / Σ_b e^{Q_t(b)/τ}, where τ is the "computational temperature".
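A sketch of Gibbs/Boltzmann sampling with temperature `tau`, using the same assumed `Q` mapping as above:

```python
import math
import random

def softmax_action(Q, state, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q/tau)."""
    prefs = [Q[(state, a)] / tau for a in actions]
    m = max(prefs)                                   # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```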

Optimistic Initial Values
- All methods so far depend on the initial Q values, i.e., they are biased by them.
- Suppose instead we initialize the action values optimistically.
- Then exploration arises because the agent is always expecting more than it gets.
- The method is still biased by its initial values, but the bias is transient and drives early exploration.

Sarsa: On-Policy TD Control
- Turn TD into a control method by always updating the policy to be greedy with respect to the current estimate:
  - In every state s, choose action a according to policy π based on the current Q-values (e.g., ε-greedy, softmax).
  - Go to state s', choose action a' the same way, and update Q(s, a) toward r + γ Q(s', a').
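A sketch of the Sarsa loop under the same assumptions as the earlier snippets (hypothetical `env` with `reset()`/`step()`, and the `epsilon_greedy` helper sketched above):

```python
import collections

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control: the target uses the action actually taken next."""
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```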

Q-Learning: Off-Policy TD Control
- In this case, the agent can behave according to any policy (as long as it keeps trying all actions in all states).
- The algorithm is guaranteed to converge to the correct values.
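The standard Q-learning update: regardless of which action the behaviour policy takes next, the target always uses the greedy action in the next state.

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[\, R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \,\big]
```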

Cliffwalking (ε-greedy, ε = 0.1)

Dopamine Neurons and TD Error (W. Schultz et al., Université de Fribourg)

Summary
- TD prediction: one-step, tabular, model-free TD methods
- Extend prediction to control by employing some form of GPI
- On-policy control: Sarsa
- Off-policy control: Q-learning
- These methods bootstrap and sample, combining aspects of DP and MC methods