Presentation transcript:

ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 10: Temporal-Difference Learning
October 6, 2011
Dr. Itamar Arel
Department of Electrical Engineering and Computer Science, College of Engineering
The University of Tennessee
Fall 2011

Introduction to Temporal-Difference (TD) Learning & TD Prediction
- If one had to identify one idea as central and novel to RL, it would undoubtedly be temporal-difference (TD) learning
- It combines ideas from DP and Monte Carlo: it learns without a model (like MC) and bootstraps (like DP)
- Both TD and Monte Carlo methods use experience to solve the prediction problem
- We will focus on the prediction problem (a.k.a. policy evaluation): evaluating V(s) for a given policy
- A simple every-visit MC method may be expressed as
  $V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$
  where the target $G_t$ is the actual return observed after time t
- Let's call this constant-α MC
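Since the update formula on this slide is an image in the original deck, here is a minimal Python sketch of the constant-α every-visit MC update as described above (my own illustration, not the course's code; the episode representation is an assumption):

```python
# Illustrative sketch: constant-alpha every-visit Monte Carlo prediction.
# `episode` is assumed to be a list of (state, reward) pairs, where the
# reward is the one received on the transition out of that state.
# V is a dict mapping states to value estimates.

def constant_alpha_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Apply V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)) for every visit."""
    G = 0.0
    # Walk the episode backwards so each return G_t is built incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V
```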

TD Prediction (cont.)
- Recall that in MC we need to wait until the end of the episode to update the value estimates
- The idea of TD is to do so at every time step
- Simplest TD method, TD(0):
  $V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
  where the target $R_{t+1} + \gamma V(S_{t+1})$ is an estimate of the return
- Essentially, we are updating one guess based on another; the idea is that we have a "moving target"
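A correspondingly minimal sketch of applying the TD(0) update after a single observed transition (again my own illustration; V is assumed to be a dict of value estimates, and terminal states are valued at 0):

```python
# Illustrative sketch: one TD(0) update for the observed transition (s, r, s_next).

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), with V(terminal) = 0."""
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))  # the "moving target"
    v = V.get(s, 0.0)
    V[s] = v + alpha * (target - v)
    return V[s]
```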

Simple Monte Carlo (backup diagram: the update backs up the complete sampled return, all the way to the terminal state T)

Simplest TD Method (backup diagram: the update backs up from a single sampled next state, one step deep)

Dynamic Programming (backup diagram: the update is a full one-step backup over all possible next states, using the model)

Tabular TD(0) for estimating $V^\pi$
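The algorithm box on this slide is an image in the original deck; the following is a hedged Python reconstruction of tabular TD(0) policy evaluation. The env.reset()/env.step() interface and the policy callable are assumptions made for illustration, not part of the course material:

```python
from collections import defaultdict

def tabular_td0(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (sketch).

    Assumes a hypothetical `env` with reset() -> state and
    step(action) -> (next_state, reward, done), and a `policy`
    callable mapping a state to an action.
    """
    V = defaultdict(float)                    # V(s) initialized to 0 for all s
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha * (target - V(s))
            s = s_next
    return dict(V)
```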

TD Methods Bootstrap and Sample
- Bootstrapping: the update involves an existing estimate (i.e., a guess from a guess)
  - Monte Carlo does not bootstrap
  - Dynamic Programming bootstraps
  - Temporal Difference bootstraps
- Sampling: the update does not involve an expected value
  - Monte Carlo samples
  - Dynamic Programming does not sample
  - Temporal Difference samples

Example: Driving Home (table of states, elapsed times, and predicted times-to-go; the elapsed times between states act as rewards, and the actual remaining times are the returns from each state)

Example: Driving Home (cont.)
- The value of each state is its expected time-to-go
- Figures: changes recommended by Monte Carlo methods (α=1) vs. changes recommended by TD methods (α=1)

Example: Driving Home (cont.)
- Is it really necessary to wait until the end of the episode to start learning? Monte Carlo says it is; TD learning argues that learning can occur on-line
- Suppose, on another day, you again estimate when leaving your office that it will take 30 minutes to drive home, but then you get stuck in a massive traffic jam
- Twenty-five minutes after leaving the office you are still bumper-to-bumper on the highway
- You now estimate that it will take another 25 minutes to get home, for a total of 50 minutes
- Must you wait until you get home before increasing your estimate for the initial state?
- In TD you would shift your initial estimate from 30 minutes toward 50, as worked out below
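As a small worked instance of that shift (my own arithmetic, using the slide's quantities): 25 minutes have already elapsed and 25 more are now predicted, so the TD target for the initial state is 50 minutes, and with step size α the update is

$$ V(\text{leaving office}) \leftarrow 30 + \alpha\big[(25 + 25) - 30\big] = 30 + 20\alpha, $$

which moves the estimate all the way to 50 when α = 1.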

Advantages of TD Learning
- TD methods do not require a model of the environment, only experience
- TD, but not MC, methods can be fully incremental
  - The agent learns a "guess from a guess"
  - The agent can learn before knowing the final outcome: less memory, reduced peak computation
  - The agent can learn without the final outcome, i.e., from incomplete sequences; this helps with applications that have very long episodes
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster? Currently unknown; generally TD does better on stochastic tasks

Random Walk Example
- In this example we empirically compare the prediction abilities of TD(0) and constant-α MC applied to a small Markov process with five non-terminal states A-E
- All episodes start in state C and move one state to the right or left, with equal probability, at each step
- Termination on the right yields a reward of +1; termination on the left yields 0
- True values: V(A)=1/6, V(B)=2/6, V(C)=3/6, V(D)=4/6, V(E)=5/6
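A self-contained sketch of this experiment (my own code, not the course's; function names, the 0.5 initialization, and step sizes are illustrative choices):

```python
import random

STATES = ["A", "B", "C", "D", "E"]                  # non-terminal states; episodes start in C
TRUE_V = {s: (i + 1) / 6 for i, s in enumerate(STATES)}

def run_episode():
    """One random walk: a list of (state, reward) steps; reward +1 only on exiting right."""
    i = STATES.index("C")
    steps = []
    while 0 <= i < len(STATES):
        s = STATES[i]
        i += random.choice([-1, +1])
        reward = 1.0 if i >= len(STATES) else 0.0   # right termination pays +1, else 0
        steps.append((s, reward))
    return steps

def td0(num_episodes, alpha=0.1):
    V = {s: 0.5 for s in STATES}                    # initial estimates of 0.5
    for _ in range(num_episodes):
        ep = run_episode()
        for t, (s, r) in enumerate(ep):
            v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0   # V(terminal) = 0
            V[s] += alpha * (r + v_next - V[s])
    return V

def constant_alpha_mc(num_episodes, alpha=0.1):
    V = {s: 0.5 for s in STATES}
    for _ in range(num_episodes):
        ep = run_episode()
        G = sum(r for _, r in ep)                   # undiscounted return (0 or 1)
        for s, _ in ep:
            V[s] += alpha * (G - V[s])
    return V

if __name__ == "__main__":
    random.seed(0)
    print("TD(0):           ", td0(100))
    print("constant-alpha MC:", constant_alpha_mc(100))
    print("true values:     ", TRUE_V)
```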

Random Walk Example (cont.) (learning-curve figure; data averaged over 100 sequences of episodes)

Optimality of TD(0)
- Suppose only a finite amount of experience is available, say 10 episodes or 100 time steps
- Intuitively, we repeatedly present the experience until convergence is achieved
- Updates are made only after processing a complete batch of training data; this is called batch updating
- For any finite Markov prediction task, under batch updating, TD(0) converges deterministically to a single answer for sufficiently small α
- The constant-α MC method also converges deterministically, but to a different answer
- To better understand the difference between MC and TD(0), we'll consider the batch random walk
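A hedged sketch of batch updating for TD(0), reusing STATES and the episode format of run_episode() from the random-walk sketch above (the convergence test and step size are my own choices):

```python
def batch_td0(episodes, alpha=0.001, tol=1e-6):
    """Batch-updating TD(0) (sketch): sweep over all stored episodes,
    accumulate the TD increments, apply them, and repeat until the
    value function stops changing."""
    V = {s: 0.5 for s in STATES}
    while True:
        delta = {s: 0.0 for s in STATES}
        for ep in episodes:
            for t, (s, r) in enumerate(ep):
                v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
                delta[s] += alpha * (r + v_next - V[s])
        if max(abs(d) for d in delta.values()) < tol:
            return V
        for s in STATES:
            V[s] += delta[s]
```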

Optimality of TD(0) (cont.)
- After each new episode, all previously observed episodes were treated as a batch, and the algorithm was trained on that batch until convergence; the whole experiment was repeated 100 times
- A key question: what would explain the two resulting learning curves (batch TD(0) vs. batch MC)?

You are the Predictor
- Suppose you observe the following 8 episodes:
  1) A, 0, B, 0
  2) B, 1
  3) B, 1
  4) B, 1
  5) B, 1
  6) B, 1
  7) B, 1
  8) B, 0
- Q: What would you guess V(A) and V(B) to be?

You are the Predictor (cont.)
- V(A) = 3/4 is the answer that batch TD(0) gives
- The other reasonable answer is simply to say that V(A) = 0 (why?); this is the answer that batch MC gives
- If the process is Markovian, we expect that the TD(0) answer will produce lower error on future data, even though the Monte Carlo answer is better on the existing data

TD(0) vs. MC
- For MC, the prediction that best matches the training data is V(A)=0; this minimizes the mean-square error on the training set, and it is what a batch Monte Carlo method gets
- If we consider the sequentiality of the problem, then we would set V(A)=0.75
- This is the correct value under the maximum-likelihood estimate of the Markov model generating the data, i.e., if we build a best-fit Markov model, assume it is exactly correct, and then compute what it predicts
- This is called the certainty-equivalence estimate, and it is what batch TD(0) yields
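Spelling out the arithmetic behind the two answers from the 8 episodes above: the best-fit Markov model has A going to B with probability 1 and reward 0, and B terminating with reward 1 in 6 of its 8 occurrences, so

$$ \hat V(B) = \tfrac{6}{8} = 0.75, \qquad \hat V(A)_{\text{batch TD(0)}} = 0 + \hat V(B) = 0.75, \qquad \hat V(A)_{\text{batch MC}} = 0, $$

since the single episode containing A had a return of 0.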

Learning an Action-Value Function
- We now consider the use of TD methods for the control problem
- As with MC, we need to balance exploration and exploitation
- Again, there are two schemes: on-policy and off-policy
- We'll start with on-policy, and learn the action-value function Q(s,a)

SARSA: On-Policy TD(0) Learning
- After every transition (s, a, r, s', a'), update
  $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma Q(s',a') - Q(s,a)\,]$
- One can easily turn this into a control method by always updating the policy to be greedy with respect to the current estimate of Q(s,a)
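A sketch of SARSA as described on this slide (my own code; the ε-greedy exploration scheme and the env interface are assumptions for illustration, not the course's implementation):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behave greedily w.r.t. Q most of the time, explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD(0) control (SARSA) sketch.  `env` is a hypothetical
    environment with reset() -> state and step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, epsilon)      # action actually taken next
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])         # Q <- Q + alpha * (target - Q)
            s, a = s2, a2
    return Q
```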

Q-Learning: Off-Policy TD Control
- One of the most important breakthroughs in RL was the development of Q-learning, an off-policy TD control algorithm (Watkins, 1989)
- Its one-step update is
  $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$
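For comparison with the SARSA sketch above, a corresponding Q-learning sketch (reusing epsilon_greedy, the imports, and the same assumed env interface; again illustrative rather than the course's implementation):

```python
def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (Q-learning) sketch.  The behaviour policy is
    epsilon-greedy, but the update bootstraps from max over a' of Q(s', a')."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Note how the action actually executed next plays no role in the update; only the greedy maximum over next actions does, which is what makes the method off-policy.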

Q-Learning: Off-Policy TD Control (cont.)
- The learned action-value function, Q, directly approximates the optimal action-value function, Q*
- It converges as long as all state-action pairs continue to be visited and their values updated
- Why is it considered an off-policy control method?
- How expensive is it to implement?