Off-Policy Temporal-Difference Learning with Function Approximation
Doina Precup, McGill University
Rich Sutton and Sanjoy Dasgupta, AT&T Labs

Off-policy Learning
Learning about a way of behaving without behaving (exactly) that way
The target policy must be covered by the source (behavior) policy: every action the target might take can also be taken by the behavior
E.g., Q-learning learns about the greedy policy while following something more exploratory
Learning about many macro-action policies at once
We need off-policy learning!

RL Algorithm Space
Three things we want: TD learning, linear function approximation, and off-policy learning
We need all 3, but previous methods give only 2 at a time:
TD + linear FA (on-policy): linear TD(λ) is stable (Tsitsiklis & Van Roy 1997; Tadic 2000)
TD + off-policy (tabular): Q-learning and options methods are stable
All three together: Boom! (Baird 1995; Gordon 1995; NDP 1996)

Baird's Counterexample
A Markov chain (no actions) with linear function approximation
All states updated equally often, synchronously
An exact solution exists: θ = 0
Initial parameters: θ_0 = (1, 1, 1, 1, 1, 10, 1)^T
Yet the parameter vector diverges
[figure: the counterexample's states, features, and diverging parameters]

Importance Sampling
Re-weighting samples according to their "importance," correcting for a difference in sampling distribution
For example, an episode s_0, a_0, r_1, s_1, a_1, ..., s_T has probability
  ∏_t π(a_t | s_t) Pr{s_{t+1} | s_t, a_t}
under the target policy π, so its importance relative to the behavior policy π' is the ratio
  ρ = ∏_t π(a_t | s_t) / π'(a_t | s_t)
(the unknown transition probabilities cancel)
Weighting by ρ corrects for oversampling under π'
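
As a rough illustration, the episode weight above can be computed directly from logged data. This is a minimal sketch, assuming target_prob and behavior_prob return π(a|s) and π'(a|s); both names are placeholders, not from the slides.

```python
def episode_importance_weight(episode, target_prob, behavior_prob):
    """Importance weight of one recorded episode.

    episode: list of (state, action) pairs actually taken under the
    behavior policy pi'.  The transition probabilities cancel, so only
    the two policies are needed.
    """
    rho = 1.0
    for state, action in episode:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho
```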

Naïve Importance Sampling Algorithm
Weight every TD update of an episode by the whole-episode importance ratio ρ:
  Update_t = ρ × (regular linear TD(λ) update_t)
Converts off-policy to on-policy in expectation
The on-policy convergence theorem then applies (Tsitsiklis & Van Roy 1997; Tadic 2000)
But the variance is high, so convergence is very slow
We can do better!
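
A minimal sketch of the naive scheme, assuming accumulating eligibility traces and assuming the whole-episode weight rho has already been computed (e.g., with the helper above); the transition-tuple layout and helper names are illustrative, not the paper's.

```python
import numpy as np

def naive_is_td_lambda(episode, rho, theta, phi, alpha, gamma, lam):
    """Apply TD(lambda) updates for one recorded episode, each scaled by the
    whole-episode importance weight rho (so updates are applied only after
    the episode is complete and rho is known).

    episode: list of (s, a, r, s_next, a_next, done) transitions from pi'.
    phi(s, a): feature vector for a state-action pair.
    """
    e = np.zeros_like(theta)                      # accumulating eligibility trace
    for s, a, r, s_next, a_next, done in episode:
        x = phi(s, a)
        q = theta @ x
        q_next = 0.0 if done else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q            # ordinary TD error
        e = gamma * lam * e + x
        theta = theta + alpha * rho * delta * e   # whole update scaled by rho
    return theta
```

Because a single ρ multiplies every update of the episode, one unlikely episode can swing the parameters wildly, which is exactly the variance problem noted above.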

Linear Function Approximation
Approximate the action-value function Q^π(s, a) as a linear form:
  Q_θ(s, a) = θ^T φ(s, a)
where φ(s, a) is a feature vector representing the state-action pair (s, a) and θ is the modifiable parameter vector
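
In code the approximation is just an inner product; a minimal sketch, assuming phi maps a state-action pair to a fixed-length NumPy feature vector (names are illustrative).

```python
import numpy as np

def q_value(theta, phi, s, a):
    """Linear action-value estimate: Q_theta(s, a) = theta^T phi(s, a)."""
    return float(np.dot(theta, phi(s, a)))
```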

The New Algorithm: Per-Decision Importance-Sampled TD(λ)
Updating after each episode; the slide shows the λ = 1 case (see the paper for general λ)
Compared with ordinary linear TD(λ), each per-step update is weighted by the importance-sampling ratios of only the decisions that actually affect it, rather than by the whole-episode ratio
[equations: the λ = 1 episode updates for linear TD(λ) and for per-decision importance-sampled TD(λ), shown side by side]
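
A hedged sketch of the quantity such an algorithm estimates, in notation assumed here rather than copied from the slide: the per-decision corrected return re-weights each future reward by only the ratios of the decisions that led to it (with the return set to 0 at termination),

```latex
\hat{R}_t = r_{t+1} + \gamma\,\rho_{t+1}\,\hat{R}_{t+1},
\qquad
\rho_k = \frac{\pi(a_k \mid s_k)}{\pi'(a_k \mid s_k)},
\qquad
\mathbb{E}_{\pi'}\!\left[\hat{R}_t \mid s_t, a_t\right] = Q^{\pi}(s_t, a_t).
```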

Main Result
The expected total change over an episode for the new algorithm (behaving under π') equals the expected total change for conventional TD(λ) (behaving under π)
Convergence Theorem (based on Tsitsiklis & Van Roy 1997):
Under the usual assumptions, plus one annoying assumption (e.g., bounded episode length), the new algorithm converges to the same θ_∞ as on-policy TD(λ)
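
Stated in symbols (a sketch; Δθ denotes the total parameter change accumulated over one episode, a notational convenience rather than the slide's own notation):

```latex
\mathbb{E}_{\pi'}\!\left[ \Delta\theta^{\text{per-decision IS TD}(\lambda)} \right]
=
\mathbb{E}_{\pi}\!\left[ \Delta\theta^{\text{on-policy TD}(\lambda)} \right]
```

This expected-update equivalence is what lets the on-policy analysis of Tsitsiklis & Van Roy carry over to the off-policy setting.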

The variance assumption is restrictive
But it can often be satisfied with "artificial" terminations
Consider a modified MDP with bounded episode length
–We have data for this MDP
–Our result assures good convergence for it
–Its solution can be made close to the solution of the original problem
–By choosing the episode bound long relative to the horizon implied by γ, or to the mixing time
Consider the application to macro-actions
–Here it is the macro-action that terminates
–Termination is artificial; the real process is unaffected
–Yet all the results apply directly to learning about macro-actions
–We can choose macro-action termination to satisfy the variance condition

Empirical Illustration
The agent always starts at S; terminal states are marked G; actions are deterministic
The behavior policy chooses up and down with one set of probabilities; the target policy chooses them with different probabilities
If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one

Trajectories of Two Components of θ
λ = 0.9, with α decreased over time
θ appears to converge, as advertised
[figure: θ(leftmost, down) and θ(rightmost, down) versus episodes (×100,000), approaching their asymptotic values θ*(leftmost, down) and θ*(rightmost, down)]

Comparison of the Naïve and Per-Decision IS Algorithms
λ = 0.9, with α held constant
Root mean squared error after 100,000 episodes, averaged over 50 runs
[figure: RMSE of the Naïve IS and Per-Decision IS algorithms as a function of the step size α (log_2 scale)]
Precup, Sutton & Dasgupta, 2001

Can Weighted IS Help the Variance?
Return to the tabular case and consider two estimators of Q^π(s, a), where R_i is the i-th return observed following s, a and w_i is the corresponding IS correction product:
Ordinary IS: Q_n(s, a) = (w_1 R_1 + ... + w_n R_n) / n
  converges with finite variance iff the w_i have finite variance
Weighted IS: Q_n(s, a) = (w_1 R_1 + ... + w_n R_n) / (w_1 + ... + w_n)
  converges with finite variance even if the w_i have infinite variance
Can this be extended to the FA case?
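
A minimal tabular sketch of the two estimators, assuming returns[i] is the i-th return observed from (s, a) and weights[i] is the corresponding correction product (argument names are illustrative).

```python
import numpy as np

def ordinary_is_estimate(returns, weights):
    """Q_n = (w_1 R_1 + ... + w_n R_n) / n.
    Unbiased, but its variance is finite only if the weights' variance is."""
    returns, weights = np.asarray(returns, float), np.asarray(weights, float)
    return float(np.sum(weights * returns) / len(returns))

def weighted_is_estimate(returns, weights):
    """Q_n = (w_1 R_1 + ... + w_n R_n) / (w_1 + ... + w_n).
    Biased for finite n, but bounded by the observed returns, so it behaves
    well even when the weights have infinite variance."""
    returns, weights = np.asarray(returns, float), np.asarray(weights, float)
    return float(np.sum(weights * returns) / np.sum(weights))
```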

Restarting within an Episode
We can consider episodes to start at any time
This alters the weighting of states
–But we still converge
–And to near the best answer (for the new weighting)

Incremental Implementation
The updates can also be computed incrementally, step by step, rather than all at once at the end of the episode
At the start of each episode: [initialization equations from the slide]
On each step: [per-step update equations from the slide]
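
Since the update equations themselves did not survive the transcript, here is a minimal sketch of one incremental formulation of per-decision importance-sampled TD(λ) for linear action values. It follows the structure described on the earlier slides (per-step ratios correcting the bootstrap term and the eligibility trace, plus a running product correcting the weighting of earlier state-action pairs), but the exact placement of the corrections is an assumption, not the paper's verbatim rules.

```python
import numpy as np

def per_decision_is_td_lambda(episode, theta, phi, target_prob, behavior_prob,
                              alpha, gamma, lam):
    """One plausible incremental form (an illustration, not the paper's exact equations).

    episode: list of (s, a, r, s_next, a_next, done) transitions generated by
    the behavior policy pi'; phi(s, a) returns a feature vector.
    """
    e = np.zeros_like(theta)   # eligibility trace
    prefix = 1.0               # product of ratios for all earlier decisions
    for s, a, r, s_next, a_next, done in episode:
        rho = target_prob(s, a) / behavior_prob(s, a)
        rho_next = 0.0 if done else target_prob(s_next, a_next) / behavior_prob(s_next, a_next)

        q = theta @ phi(s, a)
        q_next = 0.0 if done else theta @ phi(s_next, a_next)

        # TD error with a per-decision correction on the bootstrap term only.
        delta = r + gamma * rho_next * q_next - q

        # Decay the trace, add the current features weighted by the corrections
        # accumulated so far, then correct for the current decision.
        e = rho * (gamma * lam * e + prefix * phi(s, a))
        prefix *= rho

        theta = theta + alpha * delta * e
    return theta
```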

Conclusion
First off-policy TD methods with linear FA
–Certainly not the last
–Somewhat greater efficiencies are undoubtedly possible
But the problem is so important
Can't we do better?
–Is there no other approach?
–Something other than importance sampling?
I can't think of a credible alternative approach
Perhaps experimentation in a nontrivial domain would suggest other possibilities...