
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Formulating MDPs
- Formulating MDPs: Rewards, Returns, Values
- Example domains: Escalator, Elevators, Parking, Object recognition, Parallel parking, Ship steering, Bioreactor, Helicopter, Go, Protein sequences, Football, Baseball, Traveling salesman, Quake, Scene interpretation, Robot walking, Portfolio management, Stock index prediction

Chapter 7: Eligibility Traces

N-step TD Prediction
- Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)

Mathematics of N-step TD Prediction
- Monte Carlo: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … + γ^{T−t−1}R_T
- TD: G_t^{(1)} = R_{t+1} + γV_t(S_{t+1}) (use V to estimate the remaining return)
- 2-step return: G_t^{(2)} = R_{t+1} + γR_{t+2} + γ²V_t(S_{t+2})
- n-step return: G_t^{(n)} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n V_t(S_{t+n})
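The n-step return above can be sketched in a few lines of code; this is a minimal illustration assuming a tabular value table and a recorded list of episode rewards (the function name and indexing convention are illustrative, not from the slides):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards, then bootstrap from V(S_{t+n}).

    rewards[k] holds R_{k+1}; values[k] holds V(S_k); the episode ends
    after len(rewards) steps, at which point no bootstrap term is added.
    """
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):        # R_{t+1} ... R_{t+n} (or to the end)
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                          # bootstrap only if not yet terminal
        G += (gamma ** n) * values[t + n]
    return G
```

With n = 1 this reduces to the one-step TD target, and with n ≥ T − t it is the complete Monte Carlo return.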

Learning with N-step Backups
- Backup (on-line or off-line): ΔV_t(S_t) = α[G_t^{(n)} − V_t(S_t)]
- Error reduction property of n-step returns: the maximum error using the n-step return is at most γ^n times the maximum error using V: max_s |E_π[G_t^{(n)} | S_t = s] − V^π(s)| ≤ γ^n max_s |V_t(s) − V^π(s)|
- Using this, you can show that n-step methods converge

Random Walk Examples
- How does 2-step TD work here?
- How about 3-step TD?

A Larger Example
- Task: 19-state random walk
- Do you think there is an optimal n? For everything?

Averaging N-step Returns
- n-step methods were introduced to help with understanding TD(λ)
- Idea: backup an average of several returns, e.g. half of the 2-step return and half of the 4-step return: G_t^{avg} = ½G_t^{(2)} + ½G_t^{(4)}
- Called a complex backup: draw each component and label it with its weight; together they form one backup

Forward View of TD(λ)
- TD(λ) is a method for averaging all n-step backups, weighting the n-step return by (1−λ)λ^{n−1}
- λ-return: G_t^λ = (1−λ) Σ_{n=1}^∞ λ^{n−1} G_t^{(n)}
- Backup using the λ-return: ΔV_t(S_t) = α[G_t^λ − V_t(S_t)]
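The λ-return can be computed directly from its definition; a sketch for an episodic task, using the same reward/value indexing as the n-step return earlier (names are illustrative):

```python
def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: the (1-lam)*lam^{n-1}-weighted average of n-step returns,
    with the leftover weight lam^{T-t-1} on the complete return."""
    T = len(rewards)                        # rewards[k] = R_{k+1}, values[k] = V(S_k)

    def n_step(n):
        G = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
        if t + n < T:
            G += gamma ** n * values[t + n]
        return G

    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return G_lam + lam ** (T - t - 1) * n_step(T - t)  # tail weight on full return
```

Setting lam = 0 recovers the one-step TD target and lam = 1 recovers the Monte Carlo return, matching the relation to TD(0) and MC discussed on a later slide.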

λ-return Weighting Function
- Until termination, each n-step return gets weight (1−λ)λ^{n−1}
- After termination, all remaining weight, λ^{T−t−1}, falls on the complete return

Relation to TD(0) and MC
- The λ-return can be rewritten to separate the terms until termination from the complete return after termination: G_t^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^{(n)} + λ^{T−t−1} G_t
- If λ = 1, you get MC: G_t^λ = G_t
- If λ = 0, you get TD(0): G_t^λ = G_t^{(1)}

Forward View of TD(λ) II
- Look forward from each state to determine its update from future states and rewards

λ-return on the Random Walk
- Same 19-state random walk as before
- Why do you think intermediate values of λ are best?

Backward View
- Shout the TD error δ_t backwards over time
- The strength of your voice decreases with temporal distance by γλ

Backward View of TD(λ)
- The forward view was for theory; the backward view is for mechanism
- New variable called the eligibility trace, e_t(s)
- On each step, decay all traces by γλ and increment the trace for the current state by 1 (the accumulating trace): e_t(s) = γλe_{t−1}(s) + 1 if s = S_t, and e_t(s) = γλe_{t−1}(s) otherwise

On-line Tabular TD(λ)
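The algorithm on this slide can be sketched as follows: a minimal tabular implementation with accumulating traces, assuming a generic episodic environment with reset()/step() methods and a fixed policy (those interface names are assumptions, not from the slides):

```python
def td_lambda_episode(V, e, policy, env, alpha, gamma, lam):
    """Run one episode of on-line tabular TD(lambda).

    V and e are dicts mapping state -> value estimate / eligibility trace.
    """
    s = env.reset()
    done = False
    while not done:
        s2, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # TD error
        e[s] += 1.0                                           # accumulating trace
        for x in e:                                           # update every state...
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam                               # ...and decay its trace
        s = s2
    return V
```

The loop over all states is the price of traces; in practice only states with non-negligible traces need updating.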

Relation of the Backward View to MC & TD(0)
- Using the update rule: ΔV_t(s) = αδ_t e_t(s)
- As before, if you set λ to 0, you get TD(0)
- If you set λ to 1, you get MC, but in a better way: you can apply TD(1) to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode)

Forward View = Backward View
- The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
- The book shows that the sum of the backward updates equals the sum of the forward updates (algebra shown in the book)
- On-line updating with small α is similar

On-line versus Off-line on Random Walk
- Same 19-state random walk
- On-line performs better over a broader range of parameters

Control: Sarsa(λ)
- Save eligibility traces for state-action pairs instead of just states

Sarsa(λ) Algorithm
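A sketch of the Sarsa(λ) algorithm with accumulating traces and an ε-greedy policy; the environment interface and helper names are assumptions for illustration:

```python
def sarsa_lambda_episode(Q, e, env, actions, alpha, gamma, lam, eps, rng):
    """One episode of tabular Sarsa(lambda); Q and e map (state, action) -> float."""
    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = None if done else eps_greedy(s2)
        delta = r + (0.0 if done else gamma * Q[(s2, a2)]) - Q[(s, a)]
        e[(s, a)] += 1.0                       # accumulating trace for the pair taken
        for k in list(e):                      # back up every traced pair
            Q[k] += alpha * delta * e[k]
            e[k] *= gamma * lam
        s, a = s2, a2
    return Q
```

Because the trace spreads each TD error back over all recently visited state-action pairs, a single successful trial can update the whole path to the goal, as the gridworld example on the next slide illustrates.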

Sarsa(λ) Gridworld Example
- With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way)
- Traces can considerably accelerate learning

Three Approaches to Q(λ)
- How can we extend this to Q-learning?
- If you mark every state-action pair as eligible, you back up over a non-greedy policy
- Watkins: zero out the eligibility traces after a non-greedy action; do the max when backing up at the first non-greedy choice

Watkins's Q(λ)
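A simplified sketch of Watkins's Q(λ): the backup target always uses the max action, and all traces are cut (zeroed) at any step where a non-greedy action is taken. The environment interface and the behavior-policy argument are assumptions, and the greedy check is a simplification of the book's version (which tests the next action):

```python
def watkins_q_lambda_episode(Q, e, env, actions, alpha, gamma, lam, behavior):
    """One episode of Watkins's Q(lambda); Q and e map (state, action) -> float."""
    s = env.reset()
    done = False
    while not done:
        a = behavior(Q, s)                                    # possibly exploratory
        was_greedy = Q[(s, a)] >= max(Q[(s, b)] for b in actions)
        s2, r, done = env.step(a)
        best = 0.0 if done else max(Q[(s2, b)] for b in actions)
        delta = r + gamma * best - Q[(s, a)]                  # off-policy (max) target
        e[(s, a)] += 1.0
        for k in list(e):
            Q[k] += alpha * delta * e[k]
            e[k] = gamma * lam * e[k] if was_greedy else 0.0  # cut traces if not greedy
        s = s2
    return Q
```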

Peng's Q(λ)
- Disadvantage of Watkins's method: early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces
- Peng: back up the max action except at the end; never cut traces
- Disadvantage: complicated to implement

Naïve Q(λ)
- Idea: is it really a problem to back up exploratory actions?
- Never zero traces; always back up the max at the current action (unlike Peng's or Watkins's)
- Is this truly naïve?
- Works well in preliminary empirical studies
- What is the backup diagram?

Comparison Task
- From McGovern and Sutton (1997), "Towards a Better Q(λ)": compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks; see that paper for other tasks and results (stochastic tasks, continuing tasks, etc.)
- Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs
- α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces

Comparison Results
- From McGovern and Sutton (1997), "Towards a Better Q(λ)"

Convergence of the Q(λ)'s
- None of the methods is proven to converge (much extra credit if you can prove any of them)
- Watkins's is thought to converge to Q*
- Peng's is thought to converge to a mixture of Q^π and Q*
- Naïve: to Q*?

Eligibility Traces for Actor-Critic Methods
- Critic: on-policy learning of V^π; use TD(λ) as described before
- Actor: needs an eligibility trace for each state-action pair
- We change the actor's update to weight the TD error by the trace, and the other actor-critic update can be modified in the same way, where each state-action trace decays by γλ and is incremented when that pair is visited

Replacing Traces
- Using accumulating traces, frequently visited states can have eligibilities greater than 1; this can be a problem for convergence
- Replacing traces: instead of adding 1 when you visit a state, set that state's trace to 1
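The two trace updates differ by a single line; a small sketch, where `traces` maps state to eligibility and the function name is illustrative:

```python
def decay_and_bump(traces, visited, gamma, lam, replacing):
    """Decay all traces by gamma*lam, then bump the visited state's trace."""
    for s in traces:
        traces[s] *= gamma * lam
    if replacing:
        traces[visited] = 1.0      # replacing trace: reset to 1, never exceeds 1
    else:
        traces[visited] += 1.0     # accumulating trace: can grow above 1 on revisits
    return traces
```

Under repeated visits the accumulating trace keeps growing, while the replacing trace is capped at 1, which is exactly the difference the next slides examine.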

Replacing Traces Example
- Same 19-state random walk task as before
- Replacing traces perform better than accumulating traces over more values of λ

Why Replacing Traces?
- Replacing traces can significantly speed learning
- They can make the system perform well for a broader set of parameters
- Accumulating traces can do poorly on certain types of tasks (why is this task particularly onerous for accumulating traces?)

More Replacing Traces
- Off-line replacing-trace TD(1) is identical to first-visit MC
- Extension to action values: when you revisit a state, what should you do with the traces for the other actions? Singh and Sutton say to set them to zero.

Implementation Issues with Traces
- Could require much more computation, but most eligibility traces are very close to zero
- If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)

Variable λ
- Can generalize to a variable λ, where λ is a function of time
- Could define, for example, λ_t as a function of the current state

Conclusions
- Eligibility traces provide an efficient, incremental way to combine MC and TD
- They include the advantages of MC (can deal with lack of the Markov property) and the advantages of TD (use the TD error, bootstrap)
- Can significantly speed learning
- Do have a cost in computation

The two views