580.691 Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 3: TD(λ) and eligibility traces.


Learning Theory Reza Shadmehr & Jörn Diedrichsen Reinforcement Learning 3: TD(λ) and eligibility traces

N-step backups. We have seen the possible virtue of backing up the temporal-difference error by more than one state. The n-step backup rule was: In your homework you should have come up with a graph like this: [Figure: homework results plotted against learning rate.]
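The equation on this slide is not reproduced in the transcript; the standard n-step backup (as in Sutton & Barto), which is presumably what the slide showed, is:

\[
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_t(s_{t+n}),
\qquad
\Delta V_t(s_t) = \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr].
\]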

Complex backups. The idea with complex backups is not to use just one n-step backup, but a mixture of backups. For example, we could average a 2-step backup and a 4-step backup. One way of doing this is to use a geometric series with parameter λ to average all possible n-step backups. This is the forward view of the important algorithm TD(λ). It is called the forward view because we look from each state forward in time to see how the value function of that state is updated. If λ = 0, we only use the 1-step backup, so TD(0) is the temporal-difference learning we have been using so far. If λ = 1, we learn only from the final return, which means we are doing Monte Carlo.
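The geometric averaging the slide refers to is the standard λ-return (reconstructed here; the original equation image is missing):

\[
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)} .
\]

The weights (1-λ), (1-λ)λ, (1-λ)λ², … sum to one; with λ = 0 only the 1-step return survives, and as λ → 1 all the weight shifts onto the complete return.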

Backward view of TD(λ): eligibility traces. How do you best implement TD(λ)? The forward view tells us how every state is updated, but we would have to wait until the end of the episode before we could update anything. The backward view tells us how to broadcast the current temporal-difference error to previously visited states. The key idea is the eligibility trace, which keeps track of how much each past state should learn from the current TD error. With it, the algorithm can be implemented very easily, as sketched below.
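The trace and value updates that the slide presumably displayed are, in the standard accumulating-trace form,

\[
e_t(s) = \gamma\lambda\, e_{t-1}(s) + I(s, s_t),
\qquad
\Delta V_t(s) = \alpha\, \delta_t\, e_t(s) \quad \text{for all } s,
\]

and a minimal tabular sketch of the resulting algorithm follows. The environment interface (env.n_states, env.reset(), env.step()) is an illustrative assumption, not part of the original slides.

```python
import numpy as np

def td_lambda(env, num_episodes, alpha=0.1, gamma=1.0, lam=0.6):
    """Tabular TD(lambda) with accumulating eligibility traces (backward view).

    Assumes env exposes n_states, reset() -> state, and
    step(state) -> (next_state, reward, done); this interface is an
    illustrative assumption, not something defined in the slides.
    """
    V = np.zeros(env.n_states)
    for _ in range(num_episodes):
        e = np.zeros(env.n_states)          # eligibility traces, reset per episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(s)
            # TD error: delta_t = r + gamma * V(s') - V(s), with V(terminal) = 0
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            # decay all traces and bump the trace of the current state
            e *= gamma * lam
            e[s] += 1.0
            # broadcast the single TD error to all states, weighted by their traces
            V += alpha * delta * e
            s = s_next
    return V
```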

This is how TD(λ) does on the homework problem. Even though the 1-step backup was better at its optimal learning rate than any other n-step backup, TD(0.6) beats TD(0) when each is run at its optimal learning rate.

Forward view: after the whole episode is run, the net change in state s is the sum of the λ-return errors over the time steps at which s was visited (using the indicator function I(a,b), which is 1 if a = b and 0 otherwise). Backward view: after the whole episode is run, the net change in state s accumulates differently; on every step, only δt is broadcast, weighted by the eligibility trace, which can be written as a sum of decayed indicators. Both expressions are reconstructed below.
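In the standard notation (the slide's equations are not in the transcript), the two net changes are:

\[
\text{Forward view:}\qquad
\Delta V(s) = \alpha \sum_{t=0}^{T-1} I(s, s_t)\,\bigl[R_t^{\lambda} - V_t(s_t)\bigr],
\]
\[
\text{Backward view:}\qquad
\Delta V(s) = \alpha \sum_{t=0}^{T-1} \delta_t\, e_t(s),
\qquad
e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{\,t-k}\, I(s, s_k).
\]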

Aligning the forward and backward views. [Figure: a grid with axes k and t.] We can represent the backward view by a grid: at every time step t we go through all states visited so far and broadcast the error δt back. By changing the order of summation, from summing over rows to summing over columns, we arrive at something that already looks very much like the forward view. To finally see the equivalence, we need to show that:
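The identity to be shown (reconstructed; it holds when V is held fixed over the episode) is:

\[
R_t^{\lambda} - V(s_t) \;=\; \sum_{k=t}^{T-1} (\gamma\lambda)^{\,k-t}\, \delta_k,
\qquad
\delta_k = r_{k+1} + \gamma V(s_{k+1}) - V(s_k).
\]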

Now we can pull out all terms that start at t+1, then everything that starts at t+2, and so on; the result is exactly the weighted sum of TD errors that appears in the backward view:
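A sketch of that pull-out argument, under the same assumption that V is held fixed during the episode:

\[
\begin{aligned}
R_t^{\lambda} - V(s_t)
&= -V(s_t) + (1-\lambda)\bigl[r_{t+1} + \gamma V(s_{t+1})\bigr]
   + (1-\lambda)\lambda\bigl[r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2})\bigr] + \cdots \\
&= r_{t+1} + \gamma V(s_{t+1}) - V(s_t) + \gamma\lambda\bigl[R_{t+1}^{\lambda} - V(s_{t+1})\bigr] \\
&= \delta_t + \gamma\lambda\bigl[R_{t+1}^{\lambda} - V(s_{t+1})\bigr],
\end{aligned}
\]

and unrolling this recursion to the end of the episode gives \( \sum_{k=t}^{T-1} (\gamma\lambda)^{\,k-t}\, \delta_k \).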

Advantages of eligibility traces: They are a mixture between temporal-difference learning (which assumes a specific Markov model structure) and Monte Carlo methods (which do not assume any model structure). They can exploit model structure for better learning (like TD(0)), but if the Markov property is violated, they also provide some robustness (like Monte Carlo). Eligibility traces are also an elegant way of going from discrete to continuous time: in continuous time, state transitions form a labeled point process and eligibility traces decay exponentially. Discussion question: What if the states are continuous? That is, what if there is a vector of continuous variables x that describes the state space? One example would be the position and velocity of the car in the car-parking problem. How would we approximate V(x)?
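As a sketch of one standard answer to the discussion question (not from the slides): approximate V(x) with a parameterized function, for example a linear combination of basis functions, V(x) ≈ wᵀφ(x), with radial basis functions over position and velocity, and keep one eligibility trace per weight rather than per state. A minimal sketch, where the feature construction and the environment interface are illustrative assumptions:

```python
import numpy as np

def rbf_features(x, centers, width=0.5):
    """Radial-basis-function features phi(x) for a continuous state vector x."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def td_lambda_linear(env, centers, num_episodes, alpha=0.01, gamma=1.0, lam=0.6):
    """TD(lambda) with a linear value function V(x) = w . phi(x).

    The per-state trace becomes a per-weight trace,
    e <- gamma * lam * e + phi(x), since phi(x) is the gradient of V w.r.t. w.
    env.reset() / env.step(x) are assumed interfaces, not from the slides.
    """
    w = np.zeros(len(centers))
    for _ in range(num_episodes):
        e = np.zeros_like(w)
        x = env.reset()
        done = False
        while not done:
            x_next, r, done = env.step(x)
            phi = rbf_features(x, centers)
            v = w @ phi
            v_next = 0.0 if done else w @ rbf_features(x_next, centers)
            delta = r + gamma * v_next - v      # TD error under the approximation
            e = gamma * lam * e + phi           # accumulating per-weight trace
            w += alpha * delta * e              # update all weights at once
            x = x_next
    return w
```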

Schultz, Dayan, & Montague, Science 1997
