TEMPORAL DIFFERENCE LEARNING Mark Romero – 11/03/2011

Introduction
Temporal Difference (TD) learning combines ideas from Monte Carlo methods and Dynamic Programming
Like MC, it samples the environment by following some policy
Like DP, it determines the current estimate based on previously learned estimates
Predictions are adjusted over time to match later, more accurate predictions
TD learning is popular for its simplicity and its suitability for on-line applications

MC vs TD
Constant-α MC:
V(s_t) <- V(s_t) + α[R_t – V(s_t)]
R_t – actual return observed from time t to the end of the episode
α – constant step-size parameter
Because the actual return is used, we must wait until the end of the episode to determine the update to V.
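
A minimal Python sketch of the constant-α MC update, not taken from the original slides; the episode, gamma, and alpha values below are made-up illustrative choices:

    # Constant-α MC: V(s) can only be updated once the episode has finished,
    # because the actual return R_t must be observed first.
    gamma = 0.9   # discount rate
    alpha = 0.1   # constant step-size parameter

    # One complete episode as (state, reward received after leaving that state)
    episode = [("A", 0.0), ("B", 1.0), ("C", 5.0)]
    V = {"A": 0.0, "B": 0.0, "C": 0.0}   # value estimates

    # Work backwards through the episode to accumulate each return R_t,
    # then apply the constant-α MC update V(s_t) <- V(s_t) + α[R_t - V(s_t)].
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    print(V)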

MC vs TD
TD(0):
V(s_t) <- V(s_t) + α[r_{t+1} + γV(s_{t+1}) – V(s_t)]
r_{t+1} – observed reward
γ – discount rate
The TD method only waits until the next time step: at time t+1 a target can be formed and an update made using the observed reward r_{t+1} and the existing estimate V(s_{t+1}).
In effect, TD(0) uses the target r_{t+1} + γV(s_{t+1}) in place of the full return R_t used by the MC method.
This is called bootstrapping because the update is based on an existing estimate.
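
A quick numeric illustration (the numbers are made up, not from the slides): if r_{t+1} = 1, γ = 0.9, V(s_{t+1}) = 2, V(s_t) = 1.5, and α = 0.1, the target is r_{t+1} + γV(s_{t+1}) = 1 + 0.9 × 2 = 2.8, and the update gives V(s_t) <- 1.5 + 0.1 × (2.8 – 1.5) = 1.63.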

Pseudo Code
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a <- action given by π for s
        Take action a; observe reward r and next state s'
        V(s) <- V(s) + α[r + γV(s') – V(s)]
        s <- s'
    until s is terminal
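
A runnable Python sketch of this tabular TD(0) evaluation loop is shown below; it is not from the original slides, and the environment interface (reset/step), the RandomWalk example, and the random policy are assumptions made purely for illustration:

    import random
    from collections import defaultdict

    def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.9):
        """Tabular TD(0) policy evaluation, following the pseudocode above.

        Assumes env.reset() returns a start state and env.step(a) returns
        (next_state, reward, done); policy(s) returns an action.
        """
        V = defaultdict(float)                 # V(s) initialized arbitrarily (to 0)
        for _ in range(num_episodes):
            s = env.reset()                    # Initialize s
            done = False
            while not done:                    # Repeat for each step of the episode
                a = policy(s)                  # a <- action given by π for s
                s_next, r, done = env.step(a)  # Take action a; observe r and s'
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])   # V(s) <- V(s) + α[r + γV(s') - V(s)]
                s = s_next                     # s <- s'
        return V

    # Illustrative environment: a 5-state random walk (states 0..4, start at 2);
    # reaching state 4 gives reward 1 and ends the episode, reaching 0 gives 0.
    class RandomWalk:
        def reset(self):
            self.s = 2
            return self.s
        def step(self, a):                     # a is -1 (left) or +1 (right)
            self.s += a
            done = self.s in (0, 4)
            reward = 1.0 if self.s == 4 else 0.0
            return self.s, reward, done

    V = td0_evaluate(RandomWalk(), policy=lambda s: random.choice([-1, 1]))
    print(dict(V))

Note that no model of the environment's transition probabilities is needed; each update uses only the sampled transition (s, a, r, s').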

Advantages over MC
Lends itself naturally to on-line applications: MC must wait until the end of the episode to make its update, while TD needs only one time step
This turns out to be a critical consideration, since some applications have very long episodes or no episodes at all (continuing tasks)
TD learns from every transition; MC methods generally must discount or throw out episodes in which an exploratory action was taken
In practice, TD methods usually converge faster than constant-α MC, although no formal proof of this has been developed

Soundness
Is TD sound? Yes: for any fixed policy π, the TD algorithm described above has been proven to converge to V^π, in the mean for a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
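
For reference (these conditions are not spelled out on the slide), the usual stochastic approximation conditions on a decreasing step-size sequence α_t are Σ_t α_t = ∞ and Σ_t α_t² < ∞; for example, α_t = 1/t satisfies both, while a constant α satisfies only the first.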