STD(λ): Learning State Temporal Differences with TD(λ)
Lex Weaver, Department of Computer Science, Australian National University
Jonathan Baxter, WhizBang! Labs, Pittsburgh, PA, USA
Reinforcement Learning
Not supervised learning; no oracle providing the correct responses
Learning by trial and error
Feedback signal
–a reward (or cost) being a measure of goodness (or badness)
–which may be significantly delayed
Value Functions
Associate a value, representing expected future rewards, with each system state.
–Useful for choosing between actions leading to subsequent states.
The state → value mapping is the true value function.
For most interesting problems we cannot determine the true value function.
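For reference, the state → value mapping can be written in the standard discounted form (one common formulation; the slides do not commit to a particular horizon or discounting scheme), where π is the policy being followed, r_{t+1} the reward received after step t, and γ ∈ [0,1] a discount factor:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_0 = s \right]
```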
Value Function Approximation
Approximate the true value function with one from a parameterised class of functions (a minimal sketch follows below).
Use some learning method to find the parameterisation giving the “closest” approximation.
–back propagation (supervised learning)
–genetic algorithms (unsupervised learning)
–TD(λ) (unsupervised learning)
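As a concrete, hypothetical illustration of a parameterised class, the sketch below uses a linear value function over hand-coded features; the class name and feature map are assumptions for illustration, not taken from the talk.

```python
import numpy as np

class LinearValueFunction:
    """Minimal parameterised value-function approximator:
    V_theta(s) = theta . phi(s) for a hand-coded feature map phi(s)."""

    def __init__(self, num_features):
        self.theta = np.zeros(num_features)   # the parameterisation to be learned

    def value(self, phi_s):
        """Approximate value of a state, given its feature vector phi(s)."""
        return float(np.dot(self.theta, phi_s))

    def gradient(self, phi_s):
        """Gradient of V_theta(s) w.r.t. theta; for a linear form this is just phi(s)."""
        return phi_s
```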
TD(λ)
Temporal difference: the difference between one approximation of expected future rewards and a subsequent approximation.
TD(λ) is a successful algorithm for reducing temporal differences (Sutton 1988); a sketch of the update follows below.
–TD-Gammon (Tesauro 1992, 1994)
–KnightCap, TDLeaf(λ) (Baxter et al. 1998, 2000)
–tunes parameters to most closely approximate (weighted least squares) the true value function for the system being observed
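A minimal sketch of the standard TD(λ) update with accumulating eligibility traces for a linear approximator such as the one above; the trajectory format, step size, and discount factor are illustrative assumptions, not values from the talk.

```python
import numpy as np

def td_lambda_update(theta, phi, trajectory, alpha=0.05, gamma=0.95, lam=0.7):
    """One pass of TD(lambda) with accumulating eligibility traces over a
    recorded trajectory [(s, r, s_next), ...], where phi maps a state to its
    feature vector and theta parameterises V(s) = theta . phi(s)."""
    z = np.zeros_like(theta)                  # eligibility trace
    for s, r, s_next in trajectory:
        delta = r + gamma * (theta @ phi(s_next)) - theta @ phi(s)  # temporal difference
        z = gamma * lam * z + phi(s)          # decay, then mark the visited state as eligible
        theta = theta + alpha * delta * z     # shift values of all recently visited states
    return theta
```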
Relative State Values
When using [approximate] state values for making decisions, it is the relative state values that matter, not the absolute values.
TD(λ) minimises absolute error without regard to the relative ordering of states.
This can lead to problems, …
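A contrived toy example (not the paper's two-state system) showing why this matters: a new approximation can have a smaller total absolute error than an old one while reversing the ordering that action selection relies on.

```python
true_values = {"A": 1.0, "B": 2.0}      # B is genuinely the better state

old_approx = {"A": 0.0, "B": 3.0}       # absolute errors 1.0 + 1.0 = 2.0, ordering correct
new_approx = {"A": 1.4, "B": 1.3}       # absolute errors 0.4 + 0.7 = 1.1, ordering now wrong

def total_abs_error(approx):
    return sum(abs(approx[s] - true_values[s]) for s in true_values)

print(total_abs_error(old_approx))      # 2.0
print(total_abs_error(new_approx))      # 1.1 -- smaller error, yet B no longer looks better than A
```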
The Two-State System
STD(λ)
Uses a TD(λ) approach to minimise temporal differences in relative state values (see the sketch below).
–restricted to binary Markov decision processes
–weights parameter updates w.r.t. the probabilities of state transitions
–facilitates learning optimal policies even when observing sub-optimal policies
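The exact update rule is not given on the slide, so the following is only a rough sketch of the flavour described: a TD(λ)-style update applied to the difference in value between the two candidate successor states of a binary MDP, with each update reweighted by the estimated probability of the observed transition. How rewards enter the relative-value temporal difference is glossed over here; all names and signatures are illustrative, not the paper's.

```python
import numpy as np

def relative_td_sketch(theta, phi, decisions, alpha=0.05, gamma=0.95, lam=0.7):
    """Illustrative only -- not the paper's STD(lambda) update rule.
    Each decision point supplies the two candidate successors (s_a, s_b), the two
    candidates at the next decision point, and p_obs, the estimated probability of
    the transition actually taken.  A TD-style update is applied to the relative
    value D = V(s_a) - V(s_b), reweighted by p_obs."""
    z = np.zeros_like(theta)
    for s_a, s_b, s_a_next, s_b_next, p_obs in decisions:
        d      = theta @ (phi(s_a) - phi(s_b))              # current relative value
        d_next = theta @ (phi(s_a_next) - phi(s_b_next))    # subsequent relative value
        delta  = gamma * d_next - d                         # temporal difference in relative values
                                                            # (reward terms omitted in this sketch)
        z      = gamma * lam * z + (phi(s_a) - phi(s_b))    # eligibility over difference features
        theta  = theta + alpha * p_obs * delta * z          # reweighted by transition probability
    return theta
```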
Previous Work: DT(λ)
Differential Training; Bertsekas 1997 (the pairing idea is sketched below)
–during training, two instances of the system evolve independently but in lock-step
–uses a TD(λ) approach to minimise the temporal differences in relative (between the two instances) state values
–many of the state pairs seen in training may never need to be compared in reality, yet accommodating them can bias the function approximator and result in policy degradation
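A sketch of the lock-step pairing idea, assuming a linear approximator: two copies of the system evolve independently, and a TD(λ)-style update is applied to the relative value of each pair of states. The tuple layout and parameter values are assumptions; this is not claimed to reproduce Bertsekas's exact formulation.

```python
import numpy as np

def differential_training_sketch(theta, phi, paired_trajectory, alpha=0.05, gamma=0.95, lam=0.7):
    """Illustration of the lock-step pairing behind Differential Training.
    `paired_trajectory` is a list of (x, y, r_x, r_y, x_next, y_next) tuples:
    the states, rewards, and successors of the two independently evolving copies.
    A TD(lambda)-style update is applied to D(x, y) = V(x) - V(y)."""
    z = np.zeros_like(theta)
    for x, y, r_x, r_y, x_next, y_next in paired_trajectory:
        d      = theta @ (phi(x) - phi(y))              # relative value of the current pair
        d_next = theta @ (phi(x_next) - phi(y_next))    # relative value of the next pair
        delta  = (r_x - r_y) + gamma * d_next - d       # temporal difference of the pair
        z      = gamma * lam * z + (phi(x) - phi(y))    # eligibility over difference features
        theta  = theta + alpha * delta * z
    return theta
```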
TD(λ) vs STD(λ) Experiments
Two-State System
–observing optimal, random, and bad policies
–STD(λ) converges to the optimal policy; TD(λ) converges to a bad policy
Acrobot
–single-element feature vector & four-element feature vector
–observing both good and random policies
–STD(λ) yields good policies; TD(λ) yields bad policies
Evolution of the Function Approximator in the Two-State System with STD(λ)
Contributions & Future Work
Shown theoretically & empirically that learning with TD(λ) can result in policy degradation.
Proposed & successfully demonstrated an algorithm, STD(λ), for binary MDPs that does not suffer from the policy degradation identified in TD(λ).
Analysed DT(λ), and shown that it includes irrelevant state pairings in parameter updates, which can result in policy degradation.
Extend STD(λ) to general Markov Decision Processes.
Investigate the significance of the reweighting (w.r.t. state transition probabilities) of the parameter updates.