Transferring Instances for Model-Based Reinforcement Learning

Transferring Instances for Model-Based Reinforcement Learning
Matthew E. Taylor
Teamcore, Department of Computer Science, University of Southern California
Joint work with Nicholas K. Jong and Peter Stone
Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin

Inter-Task Transfer
- Learning tabula rasa can be unnecessarily slow
- Humans can use past information
  - e.g., soccer with different numbers of players: different state variables and actions
- Agents should leverage learned knowledge in novel or modified tasks
  - Learn faster
  - Larger and more complex problems become tractable

Model-Based RL vs. Model-Free RL
- Model-free (Q-Learning, Sarsa, etc.): learn the values of actions
  - In the example: ~256 actions
- Model-based (Dyna-Q, R-Max, etc.): learn the effects of actions ("what is the next state?" → planning)
  - In the example: ~36 actions
- Example task (figure): states from Start to Goal; Action 1 returns to Start, Action 2 moves to the right; reward +1 at Goal, 0 otherwise
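
To make "learn values of actions" versus "learn effects of actions" concrete, the sketch below contrasts a tabular Q-learning update with a simple model learner that records transition counts and rewards and then plans by value iteration. This is a generic illustration under assumed tabular representations, not code from the talk or paper; Dyna-Q and R-Max add further machinery (simulated updates, optimism for unvisited state-action pairs).

```python
from collections import defaultdict

# Model-free (Q-learning): update action values directly from each sample.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Model-based (Dyna-Q / R-Max flavor): record what each action does, then plan.
def update_model(model, s, a, r, s_next):
    counts, rewards = model
    counts[(s, a)][s_next] += 1
    rewards[(s, a)] += r

def plan(model, states, actions, gamma=0.95, sweeps=50):
    """Value iteration over the learned (empirical) transition/reward model."""
    counts, rewards = model
    V = defaultdict(float)
    for _ in range(sweeps):
        for s in states:
            action_values = []
            for a in actions:
                n = sum(counts[(s, a)].values())
                if n == 0:
                    continue  # unvisited (s, a); R-Max would treat it optimistically
                expected_r = rewards[(s, a)] / n
                expected_v = sum(c / n * V[s2] for s2, c in counts[(s, a)].items())
                action_values.append(expected_r + gamma * expected_v)
            if action_values:
                V[s] = max(action_values)
    return V

# Usage sketch:
#   Q = defaultdict(float)
#   model = (defaultdict(lambda: defaultdict(int)), defaultdict(float))
```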

TIMBREL: Transferring Instances for Model Based REinforcement Learning
- Transfer between model-learning RL algorithms
  - Different state variables and actions
  - Continuous state spaces
- In this paper, we use:
  - Fitted R-Max [Jong and Stone, 2007]
  - Generalized mountain car domain
- timbrel, n.: an ancient percussion instrument similar to a tambourine

Instance Transfer
- Why instances?
  - They are the model for instance-based methods
  - The source task may be learned with any method
- Diagram: the source-task agent-environment loop records instances (state_t, action_t, reward_t, state_t+1, …); the inter-task mappings χX and χA carry them over to the target task, whose agent-environment loop records its own instances (state'_t, action'_t, reward'_t, state'_t+1, …)
- Utilize source task instances to approximate the model when insufficient target task instances exist

Inter-Task Mappings
- χX: s_target → s_source
  - Given a state variable in the target task (some x from s = ⟨x1, x2, …, xn⟩), return the corresponding state variable in the source task
- χA: a_target → a_source
  - Similar, but for actions
- Intuitive mappings exist in some domains (oracle)
- Mappings can also be learned (e.g., Taylor, Kuhlmann, and Stone, 2008)
- Diagram: χX and χA connect the target task's S' and A' to the source task's S and A
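
As a rough illustration of how such mappings could be represented, the sketch below treats χX and χA as plain lookup tables and applies them to a target-task state and action, assuming states are stored as dicts keyed by variable name. That representation, and the tie-breaking rule when several target variables share a source variable, are assumptions for illustration, not something specified in the talk.

```python
from typing import Dict

def project_state(target_state: Dict[str, float],
                  chi_X: Dict[str, str]) -> Dict[str, float]:
    """Build the source-task view of a target-task state.

    chi_X maps each target-task state variable name to a source-task variable
    name (many-to-one). When several target variables map to the same source
    variable, the first one encountered supplies the value -- an illustrative
    choice, not taken from the paper.
    """
    source_state: Dict[str, float] = {}
    for target_var, source_var in chi_X.items():
        source_state.setdefault(source_var, target_state[target_var])
    return source_state

def project_action(target_action: str, chi_A: Dict[str, str]) -> str:
    """chi_A maps each target-task action to its corresponding source-task action."""
    return chi_A[target_action]
```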

Generalized Mountain Car
- 2D Mountain Car: state variables x, vx; actions Left, Neutral, Right
- 3D Mountain Car: state variables x, y, vx, vy; actions Neutral, West, East, South, North
- χX: x, y → x; vx, vy → vx
- χA: Neutral → Neutral; West, South → Left; East, North → Right
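
The mappings on this slide, written out as the lookup tables assumed in the sketch above (variable names follow the 3D mountain car tuple ‹x, y, vx, vy›; the example values are made up):

```python
# Target (3D) -> source (2D) mappings from the slide.
CHI_X = {"x": "x", "y": "x", "vx": "vx", "vy": "vx"}
CHI_A = {"Neutral": "Neutral",
         "West": "Left", "South": "Left",
         "East": "Right", "North": "Right"}

# Viewing a 3D state and action through the mappings:
s3 = {"x": -0.5, "y": 0.1, "vx": 0.01, "vy": -0.02}
s2 = {}
for target_var, source_var in CHI_X.items():
    s2.setdefault(source_var, s3[target_var])  # first matching variable wins
print(s2)                 # {'x': -0.5, 'vx': 0.01}
print(CHI_A["South"])     # 'Left'
```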

Fitted R-Max [Jong and Stone, 2007]
- Instance-based RL method [Ormoneit and Sen, 2002]
- Handles continuous state spaces
- Weights recorded transitions by distances
- Plans over a discrete, abstract MDP
- Example (figure): 2 state variables (x, y), 1 action
- Utilize source task instances to approximate the model when insufficient target task instances exist
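
Fitted R-Max itself discretizes the state space and plans over the resulting abstract MDP; the sketch below only illustrates the distance-weighted model estimate at a query point, with a Gaussian kernel standing in for the real weighting scheme, and adds the TIMBREL-style fallback to translated source instances to show where transfer plugs in. The function names, kernel, and min_weight threshold are all illustrative assumptions, not the paper's algorithm.

```python
import math

def euclidean(s1, s2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))

def kernel(dist, bandwidth=0.1):
    # Gaussian weighting of a recorded transition by its distance to the query.
    return math.exp(-(dist / bandwidth) ** 2)

def weighted_prediction(query_s, a, instances, dist_fn=euclidean):
    """Distance-weighted estimate of (next-state delta, reward, total weight).

    Each instance is (s, a, r, s_next), with states as tuples of floats.
    """
    dim = len(query_s)
    total_w, reward = 0.0, 0.0
    delta = [0.0] * dim
    for s, a_i, r, s_next in instances:
        if a_i != a:
            continue
        w = kernel(dist_fn(query_s, s))
        total_w += w
        reward += w * r
        for i in range(dim):
            delta[i] += w * (s_next[i] - s[i])
    if total_w == 0.0:
        return None, None, 0.0
    return [d / total_w for d in delta], reward / total_w, total_w

def estimate_model(query_s, a, target_instances, translated_source_instances,
                   min_weight=1.0):
    """Predict (next_state, reward) for a query (s, a) from stored instances.

    If too little target-task data lies near the query, fall back on
    translated source-task instances (the TIMBREL idea).
    """
    delta, reward, w = weighted_prediction(query_s, a, target_instances)
    if w < min_weight:
        delta, reward, w = weighted_prediction(query_s, a,
                                               translated_source_instances)
    if delta is None:
        return None  # no nearby data at all; R-Max treats this optimistically
    next_state = tuple(q + d for q, d in zip(query_s, delta))
    return next_state, reward
```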

Compare Sarsa with Fitted R-Max
- Fitted R-Max balances:
  - sample complexity
  - computational complexity
  - asymptotic performance
- Twenty 4-8-1 neural networks, one per (state variable, action) combination

Instance Transfer (instantiated for mountain car)
- Source task (2D) instances: ‹x, vx›, a, -1, ‹x', vx'›, …
- The inter-task mappings χX and χA transform them into the target task's form: ‹x, y, vx, vy›, a, -1, ‹x', y', vx', vy'›, …
- The target task (3D) agent-environment loop records its own instances in the same ‹x, y, vx, vy› form
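
To show the shape of the transformation suggested by this diagram, here is one plausible construction, not necessarily the one the paper uses: each target state variable takes the value of its corresponding source variable under χX, and the resulting instance is attached to a target action whose χA image matches the source instance's action. The lookup tables are the illustrative CHI_X/CHI_A from above.

```python
CHI_X = {"x": "x", "y": "x", "vx": "vx", "vy": "vx"}
CHI_A = {"Neutral": "Neutral",
         "West": "Left", "South": "Left",
         "East": "Right", "North": "Right"}

TARGET_VARS = ["x", "y", "vx", "vy"]

def transform_instance(source_instance, target_action):
    """Expand a 2D source transition into a 4D target transition (illustrative).

    source_instance: (s, a, r, s_next) with s, s_next as dicts over {x, vx}.
    Returns None if target_action does not correspond to the source action.
    """
    s, a, r, s_next = source_instance
    if CHI_A[target_action] != a:
        return None
    s4 = {v: s[CHI_X[v]] for v in TARGET_VARS}
    s4_next = {v: s_next[CHI_X[v]] for v in TARGET_VARS}
    return (s4, target_action, r, s4_next)

# Example: a 2D instance ‹x, vx›, Left, -1, ‹x', vx'› becomes a 4D instance
# ‹x, y, vx, vy›, South, -1, ‹x', y', vx', vy'›.
src = ({"x": -0.5, "vx": 0.02}, "Left", -1, {"x": -0.48, "vx": 0.03})
print(transform_instance(src, "South"))
```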

Result #1: TIMBREL Succeeds
- Train in the 2D task for 100 episodes, transform the instances, then learn in the 3D task
- (Learning-curve figure: "Transfer from 2D Task" vs. "No Transfer")

Result #2: Source Task Training Data
- (Learning-curve figure comparing transfer from 20, 10, and 5 source task episodes against no transfer)

Result #3: Alternate Source Tasks
- (Learning-curve figure comparing transfer from a High Power 2D task and from a No Goal 2D task against no transfer)

Selected Related Work
- Instance transfer in Fitted Q Iteration: Lazaric et al., 2008
- Transferring a regression model of the transition function: Atkeson and Santamaria, 1997
- Ordering prioritized sweeping via transfer: Sunmola and Wyatt, 2006
- Bayesian model transfer: Tanaka and Yamamura, 2003; Wilson et al., 2007

Future Work
- Implement with other model-learning methods
  - Dyna-Q and R-Max (discrete tasks only)
  - Fitted Q Iteration
- Guard against the U-shaped curve in Fitted R-Max?
- Examine more complex tasks: can TIMBREL improve performance on real-world problems?

TIMBREL Conclusions
- Significantly increases the speed of learning
- Results suggest less data is needed to learn than with:
  - Model-based RL without transfer
  - Model-free RL without transfer
  - Model-free RL with transfer
- Transfer performance depends on:
  - Similarity of the source and target tasks
  - Amount of source task data collected