Transferring Instances for Model-Based Reinforcement Learning
Matthew E. Taylor
Teamcore, Department of Computer Science, University of Southern California
Joint work with Nicholas K. Jong and Peter Stone
Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin
Inter-Task Transfer
Learning tabula rasa can be unnecessarily slow
Humans can use past information
  Example: soccer with different numbers of players (different state variables and actions)
Agents should leverage learned knowledge in novel or modified tasks:
  Learn faster
  Make larger and more complex problems tractable
Model-Based RL vs. Model-Free RL
Model-free (Q-Learning, Sarsa, etc.):
  Learn values of actions
  In the example: ~256 actions
Model-based (Dyna-Q, R-Max, etc.):
  Learn effects of actions ("what is the next state?" → planning)
  In the example: ~36 actions
Example task: Start and Goal states; Action 1 returns to Start, Action 2 moves to the right; reward +1 at the Goal, 0 otherwise
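To make the distinction concrete, here is a minimal sketch (not from the talk; the constants and the defaultdict/model representation are my own choices) contrasting a model-free Q-learning update, which only adjusts the value of the action just taken, with a Dyna-Q-style model-based update, which also records the transition and replans with simulated experience drawn from the learned model.

import random
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
model = {}                  # model[(state, action)] -> (reward, next_state)
ALPHA, GAMMA = 0.1, 0.95

def q_learning_update(s, a, r, s_next, actions):
    # Model-free: adjust only the value of the action just taken.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def dyna_q_update(s, a, r, s_next, actions, planning_steps=10):
    # Model-based (Dyna-Q style): learn the effect of the action,
    # then plan by replaying transitions sampled from the learned model.
    q_learning_update(s, a, r, s_next, actions)
    model[(s, a)] = (r, s_next)
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        q_learning_update(ps, pa, pr, ps_next, actions)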
TIMBREL: Transferring Instances for Model-Based REinforcement Learning
(timbrel, n. An ancient percussion instrument similar to a tambourine.)
Transfer between model-learning RL algorithms:
  Different state variables and actions
  Continuous state spaces
In this paper, we use:
  Fitted R-Max [Jong and Stone, 2007]
  The generalized mountain car domain
Instance Transfer
Why instances?
  They are the model for instance-based methods
  The source task may be learned with any method
Utilize source task instances to approximate the model when insufficient target task instances exist
[Diagram: the source task agent-environment loop records instances <state_t, action_t, reward_t, state_t+1>, which the inter-task mappings χX and χA translate into instances <state'_t, action'_t, reward'_t, state'_t+1> for the target task agent-environment loop]
Inter-Task Mappings
χX: s_target → s_source
  Given a state variable in the target task (some x_i from s = <x_1, x_2, …, x_n>), return the corresponding state variable in the source task
χA: a_target → a_source
  Similar, but for actions
Intuitive mappings exist in some domains; here they are assumed to be given (as if by an oracle)
Mappings can also be learned (e.g., Taylor, Kuhlmann, and Stone, 2008)
[Diagram: χX and χA map the target task's state variables S' and actions A' to the source task's S and A]
Generalized Mountain Car
2D Mountain Car: state variables x, vx; actions Left, Neutral, Right
3D Mountain Car: state variables x, y, vx, vy; actions Neutral, West, East, South, North
Both are episodic tasks
χX: x, y → x;  vx, vy → vx
χA: Neutral → Neutral;  West, South → Left;  East, North → Right
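The hand-coded mountain car mappings listed above can be written as two small lookup tables. A sketch follows; the CHI_X and CHI_A names and the string identifiers for variables and actions are illustrative, not taken from the actual implementation.

# chi_X: target (3D) state variable -> corresponding source (2D) state variable
CHI_X = {
    "x": "x",   "y": "x",      # both positions map to the 2D position
    "vx": "vx", "vy": "vx",    # both velocities map to the 2D velocity
}

# chi_A: target (3D) action -> corresponding source (2D) action
CHI_A = {
    "Neutral": "Neutral",
    "West": "Left",  "South": "Left",
    "East": "Right", "North": "Right",
}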
Fitted R-Max [Jong and Stone, 2007]
Instance-based RL method [Ormoneit & Sen, 2002]
  Handles continuous state spaces
  Weights recorded transitions by distance to the queried point
  Plans over a discrete, abstract MDP
[Diagram: example with 2 state variables (x, y) and 1 action; recorded transitions near a queried point are weighted by distance to predict its outcome]
Utilize source task instances to approximate the model when insufficient target task instances exist
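Below is a rough sketch of the instance-weighting idea, under assumptions of my own (a Gaussian kernel and the function name approximate_model): nearby recorded transitions for an action vote on the expected reward and next-state offset at a queried state. Fitted R-Max itself additionally plans over a discretized abstract MDP, which this sketch omits.

import numpy as np

def approximate_model(query_state, action, instances, bandwidth=0.1):
    """Distance-weighted model estimate at (query_state, action).

    instances: list of (state, action, reward, next_state) tuples with
    states as array-likes.  Returns (expected_reward, expected_next_state),
    or None when no data for the action exists (the "unknown" case that
    R-Max-style algorithms treat optimistically)."""
    s = np.asarray(query_state, dtype=float)
    rewards, offsets, weights = [], [], []
    for s_i, a_i, r_i, s_next_i in instances:
        if a_i != action:
            continue
        d = np.linalg.norm(s - np.asarray(s_i, dtype=float))
        w = np.exp(-(d / bandwidth) ** 2)   # closer transitions count more
        rewards.append(r_i)
        offsets.append(np.asarray(s_next_i, dtype=float) - np.asarray(s_i, dtype=float))
        weights.append(w)
    if not weights or sum(weights) < 1e-6:
        return None
    w = np.asarray(weights) / np.sum(weights)
    return float(w @ np.asarray(rewards)), s + w @ np.asarray(offsets)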
Compare Sarsa with Fitted R-Max
Fitted R-Max balances:
  Sample complexity
  Computational complexity
  Asymptotic performance
Sarsa's function approximator: twenty 4-8-1 neural networks, one per (state variable, action) combination
Instance Transfer (Mountain Car)
Source task (2D) records instances: ‹x, vx›, a, -1, ‹x', vx'›, …
The inter-task mappings χX and χA translate these into target task (3D) instances: ‹x, y, vx, vy›, a, -1, ‹x', y', vx', vy'›, …
[Diagram: the source and target agent-environment loops (state, action, reward) connected by the inter-task mappings]
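A minimal sketch of the translation step shown above, reusing the illustrative CHI_X and CHI_A tables from the earlier mapping sketch; representing states as dictionaries keyed by variable name is an assumption of this sketch, not the paper's representation.

def translate_instance(source_instance, target_action, chi_x, chi_a):
    """Turn a recorded 2D source instance into a 3D target-task instance for
    target_action, provided chi_a maps target_action onto the source action."""
    s_src, a_src, r_src, s_next_src = source_instance
    if chi_a[target_action] != a_src:
        return None
    # Copy each mapped state variable's value from the source instance.
    s_tgt = {var: s_src[src_var] for var, src_var in chi_x.items()}
    s_next_tgt = {var: s_next_src[src_var] for var, src_var in chi_x.items()}
    return (s_tgt, target_action, r_src, s_next_tgt)

# Example: a 2D instance for action Left yields a 3D instance for action West,
# with y copied from x and vy copied from vx.
source = ({"x": -0.5, "vx": 0.01}, "Left", -1, {"x": -0.49, "vx": 0.02})
print(translate_instance(source, "West", CHI_X, CHI_A))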
Result #1: TIMBREL Succeeds
Train in the 2D task for 100 episodes, transform the instances, then learn in the 3D task
[Learning curves: "Transfer from 2D Task" vs. "No Transfer"]
Result #2: Source Task Training Data
[Learning curves: transfer from 20, 10, and 5 source task episodes vs. "No Transfer"]
Result #3: Alternate Source Tasks
[Learning curves: "Transfer from High Power 2D task" and "Transfer from No Goal 2D task" vs. "No Transfer"]
Selected Related Work
Instance transfer in Fitted Q Iteration: Lazaric et al., 2008
Transferring a regression model of the transition function: Atkeson and Santamaria, 1997
Ordering prioritized sweeping via transfer: Sunmola and Wyatt, 2006
Bayesian model transfer: Tanaka and Yamamura, 2003; Wilson et al., 2007
Future Work
Implement TIMBREL with other model-learning methods:
  Dyna-Q and R-Max (discrete tasks only)
  Fitted Q Iteration
Guard against a U-shaped learning curve in Fitted R-Max?
Examine more complex tasks: can TIMBREL improve performance on real-world problems?
TIMBREL Conclusions
Significantly increases the speed of learning
Results suggest less data is needed to learn than with:
  Model-based RL without transfer
  Model-free RL without transfer
  Model-free RL with transfer
Transfer performance depends on:
  Similarity between the source and target tasks
  Amount of source task data collected