Transferring Instances for Model-Based Reinforcement Learning
Matthew E. Taylor, Teamcore, Department of Computer Science, University of Southern California. Joint work with Nicholas K. Jong and Peter Stone, Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin.
Inter-Task Transfer
Learning tabula rasa can be unnecessarily slow
Humans can use past information, e.g., soccer with different numbers of players, or with different state variables and actions
Agents should likewise leverage learned knowledge in novel or modified tasks: they learn faster, and larger, more complex problems become tractable
Model-Based RL vs. Model-Free RL
Model-Free (Q-Learning, Sarsa, etc.): learn values of actions. In the example: ~256 actions
Model-Based (Dyna-Q, R-Max, etc.): learn effects of actions ("what is the next state?" → planning). In the example: ~36 actions
(Example figure: a chain of states from Start to Goal; Action 1 returns to Start, Action 2 moves to the right; reward +1 at Goal, 0 otherwise)
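To make the contrast concrete, here is a minimal sketch (not from the slides) of a tabular Q-learning update next to a Dyna-Q-style agent that also records a one-step model and replans from it; all names and hyperparameters are illustrative assumptions.

    import random

    # Model-free: learn action values directly from each experienced transition.
    def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

    # Model-based (Dyna-Q style): also learn the effect of the action, then plan
    # by replaying the learned model instead of collecting new samples.
    def dyna_q_update(Q, model, s, a, r, s2, actions, planning_steps=10):
        q_learning_update(Q, s, a, r, s2, actions)
        model[(s, a)] = (r, s2)
        for _ in range(planning_steps):
            (ps, pa), (pr, ps2) = random.choice(list(model.items()))
            q_learning_update(Q, ps, pa, pr, ps2, actions)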
Transferring Instances for Model Based REinforcement Learning
TIMBREL: Transferring Instances for Model-Based REinforcement Learning
Transfer between model-learning RL algorithms
Different state variables and actions
Continuous state spaces
In this paper, we use Fitted R-Max [Jong and Stone, 2007] and the generalized mountain car domain
timbrel, n.: an ancient percussion instrument similar to a tambourine
Instance Transfer
Why instances? They are the model for instance-based methods
The source task may be learned with any method
(Diagram: the source-task agent-environment loop records instances ⟨state_t, action_t, reward_t, state_t+1⟩, ⟨state_t+1, action_t+1, reward_t+1, state_t+2⟩, …; inter-task mappings χX and χA relate them to the target-task agent-environment loop, which records instances ⟨state'_t, action'_t, reward'_t, state'_t+1⟩, …)
Utilize source task instances to approximate the model when insufficient target task instances exist
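A rough sketch (my own names, not the paper's) of the translation step implied by the diagram: recorded source instances ⟨s, a, r, s'⟩ are rewritten into target-task form through the inter-task mappings, so they can be pooled with whatever target instances exist.

    # States as tuples of state-variable values, actions as labels.
    # chi_X[i] = index of the source variable that target variable i maps to;
    # chi_A[target_action] = source_action. Both mappings may be many-to-one.
    def translate(source_instances, chi_X, chi_A):
        """Rewrite recorded source instances (s, a, r, s2) into target-task form."""
        transferred = []
        for s, a, r, s2 in source_instances:
            s_t  = tuple(s[j]  for j in chi_X)    # build the target state variable by variable
            s2_t = tuple(s2[j] for j in chi_X)
            for a_t, a_src in chi_A.items():      # every target action that maps onto a
                if a_src == a:
                    transferred.append((s_t, a_t, r, s2_t))
        return transferred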
Inter-Task Mappings
χX: s_target → s_source. Given a state variable in the target task (some x from s = ⟨x1, x2, …, xn⟩), return the corresponding state variable in the source task
χA: a_target → a_source. Similar, but for actions
Intuitive mappings exist in some domains (oracle)
Mappings can be learned, e.g., Taylor, Kuhlmann, and Stone (2008)
(Diagram: target task ⟨S', A'⟩ related to source task ⟨S, A⟩ via χX and χA)
Generalized Mountain Car
2D Mountain Car: state ⟨x, vx⟩; actions Left, Neutral, Right
3D Mountain Car: state ⟨x, y, vx, vy⟩; actions Neutral, West, East, South, North
χX: x, y → x; vx, vy → vx
χA: Neutral → Neutral; West, South → Left; East, North → Right
Both are episodic tasks
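A small self-contained sketch of these mappings as plain dictionaries, applied to one recorded 2D instance; the sample values and the way the many-to-one state mapping is handled (each target variable simply reads the source variable it maps to) are assumptions for illustration.

    # chi_X: each target (3D) state variable -> the source (2D) variable it maps to.
    chi_X = {"x": "x", "y": "x", "vx": "vx", "vy": "vx"}
    # chi_A: each target (3D) action -> the source (2D) action it maps to.
    chi_A = {"Neutral": "Neutral",
             "West": "Left", "South": "Left",
             "East": "Right", "North": "Right"}

    # One recorded 2D source instance: (state, action, reward, next_state).
    s, a, r, s2 = {"x": -0.5, "vx": 0.01}, "Left", -1, {"x": -0.52, "vx": 0.007}

    s_3d  = {v_t: s[v_s]  for v_t, v_s in chi_X.items()}   # e.g. y takes x's value
    s2_3d = {v_t: s2[v_s] for v_t, v_s in chi_X.items()}
    # Every 3D action that maps onto "Left" (here West and South) yields one instance.
    transferred = [(s_3d, a_t, r, s2_3d) for a_t, a_src in chi_A.items() if a_src == a]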
Fitted R-Max [Jong and Stone, 2007]
Instance-based RL method [Ormoneit & Sen, 2002]
Handles continuous state spaces
Weights recorded transitions by distance
Plans over a discrete, abstract MDP
(Example figure: 2 state variables x and y, 1 action; the transition at a query point is approximated from nearby recorded instances)
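The distance weighting can be pictured with the sketch below: a Gaussian-kernel estimate of the one-step model at a query point, built from recorded instances of one action. The kernel choice, bandwidth, relative state change, and returning None when no data is nearby (which Fitted R-Max would instead treat optimistically, R-Max style) are simplifications, not the paper's exact construction.

    import math

    def kernel_model(query, action, instances, bandwidth=0.1):
        """Distance-weighted one-step model from recorded instances.

        instances: list of (s, a, r, s2) with s, s2 as tuples of floats.
        Returns (expected_reward, successors), where successors is a list of
        (probability, predicted_next_state) pairs, one per matching instance.
        """
        weights, rewards, successors = [], [], []
        for s, a, r, s2 in instances:
            if a != action:
                continue
            d2 = sum((qi - si) ** 2 for qi, si in zip(query, s))
            w = math.exp(-d2 / (2 * bandwidth ** 2))     # closer instances count more
            # Apply the instance's observed state change relative to the query point.
            predicted = tuple(qi + (s2i - si) for qi, si, s2i in zip(query, s, s2))
            weights.append(w); rewards.append(r); successors.append(predicted)
        total = sum(weights)
        if total == 0:
            return None                                   # no data near this point
        expected_r = sum(w * r for w, r in zip(weights, rewards)) / total
        return expected_r, [(w / total, s2) for w, s2 in zip(weights, successors)]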
Compare Sarsa with Fitted R-Max
Fitted R-Max balances sample complexity, computational complexity, and asymptotic performance
Sarsa baseline: twenty neural networks, one per (state variable, action) combination
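As a heavily hedged illustration of the "one approximator per (state variable, action) combination" idea: the sketch below sums twenty single-input approximators to form Q(s, a) and trains them with Sarsa. The sum decomposition, the simple hat-function features standing in for the neural networks, and every hyperparameter are assumptions, not the configuration used in the paper.

    class SingleInputApproximator:
        """Tiny stand-in for one per-(state variable, action) network."""
        def __init__(self, n_features=10, lo=-1.0, hi=1.0):
            self.w = [0.0] * n_features
            step = (hi - lo) / (n_features - 1)
            self.centers = [lo + i * step for i in range(n_features)]
            self.width = step

        def features(self, x):
            # Overlapping "hat" features over the input range.
            return [max(0.0, 1.0 - abs(x - c) / self.width) for c in self.centers]

        def value(self, x):
            return sum(w * f for w, f in zip(self.w, self.features(x)))

        def update(self, x, delta, alpha):
            for i, f in enumerate(self.features(x)):
                self.w[i] += alpha * delta * f

    class SarsaAgent:
        def __init__(self, state_vars, actions):
            # One approximator per (state variable, action): 4 * 5 = 20 in 3D mountain car.
            self.state_vars, self.actions = state_vars, actions
            self.nets = {(v, a): SingleInputApproximator() for v in state_vars for a in actions}

        def q(self, s, a):                     # s is a dict keyed by state-variable name
            return sum(self.nets[(v, a)].value(s[v]) for v in self.state_vars)

        def sarsa_update(self, s, a, r, s2, a2, alpha=0.05, gamma=1.0):
            delta = r + gamma * self.q(s2, a2) - self.q(s, a)
            for v in self.state_vars:
                self.nets[(v, a)].update(s[v], delta, alpha)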
Instance Transfer
(Diagram: the source-task agent-environment loop records instances ⟨x, vx⟩, a, -1, ⟨x', vx'⟩, …; the inter-task mappings χX and χA transform each one into a target-format instance ⟨x, y, vx, vy⟩, a, -1, ⟨x', y', vx', vy'⟩ for the target-task agent-environment loop, which records its own instances ⟨state'_t, action'_t, reward'_t, state'_t+1⟩, …)
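A sketch of the triggering logic only (the neighbor count, radius, and Euclidean distance test are my assumptions, not taken from the paper): when too few target-task instances lie near the (state, action) point being modeled, transferred source instances are used to fill in the model approximation.

    import math

    def instances_for_model(query, action, target_instances, transferred_instances,
                            min_neighbors=5, radius=0.2):
        """Pick the instances the model approximation should use at one
        (state, action) query point; states are tuples of floats."""
        def near(inst):
            s, a, _, _ = inst
            return a == action and math.dist(query, s) <= radius
        nearby_target = [inst for inst in target_instances if near(inst)]
        if len(nearby_target) >= min_neighbors:
            return nearby_target                  # enough real target-task data here
        # Too little target data: fall back on source instances already translated
        # into target-task form via the inter-task mappings chi_X and chi_A.
        return nearby_target + [inst for inst in transferred_instances if near(inst)]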
Result #1: TIMBREL Succeeds
Train in the 2D task for 100 episodes, transform the instances, then learn in the 3D task
(Learning-curve figure: Transfer from 2D Task vs. No Transfer)
Result #2: Source Task Training Data
(Learning-curve figure: Transfer from 20, 10, or 5 source task episodes vs. No Transfer)
Result #3: Alternate Source Tasks
(Learning-curve figure: Transfer from the High Power 2D task, Transfer from the No Goal 2D task, and No Transfer)
Selected Related Work
Instance transfer in Fitted Q Iteration: Lazaric et al., 2008
Transferring a regression model of the transition function: Atkeson and Santamaria, 1997
Ordering prioritized sweeping via transfer: Sunmola and Wyatt, 2006
Bayesian model transfer: Tanaka and Yamamura, 2003; Wilson et al., 2007
Future Work
Implement with other model-learning methods: Dyna-Q and R-Max (discrete tasks only), and Fitted Q Iteration
Guard against the U-shaped curve in Fitted R-Max?
Examine more complex tasks
Can TIMBREL improve performance on real-world problems?
TIMBREL Conclusions
Significantly increases the speed of learning
Results suggest less data is needed to learn than with model-based RL without transfer, model-free RL without transfer, or model-free RL with transfer
Transfer performance depends on: source task and target task similarity, and the amount of source task data collected