Autonomous Inter-Task Transfer in Reinforcement Learning Domains
Matthew E. Taylor
Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin
6/24/2008
Inter-Task Transfer
- Learning tabula rasa can be unnecessarily slow
- Humans can use past information
  - e.g., soccer with different numbers of players
- Agents can likewise leverage learned knowledge in novel tasks
Primary Questions
Source task (S_source, A_source) → Target task (S_target, A_target)
- Is it possible to transfer learned knowledge?
- Is it possible to transfer without providing a task mapping?
- Only reinforcement learning tasks are considered
Reinforcement Learning (RL): Key Ideas
- Markov Decision Process (MDP): ⟨S, A, T, R⟩
  - S: states in the task
  - A: actions the agent can take
  - T: transition function, T(S, A) → S′
  - R: reward function, R(S) → ℜ
- Policy: π(s) = a
- Action-value function: Q(s, a) → ℜ
- State variables: s = ⟨x₁, x₂, …, xₙ⟩
[Figure: agent–environment loop: the agent sends an action, the environment returns the next state and a reward]
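To make the loop concrete, here is a minimal, hypothetical Python sketch of an agent interacting with an MDP; the environment interface (`reset`, `step`) and the class names are illustrative assumptions, not the dissertation's implementation, which uses continuous states and function approximation.

```python
import random
from collections import defaultdict

class TabularAgent:
    """Maintains a tabular action-value function Q(s, a) and an
    epsilon-greedy policy pi(s) derived from it."""
    def __init__(self, actions, epsilon=0.1):
        self.q = defaultdict(float)        # Q(s, a) -> real value
        self.actions = actions
        self.epsilon = epsilon

    def policy(self, state):
        if random.random() < self.epsilon:                 # explore
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

def run_episode(env, agent):
    """One pass through the agent-environment loop of an MDP <S, A, T, R>:
    the agent emits an action; the environment applies T and R and returns
    the next state and a reward."""
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = agent.policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward
```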
Outline
- Reinforcement Learning Background
- Inter-Task Mappings
- Value Function Transfer
- MASTER: Learning Inter-Task Mappings
- Related Work
- Future Work and Conclusion
Enabling Transfer
- Source task: Q_S : S_S × A_S → ℜ
- Target task: Q_T : S_T × A_T → ℜ
[Figure: separate agent–environment loops for the source task and the target task]
Inter-Task Mappings
[Figure: source and target tasks connected by a mapping]
Inter-Task Mappings
- χ_X: s_target → s_source
  - Given a state variable in the target task (some x_i from s = ⟨x₁, x₂, …, xₙ⟩)
  - Returns the corresponding state variable in the source task
- χ_A: a_target → a_source
  - Similar, but for actions
- Intuitive mappings exist in some domains (oracle)
- Used to construct the transfer functional
[Figure: χ_X maps S_target = ⟨x₁ … xₙ⟩ onto S_source = ⟨x₁ … x_k⟩; χ_A maps A_target = {a₁ … a_m} onto A_source = {a₁ … a_j}]
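A minimal sketch of how χ_X and χ_A might be represented in code, as dictionaries keyed by target-task names; the variable and action names are placeholders for illustration, not identifiers from the dissertation.

```python
# chi_X: target state variable -> source state variable
# chi_A: target action         -> source action
chi_X = {"x1_t": "x1_s", "x2_t": "x2_s", "x3_t": "x1_s"}
chi_A = {"a1_t": "a1_s", "a2_t": "a2_s", "a3_t": "a2_s"}

def to_source(state_t, action_t):
    """Rewrite a target-task (state, action) pair in source-task terms.
    state_t is a dict {target_variable_name: value}.  chi_X may be
    many-to-one; in this sketch the last target variable written to a
    source variable wins, and a real implementation would need its own
    convention for resolving such collisions."""
    state_s = {chi_X[name]: value for name, value in state_t.items()}
    return state_s, chi_A[action_t]

# Example: a target-task state expressed in terms of source-task variables.
s_t = {"x1_t": 0.3, "x2_t": -1.2, "x3_t": 0.7}
print(to_source(s_t, "a3_t"))   # ({'x1_s': 0.7, 'x2_s': -1.2}, 'a2_s')
```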
Keepaway [Stone, Sutton, and Kuhlmann 2005]
Goal: maintain possession of the ball
3 vs. 2:
- 5 agents
- 3 (stochastic) actions
- 13 (noisy, continuous) state variables
- The keeper with the ball may hold it or pass to either teammate
- Both takers move toward the player with the ball
4 vs. 3:
- 7 agents
- 4 actions
- 19 state variables
[Figure: keepers K1–K3 and takers T1–T2 on the field]
Keepaway: Hand-coded χ_A
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2:

  4 vs. 3 action | 3 vs. 2 action
  Hold_4v3       | Hold_3v2
  Pass1_4v3      | Pass1_3v2
  Pass2_4v3      | Pass2_3v2
  Pass3_4v3      | Pass2_3v2

[Figure: 3 vs. 2 field (K1–K3, T1–T2) and 4 vs. 3 field (K1–K4, T1–T3) with the pass actions labeled]
Keepaway: Hand-coded χ_X
- Define similar state variables in the two tasks
- Example: distances from the player with the ball to its teammates
[Figure: corresponding teammate distances in the 3 vs. 2 and 4 vs. 3 fields]
Outline
- Reinforcement Learning Background
- Inter-Task Mappings
- Value Function Transfer
- MASTER: Learning Inter-Task Mappings
- Related Work
- Future Work and Conclusion
Value Function Transfer
Source task (S_source, A_source) → Target task (S_target, A_target)
Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)
- The action-value function is transferred
- ρ is task-dependent: it relies on the inter-task mappings
- ρ is needed because Q_S is not defined on S_T and A_T
[Figure: source-task and target-task agent–environment loops, with ρ carrying Q_S : S_S × A_S → ℜ over to Q_T : S_T × A_T → ℜ]
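A minimal sketch of one way ρ could look for a value function stored as a lookup table keyed by (state descriptor, action): every target-task entry is initialized from the source-task entry selected by the inter-task mappings. The dissertation's learners use CMAC/RBF/neural-network approximators, so this dictionary version is only illustrative.

```python
def rho(q_source, target_entries, chi_X, chi_A):
    """Build an initial target-task action-value function Q_T from a learned
    source-task function Q_S.

    q_source       : dict mapping (source_state_descriptor, source_action) -> value
    target_entries : iterable of (target_state_descriptor, target_action) pairs
                     that should receive an initial value
    chi_X, chi_A   : callables mapping target descriptors/actions to source ones
    """
    q_target = {}
    for s_t, a_t in target_entries:
        s_s, a_s = chi_X(s_t), chi_A(a_t)
        # Copy the corresponding source value; unseen source entries default to 0.
        q_target[(s_t, a_t)] = q_source.get((s_s, a_s), 0.0)
    return q_target
```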
Learning Keepaway
- Sarsa update
  - CMAC, RBF, and neural network function approximation are all successful
- Q^π(s, a): predicted number of steps the episode will last
  - Reward = +1 for every timestep
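The Sarsa update in a tabular sketch (the actual Keepaway learners use CMAC, RBF, or neural-network approximation); with reward +1 per timestep and γ = 1, Q(s, a) comes to estimate how many more steps the episode will last.

```python
def sarsa_update(q, s, a, reward, s_next, a_next,
                 alpha=0.1, gamma=1.0, done=False):
    """On-policy temporal-difference (Sarsa) update:
        Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
    q is a defaultdict(float) keyed by (state, action).  With r = +1 per
    timestep, Q(s, a) predicts the expected number of remaining steps."""
    target = reward if done else reward + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
```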
ρ's Effect on CMACs
For each weight in the 4 vs. 3 function approximator:
- Use the inter-task mapping to find the corresponding 3 vs. 2 weight
[Figure: 3 vs. 2 and 4 vs. 3 CMAC tilings]
Transfer Evaluation Metrics
Set a threshold performance that the majority of agents can achieve with learning (threshold of 8.5 in the example plot).
Two distinct scenarios:
1. Target Time Metric: transfer is successful if target-task learning time is reduced
   - "Sunk cost" of source-task training is ignored; only the target task matters
   - Source task(s) are independently useful
   - AI goal: effectively utilize past knowledge
2. Total Time Metric: transfer is successful if the total (source + target) time is reduced
   - Source task(s) are not independently useful
   - Engineering goal: minimize total training
[Figure: learning curves for the target task with and without transfer, and for target + source time with transfer]
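A small sketch of how the two metrics could be computed from recorded learning curves; the curve format, threshold handling, and function names are assumptions for illustration.

```python
def time_to_threshold(curve, threshold):
    """curve: ordered list of (training_time, performance) points.
    Returns the first training time at which performance reaches the
    threshold, or None if it never does."""
    for t, performance in curve:
        if performance >= threshold:
            return t
    return None

def evaluate_transfer(target_curve, no_transfer_curve, source_time, threshold):
    t_transfer = time_to_threshold(target_curve, threshold)
    t_baseline = time_to_threshold(no_transfer_curve, threshold)
    reached = t_transfer is not None and t_baseline is not None
    return {
        # Target Time Metric: source-task training is a sunk cost.
        "target_time_success": reached and t_transfer < t_baseline,
        # Total Time Metric: source-task training time is charged to transfer.
        "total_time_success": reached and (source_time + t_transfer) < t_baseline,
    }
```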
Value Function Transfer: Time to Threshold in 4 vs. 3
[Figure: target-task time and total time to threshold, compared against learning 4 vs. 3 with no transfer]
Value Function Transfer Flexibility
- Different function approximators
  - Radial basis function & neural network
- Different actuators
  - "Accurate" passers have normal actuators; "inaccurate" passers have less capable kick actuators
  - Value Function Transfer also reduces target-task time and total time for:
    - Inaccurate 3 vs. 2 → Inaccurate 4 vs. 3
    - Accurate 3 vs. 2 → Inaccurate 4 vs. 3
    - Inaccurate 3 vs. 2 → Accurate 4 vs. 3
[Figure: pass accuracy of the two actuator types]
Value Function Transfer Flexibility
- Different function approximators
- Different actuators
- Different Keepaway tasks
  - 5 vs. 4, 6 vs. 5, 7 vs. 6
Value Function Transfer Flexibility
- Different function approximators
- Different actuators
- Different Keepaway tasks
- Partial mappings

  Transfer Functional | # 3 vs. 2 Episodes | Avg. 4 vs. 3 Time
  none                | 0                  | 30.84
  Full                | 3000               | 9.12
  Partial             | 3000               | 12.00

[Figure: 4 vs. 3 players mapped onto 3 vs. 2, with some players left unmapped]
Value Function Transfer Flexibility
- Different function approximators
- Different actuators
- Different Keepaway tasks
- Partial mappings
- Different domains
  - Knight Joust to 4 vs. 3 Keepaway

Knight Joust:
- Goal: travel from the start to the goal line
- 2 agents, 3 actions, 3 state variables
- Fully observable
- Discrete state space (Q-table with ~600 (s, a) pairs)
- Deterministic actions
- The opponent moves directly toward the player
- The player may move North, or take a knight's jump to either side

  # Knight Joust Episodes | 4 vs. 3 Time
  0                       | 30.84
  25,000                  | 24.24
  50,000                  | 18.90
Value Function Transfer Flexibility
- Different function approximators
- Different actuators
- Different Keepaway tasks
- Partial mappings
- Different domains
  - Knight Joust to 4 vs. 3 Keepaway
  - 3 vs. 2 Flat Reward, 3 vs. 2 Giveaway

  Source Task         | Episodes | 4 vs. 3 Time
  None                |          | 30.84
  3 vs. 2 Keepaway    | 3000     | 9.12
  3 vs. 2 Flat Reward | 3000     | 19.42
  3 vs. 2 Giveaway    | 3000     | 32.94
Transfer Methods

  Transfer Method         | Source RL Base Method | Target RL Base Method | Knowledge Transferred
  Value Function Transfer | Temporal Difference   | Temporal Difference   | Q-values or Q-value weights
  Q-Value Reuse           | Temporal Difference   | Temporal Difference   | Function approximator
  Policy Transfer         | Policy Search         | Policy Search         | Neural network weights
  TIMBREL                 | Any                   | Model-Learning        | Experienced instances
  Rule Transfer           | Any                   | Temporal Difference   | Rules
  Representation Transfer | multiple              | multiple              | multiple
Empirical Evaluation
- Keepaway: 3 vs. 2, 4 vs. 3, 5 vs. 4, 6 vs. 5, 7 vs. 6
- Server Job Scheduling
  - Autonomic computing task
  - A server processes jobs in a queue while new jobs arrive
  - The policy selects between jobs with different utility functions
  - Source task: job types 1–2; target task: job types 1–4
Empirical Evaluation
- Keepaway: 3 vs. 2, 4 vs. 3, 5 vs. 4, 6 vs. 5, 7 vs. 6
- Server Job Scheduling
  - Autonomic computing task
  - A server processes jobs in a queue while new jobs arrive
  - The policy selects between jobs with different utility functions
- Mountain Car
  - 2D
  - 3D
- Cross-Domain Transfer
  - Ringworld to Keepaway
  - Knight's Joust to Keepaway
Tasks differ in: number of actions, number of state variables, discrete vs. continuous, deterministic vs. stochastic, fully vs. partially observable, single-agent vs. multi-agent.
[Figure: 3 vs. 2 Keepaway]
Outline
- Reinforcement Learning Background
- Inter-Task Mappings
- Value Function Transfer
- MASTER: Learning Inter-Task Mappings
- Related Work
- Future Work and Conclusion
Learning Task Relationships
- Sometimes task relationships are unknown
- Learning them is necessary for autonomous transfer
- But finding similarities (analogies) can be very hard!
- Key idea:
  - Agents may generate data (experience) in both tasks
  - Leverage existing machine learning techniques
- Two techniques, differing in the amount of background knowledge required
Context
Steps to enable autonomous transfer:
1. Select a relevant source task, given a target task
2. Learn how the source and target tasks are related
3. Effectively transfer knowledge between tasks

- Transfer is feasible (step 3)
- Steps toward finding mappings between tasks (step 2):
  - Leverage full QDBNs to search for mappings [Liu and Stone, 2006]
  - Test possible mappings on-line [Soni and Singh, 2006]
  - Mapping learning via classification?
Context
Steps to enable autonomous transfer:
1. Select a relevant source task, given a target task
2. Learn how the source and target tasks are related
3. Effectively transfer knowledge between tasks

- Transfer is feasible (step 3)
- Steps toward finding mappings between tasks (step 2):
  - Leverage full QDBNs to search for mappings [Liu and Stone, 2006]
  - Test possible mappings on-line [Soni and Singh, 2006]
  - Mapping learning via classification
[Figure: source-task (S, A, r, S′) tuples train an action classifier, which is applied to target-task (S, r, S′) data to produce an action mapping A → A]
MASTER Overview
Modeling Approximate State Transitions by Exploiting Regression
Goals:
- Learn an inter-task mapping between the tasks
- Minimize data complexity
- No background knowledge needed
Algorithm overview:
1. Record data in the source task
2. Record a small amount of data in the target task
3. Analyze the data off-line to determine the best mapping
4. Use the mapping in the target task
[Figure: MASTER sits between the source-task and target-task agent–environment loops]
MASTER Algorithm
1. Record observed (s_source, a_source, s′_source) tuples in the source task
2. Record a small number of (s_target, a_target, s′_target) tuples in the target task
3. Learn a one-step transition model of the target task, M(s_target, a_target) → s′_target
4. For every possible action mapping χ_A and every possible state variable mapping χ_X:
   - Transform the recorded source-task tuples
   - Calculate the error of the transformed source-task tuples on the target-task model:
     Σ (M(s_transformed, a_transformed) − s′_transformed)²
5. Return the χ_A, χ_X with the lowest error
[Figure: MASTER sits between the source-task and target-task agent–environment loops]
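A compact sketch of MASTER under simplifying assumptions: states are fixed-length tuples, the target-task transition model is passed in as a callable `model(s, a) -> predicted next state` (the experiments train a neural network in Weka), and both mapping spaces are enumerated exhaustively. All names are illustrative.

```python
from itertools import product

def state_mappings(n_target_vars, n_source_vars):
    """Every chi_X: assign each target variable the index of a source variable."""
    return product(range(n_source_vars), repeat=n_target_vars)

def action_mappings(target_actions, source_actions):
    """Every chi_A: assign each target action a source action."""
    for choice in product(source_actions, repeat=len(target_actions)):
        yield dict(zip(target_actions, choice))

def transform(source_tuples, chi_X, chi_A):
    """Rewrite recorded source tuples (s, a, s_next) in the target format.
    One source tuple can yield several target-format tuples, one per target
    action that chi_A maps onto the observed source action."""
    out = []
    for s, a, s_next in source_tuples:
        s_t = tuple(s[i] for i in chi_X)
        s_t_next = tuple(s_next[i] for i in chi_X)
        out.extend((s_t, a_t, s_t_next) for a_t, a_s in chi_A.items() if a_s == a)
    return out

def mapping_error(model, tuples):
    """Mean squared one-step prediction error under the target-task model."""
    if not tuples:
        return float("inf")
    err = sum(sum((p - q) ** 2 for p, q in zip(model(s_t, a_t), s_t_next))
              for s_t, a_t, s_t_next in tuples)
    return err / len(tuples)

def master(source_tuples, model, n_target_vars, n_source_vars,
           target_actions, source_actions):
    """Return the (chi_X, chi_A) pair whose transformed source data best
    matches the learned target-task transition model."""
    best, best_err = None, float("inf")
    for chi_X in state_mappings(n_target_vars, n_source_vars):
        for chi_A in action_mappings(target_actions, source_actions):
            err = mapping_error(model, transform(source_tuples, chi_X, chi_A))
            if err < best_err:
                best, best_err = (chi_X, chi_A), err
    return best
```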
Observations
Pros:
- Very little target-task data is needed (low sample complexity)
- The analysis for discovering mappings is done off-line
Cons:
- Exponential in the number of state variables and actions
Generalized Mountain Car
2D Mountain Car:
- State variables: x, ẋ
- Actions: Left, Neutral, Right
3D Mountain Car (novel task):
- State variables: x, y, ẋ, ẏ
- Actions: Neutral, West, East, South, North
Generalized Mountain Car
2D Mountain Car:
- State variables: x, ẋ
- Actions: Left, Neutral, Right
3D Mountain Car (novel task):
- State variables: x, y, ẋ, ẏ
- Actions: Neutral, West, East, South, North
Hand-coded χ_X:
- x, y → x
- ẋ, ẏ → ẋ
Hand-coded χ_A:
- Neutral → Neutral
- West, South → Left
- East, North → Right
Both tasks: episodic, scaled state variables, Sarsa with CMAC function approximation
MASTER Algorithm
1. Record observed (s_source, a_source, s′_source) tuples in the source task
2. Record a small number of (s_target, a_target, s′_target) tuples in the target task
3. Learn a one-step transition model of the target task, M(s_target, a_target) → s′_target
4. For every possible action mapping χ_A and every possible state variable mapping χ_X:
   - Transform the recorded source-task tuples
   - Calculate the error of the transformed source-task tuples on the target-task model:
     Σ (M(s_transformed, a_transformed) − s′_transformed)²
5. Return the χ_A, χ_X with the lowest error
MASTER and Mountain Car
1. Record observed (x, ẋ, a_2D, x′, ẋ′) tuples in the 2D task
2. Record a small number of (x, y, ẋ, ẏ, a_3D, x′, y′, ẋ′, ẏ′) tuples in the 3D task
3. Learn a one-step transition model of the 3D task, M(x, y, ẋ, ẏ, a_3D) → (x′, y′, ẋ′, ẏ′)
4. For every possible action mapping χ_A and every possible state variable mapping χ_X:
   - Transform the recorded source-task tuples
   - Calculate the error of the transformed source-task tuples on the target-task model:
     Σ (M(s_transformed, a_transformed) − s′_transformed)²
5. Return the χ_A, χ_X with the lowest error
MASTER and Mountain Car
1. Record observed (x, ẋ, a_2D, x′, ẋ′) tuples in the 2D task
2. Record a small number of (x, y, ẋ, ẏ, a_3D, x′, y′, ẋ′, ẏ′) tuples in the 3D task
3. Learn a one-step transition model of the 3D task, M(x, y, ẋ, ẏ, a_3D) → (x′, y′, ẋ′, ẏ′)
Example candidate mapping:
- χ_A: {Neutral, West} → Neutral; {South} → Left; {East, North} → Right
- χ_X: {x, y, ẋ} → x; {ẏ} → ẋ
Under this mapping the model is queried as M(x, x, x, ẋ, a_3D) → (x′, x′, x′, ẋ′), so the 2D tuple
  (-0.50, 0.01, Right, -0.49, 0.02)
is transformed into the 3D-format tuples
  (-0.50, -0.50, -0.50, 0.01, East, -0.49, -0.49, -0.49, 0.02)
  (-0.50, -0.50, -0.50, 0.01, North, -0.49, -0.49, -0.49, 0.02)
MASTER and Mountain Car
1. Record observed (x, ẋ, a_2D, x′, ẋ′) tuples in the 2D task
2. Record a small number of (x, y, ẋ, ẏ, a_3D, x′, y′, ẋ′, ẏ′) tuples in the 3D task
3. Learn a one-step transition model of the 3D task, M(x, y, ẋ, ẏ, a_3D) → (x′, y′, ẋ′, ẏ′)
4. For every possible action mapping χ_A and every possible state variable mapping χ_X:
   - Transform the recorded source-task tuples
   - Calculate the error of the transformed source-task tuples on the target-task model:
     Σ (M(s_transformed, a_transformed) − s′_transformed)²
5. Return the χ_A, χ_X with the lowest error (240 candidate evaluations: 16 state-variable mappings × 15 action mappings)
Q-Value Reuse
[Figure: agent–environment loops for the source and target tasks, each with its own Q-value function]
Q-Value Reuse
Source Q-value function: fixed. Target Q-value function: learned.
Q(s_target, a_target) = Q_learned(s_target, a_target) + Q_source(χ_X(s_target), χ_A(a_target))
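A minimal sketch of Q-Value Reuse for the 2D → 3D Mountain Car setting, using the hand-coded mappings from the earlier slide; `q_learned` and `q_source` are assumed to be callables, and the way the many-to-one χ_X is resolved here (keeping x and ẋ) is just one convention for illustration.

```python
# Hand-coded chi_A from the slides: 3D action -> 2D action.
CHI_A = {"Neutral": "Neutral", "West": "Left", "South": "Left",
         "East": "Right", "North": "Right"}

def project_state(s3d):
    """Apply chi_X (x, y -> x and x_dot, y_dot -> x_dot).  The mapping is
    many-to-one, so this sketch simply keeps the x / x_dot components."""
    return {"x": s3d["x"], "x_dot": s3d["x_dot"]}

def q_value_reuse(q_learned, q_source, s3d, a3d):
    """Q(s_t, a_t) = Q_learned(s_t, a_t) + Q_source(chi_X(s_t), chi_A(a_t)).
    q_learned is updated by the target-task learner; q_source stays fixed."""
    return q_learned(s3d, a3d) + q_source(project_state(s3d), CHI_A[a3d])
```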
Utilizing Mappings in 3D Mountain Car
[Figure: learning curves with hand-coded mappings vs. no transfer]
Experimental Setup
1. Learn in 2D Mountain Car for 100 episodes
2. Learn in 3D Mountain Car for 25 episodes
3. Apply MASTER
   - Train the transition model off-line using backprop in Weka
4. Transfer from 2D to 3D with Q-Value Reuse
5. Learn the 3D task
State Variable Mappings Evaluated
[Table: MSE of the learned 3D transition model for each candidate state-variable mapping (which of x, y, ẋ, ẏ map to the 2D x and ẋ); listed MSEs range from 0.0090 to roughly 0.041, with 0.0090 the lowest shown]
Action Mappings Evaluated

  Target Task Action | Source Task Action | MSE
  Neutral            | Left               | 0.0118
  Neutral            | Neutral            | 0.0079
  Neutral            | Right              | 0.0103
  West               | Left               | 0.0095
  West               | Neutral            | 0.0088
  West               | Right              | 0.0127
  East               | Left               | 0.0144
  East               | Neutral            | 0.0095
  East               | Right              | 0.0089
  …                  | …                  | …

Example transformation: the 2D tuple
  (-0.50, 0.01, Right, -0.49, 0.02)
becomes the 3D-format tuples
  (-0.50, -0.50, 0.01, 0.01, East, -0.49, -0.49, 0.02, 0.02)
  (-0.50, -0.50, 0.01, 0.01, North, -0.49, -0.49, 0.02, 0.02)
Transfer in 3D Mountain Car
[Figure: learning curves for Hand-Coded, 1/MSE, Average Actions, Average Both, and No Transfer]
Transfer in 3D Mountain Car: Zoom
[Figure: zoomed learning curves comparing Average Actions with No Transfer]
MASTER Wrap-up
- First fully autonomous mapping-learning method
- Learning is done off-line
- Can be used to select the most relevant source task or to transfer from multiple source tasks
- Future work:
  - Incorporate heuristic search
  - Use in more complex domains
  - Formulate as an optimization problem?
Outline
- Reinforcement Learning Background
- Inter-Task Mappings
- Value Function Transfer
- MASTER: Learning Inter-Task Mappings
- Related Work
- Future Work and Conclusion
Related Work: Framework
Transfer methods are categorized by:
- Allowed task differences
- Source task selection
- Type of knowledge transferred
- Allowed base learners
- (+ 3 others)
Selected Related Work: Transfer Methods
1. Same state variables and actions [Selfridge+, 1985]
2. Multi-task learning [Fernandez and Veloso, 2006]
3. Methods that avoid inter-task mappings [Konidaris and Barto, 2007]
4. Different state variables and actions [Torrey+]
[Figure: MDP schematic: s = ⟨x₁, …, xₙ⟩, T(s, a) = s′, action/state/reward loop]
Selected Related Work: Mapping Learning Methods
On-line:
- Test possible mappings on-line as new actions [Soni and Singh, 2006]
- k-armed bandit, where each arm is a mapping [Talvitie and Singh, 2007]
Off-line:
- Full Qualitative Dynamic Bayes Networks (QDBNs) [Liu and Stone, 2006]
  - Assume T types of task-independent objects
  - The Keepaway domain has 2 object types: keepers and takers
[Figure: Hold action in 2 vs. 1 Keepaway]
Outline
- Reinforcement Learning Background
- Inter-Task Mappings
- Value Function Transfer
- MASTER: Learning Inter-Task Mappings
- Related Work
- Future Work and Conclusion
Open Question 1: Optimize for Metrics
- To minimize target-task time: more source-task training?
- To minimize total time: a "moderate" amount of source-task training?
- Depends on task similarity
[Figure: 3 vs. 2 to 4 vs. 3 transfer results]
Open Question 2: Effects of Task Similarity
- Is transfer beneficial for a given pair of tasks?
- Can negative transfer be avoided?
[Figure: spectrum from "source identical to target" (transfer trivial) to "source unrelated to target" (transfer impossible)]
Open Question 3: Avoiding Negative Transfer
- Currently depends on heuristics and human knowledge
- Even very similar tasks may not transfer well
- More theoretical analysis is needed
  - Approximate bisimulation metrics? [Ferns et al.]
  - Utilize homomorphisms? [Soni and Singh, 2006]
Acknowledgements
- Advisor: Peter Stone
- Committee: Risto Miikkulainen, Ray Mooney, Bruce Porter, and Rich Sutton
- Other co-authors of material in the dissertation: Nick Jong, Greg Kuhlmann, Shimon Whiteson, and Yaxin Liu
- LARG
Conclusion
Inter-task mappings can be:
- Used with many different RL algorithms
- Used in many domains
- Learned from interaction with an environment
Plausibility and efficacy have been demonstrated.
Next up: broaden applicability and autonomy.
Thanks for your attention! Questions?