Published by Beatrice Matthews. Modified over 9 years ago.
1
Relational Macros for Transfer in Reinforcement Learning
Lisa Torrey, Jude Shavlik, Trevor Walker (University of Wisconsin-Madison, USA)
Richard Maclin (University of Minnesota-Duluth, USA)
2
Transfer Learning Scenario
- Agent learns Task A
- Agent encounters related Task B
- Agent recalls relevant knowledge from Task A
- Agent uses this knowledge to learn Task B quickly
3
Goals of Transfer Learning
[Figure: learning curves in the target task, plotting performance against training, with transfer and without transfer]
4
Reinforcement Learning
- Take an action
- Observe world state (described by a set of features)
- Receive a reward
- Policy: choose the action with the highest Q-value in the current state
- Use the rewards to estimate the Q-values of actions in states
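A minimal sketch of the policy and value estimation named on this slide. It assumes a tabular Q-function and a one-step Q-learning update, neither of which the slides specify; the state names, action names, `alpha`, and `gamma` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tabular Q-function: maps (state, action) to an estimated value.
q_values = defaultdict(float)

def greedy_action(state, actions):
    """Policy: choose the action with the highest Q-value in the current state."""
    return max(actions, key=lambda a: q_values[(state, a)])

def update_q(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning update (an assumed estimator, not the authors')."""
    best_next = max(q_values[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_values[(state, action)] += alpha * (target - q_values[(state, action)])

actions = ["move", "pass", "shoot"]
update_q("s0", "pass", 1.0, "s1", actions)  # a rewarded pass raises its estimate
print(greedy_action("s0", actions))
```

In RoboCup the state is a feature vector rather than a discrete label, so a real agent would replace the dictionary with a function approximator.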
5
The RoboCup Domain
[Figures: 2-on-1 BreakAway, 3-on-2 BreakAway, 4-on-3 BreakAway]
6
Transfer in Reinforcement Learning
- Related work
  - Model reuse (Taylor & Stone 2005)
  - Policy reuse (Fernandez & Veloso 2006)
  - Option transfer (Perkins & Precup 1999)
  - Relational RL (Driessens et al. 2006)
- Our previous work
  - Policy transfer (Torrey et al. 2005): copy the Q-function
  - Skill transfer (Torrey et al. 2006): learn rules that describe when to take individual actions
- Now we learn a strategy instead of individual skills
7
Representing a Multi-step Strategy
- A relational macro is a finite-state machine
- Nodes represent internal states of the agent in which limited independent policies apply
- Conditions for transitions and actions are in first-order logic
[Diagram: a two-node macro. Node rules: hold ← true; pass(Teammate) ← isOpen(Teammate). Transition conditions: isClose(Opponent); allOpponentsFar. Callouts: "Really these are rule sets, not just single rules"; "The learning agent jumps between players"]
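The finite-state-machine view can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. First-order variables such as Teammate and Opponent are flattened into a hypothetical feature dictionary, the threshold of 5 is invented, and the pairing of each transition condition with a direction (hold to pass, pass to hold) is a guess at the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]  # hypothetical flattened features
Rule = Tuple[Callable[[State], bool], str]

@dataclass
class Node:
    name: str
    action_rules: List[Rule]  # (condition, action) pairs: the node's policy
    transitions: List[Rule] = field(default_factory=list)  # (condition, next node)

def step(nodes: Dict[str, Node], current: str, state: State) -> Tuple[str, Optional[str]]:
    """Follow the first transition that fires, then pick an action in that node."""
    for condition, target in nodes[current].transitions:
        if condition(state):
            current = target
            break
    for condition, action in nodes[current].action_rules:
        if condition(state):
            return current, action
    return current, None  # no rule fired, so the macro gives no advice

nodes = {
    "hold": Node("hold",
                 [(lambda s: True, "hold")],                          # hold <- true
                 [(lambda s: s["opponent_distance"] < 5, "pass")]),   # isClose(Opponent)
    "pass": Node("pass",
                 [(lambda s: s["teammate_open"] > 0, "pass(teammate)")],  # isOpen(Teammate)
                 [(lambda s: s["opponent_distance"] >= 5, "hold")]),      # allOpponentsFar
}

print(step(nodes, "hold", {"opponent_distance": 3.0, "teammate_open": 1.0}))
```

Each node keeps a list of rules rather than a single rule, matching the slide's note that these are rule sets.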
8
Our Proposed Method
- Learn a relational macro that describes a successful strategy in the source task
- Execute the macro in the target task to demonstrate the successful strategy
- Continue learning the target task with standard RL after the demonstration
9
Learning a Relational Macro
- We use ILP to learn macros
  - Aleph: top-down search in a bottom clause
  - Heuristic and randomized search
  - Maximize F1 score
- We learn a macro in two phases
  - The action sequence (node structure)
  - The rule sets for actions and transitions
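For reference, the F1 score that the search maximizes is the harmonic mean of precision and recall over the positive and negative examples a candidate rule covers. A small sketch, with made-up counts:

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when no positives are covered."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical rule covering 8 positive and 2 negative examples, missing 4 positives.
print(round(f1_score(8, 2, 4), 3))
```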
10
Learning Macro Structure
- Objective: find an action pattern that separates good and bad games

macroSequence(Game) ←
    actionTaken(Game, StateA, move, ahead, StateB),
    actionTaken(Game, StateB, pass, _, StateC),
    actionTaken(Game, StateC, shoot, _, gameEnd).

[Diagram: macro nodes move(ahead), pass(Teammate), shoot(GoalPart)]
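Read literally, the macroSequence rule holds for a game whose final three actions are move(ahead), a pass to anyone, and a shot at any goal part. A sketch of that check, assuming a game is stored as a list of (action, argument) pairs. That representation is hypothetical, and the game contents are taken from the "Examples for Actions" slide.

```python
def matches_macro_sequence(game):
    """True if the game ends with move(ahead), pass(_), shoot(_)."""
    if len(game) < 3:
        return False
    (a1, arg1), (a2, _), (a3, _) = game[-3:]
    return a1 == "move" and arg1 == "ahead" and a2 == "pass" and a3 == "shoot"

games = {
    "Game 1": [("move", "ahead"), ("pass", "a1"), ("shoot", "goalRight")],
    "Game 2": [("move", "ahead"), ("pass", "a2"), ("shoot", "goalLeft")],
    "Game 3": [("move", "right"), ("pass", "a1")],
    "Game 4": [("move", "ahead"), ("pass", "a1"), ("shoot", "goalRight")],
}

for name, game in games.items():
    print(name, matches_macro_sequence(game))
```

A pattern is a good candidate when the games it matches are mostly the scoring ones.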
11
Learning Macro Conditions
- Objective: describe when transitions and actions should be taken

For the transition from move to pass:
transition(State) ←
    feature(State, distance(Teammate, goal)) < 15.

For the policy in the pass node:
action(State, pass(Teammate)) ←
    feature(State, angle(Teammate, me, Opponent)) > 30.

[Diagram: macro nodes move(ahead), pass(Teammate), shoot(GoalPart)]
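One way to read the two rules on this slide is as existential tests over the players in a state. The transition fires if some teammate is within 15 of the goal. The pass action is advised for any teammate whose angle to some opponent exceeds 30. A sketch under that reading, with a hypothetical feature dictionary and invented values:

```python
def transition_move_to_pass(features, teammates):
    """transition(State) <- distance(Teammate, goal) < 15, for some Teammate."""
    return any(features[("distance", t, "goal")] < 15 for t in teammates)

def pass_targets(features, teammates, opponents):
    """Teammates T with angle(T, me, Opponent) > 30 for some Opponent."""
    return [t for t in teammates
            if any(features[("angle", t, "me", o)] > 30 for o in opponents)]

# Invented feature values for one state.
features = {
    ("distance", "a1", "goal"): 12.0,
    ("distance", "a2", "goal"): 20.0,
    ("angle", "a1", "me", "d1"): 45.0,
    ("angle", "a2", "me", "d1"): 10.0,
}

print(transition_move_to_pass(features, ["a1", "a2"]))
print(pass_targets(features, ["a1", "a2"], ["d1"]))
```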
12
Examples for Actions
Game 1: move(ahead) pass(a1) shoot(goalRight)
Game 2: move(ahead) pass(a2) shoot(goalLeft)
Game 3: move(right) pass(a1)
Game 4: move(ahead) pass(a1) shoot(goalRight)
[Figure labels: games marked scoring or non-scoring; examples marked positive or negative]
[Diagram: macro nodes move(ahead), pass(Teammate), shoot(GoalPart)]
13
Examples for Transitions
Game 1: move(ahead) pass(a1) shoot(goalRight)
Game 2: move(ahead) move(ahead) shoot(goalLeft)
Game 3: move(ahead) pass(a1) shoot(goalRight)
[Figure labels: games marked scoring or non-scoring; examples marked positive or negative]
[Diagram: macro nodes move(ahead), pass(Teammate), shoot(GoalPart)]
14
Transferring a Macro
- Demonstration
  - Execute the macro strategy to get Q-value estimates
  - Infer low Q-values for actions not taken by macro
  - Compute an initial Q-function with these examples
  - Continue learning with standard RL
- Advantage: potential for large immediate jump in performance
- Disadvantage: risk that agent will blindly follow an inappropriate strategy
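The demonstration step could turn each macro-driven game into Q-value training examples along these lines. The slides do not say how the estimates are computed. This sketch assumes the discounted return observed after each step for the action the macro took, and a fixed low value (`low_q`, hypothetical) for every action it did not take. Fitting the initial Q-function to these examples is left out.

```python
def demonstration_examples(episode, all_actions, low_q=0.0, gamma=1.0):
    """Build (state, action, q_target) examples from one demonstration game.

    episode: list of (state, action, reward) in the order they occurred.
    """
    examples = []
    ret = 0.0
    # Walk backwards so each step's return includes all later rewards.
    for state, action, reward in reversed(episode):
        ret = reward + gamma * ret
        examples.append((state, action, ret))            # action the macro took
        for other in all_actions:
            if other != action:
                examples.append((state, other, low_q))   # inferred low Q-value
    return examples

actions = ["move", "pass", "shoot"]
episode = [("s0", "move", 0.0), ("s1", "pass", 0.0), ("s2", "shoot", 1.0)]
for example in demonstration_examples(episode, actions):
    print(example)
```

Because untaken actions start with low estimates, the agent initially follows the macro closely, which is the source of both the advantage and the risk listed above.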
15
Experiments
- Source task: 2-on-1 BreakAway
  - 3000 existing games from the learning curve
  - Learn macros from 5 separate runs
- Target tasks: 3-on-2 and 4-on-3 BreakAway
  - Demonstration period of 100 games
  - Continue training up to 3000 games
  - Perform 5 target runs for each source run
16
2-on-1 BreakAway Macro
[Diagram: macro nodes pass(Teammate), move(Direction), shoot(goalRight), shoot(goalLeft). Callouts: "In one source run this node was absent"; "The ordering of these nodes varied"; "This shot is apparently a leading pass"; "The learning agent jumped players here"]
17
Results: 2-on-1 to 3-on-2
18
Results: 2-on-1 to 4-on-3
19
Conclusions
- This transfer method can significantly improve initial target-task performance
- It can handle new elements being added to the target task, but not new objectives
- It is an aggressive approach that is a good choice for tasks with similar strategies
20
Future Work
- Alternative ways to apply relational macros in the target task
  - Keep the initial benefits
  - Alleviate risks when tasks differ more
- Alternative ways to make decisions about steps within macros
  - Statistical relational learning techniques
21
Acknowledgements
- DARPA Grant HR0011-04-1-0007
- DARPA IPTO contract FA8650-06-C-7606
Thank You
22
Rule scores
- Each transition and action has a set of rules, one or more of which may fire
- If multiple rules fire, we obey the one with the highest score
- The score of a rule is the probability that following it leads to a successful game
- Score = (# source-task games that followed the rule and scored) / (# source-task games that followed the rule)
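The scoring and tie-breaking described on this slide can be sketched as below. The rule representation (a dictionary with `name`, `condition`, and `score`), the game counts, and the thresholds are all invented for illustration.

```python
def rule_score(followed_and_scored: int, followed: int) -> float:
    """Fraction of source-task games that followed the rule and went on to score."""
    return followed_and_scored / followed if followed else 0.0

def best_firing_rule(rules, state):
    """Among rules whose condition holds in this state, obey the highest-scoring one."""
    firing = [r for r in rules if r["condition"](state)]
    return max(firing, key=lambda r: r["score"], default=None)

# Hypothetical rule set for one node, with made-up game counts.
rules = [
    {"name": "r1", "condition": lambda s: s["dist"] < 15, "score": rule_score(30, 50)},
    {"name": "r2", "condition": lambda s: s["angle"] > 30, "score": rule_score(45, 60)},
]

chosen = best_firing_rule(rules, {"dist": 10.0, "angle": 40.0})
print(chosen["name"], chosen["score"])
```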