1
Lisa Torrey
University of Wisconsin – Madison
Doctoral Defense, May 2009
2
Given: Task S (source). Learn: Task T (target).
3
Reinforcement learning: the agent interacts with the environment in a loop. Starting with Q(s1, a) = 0, the policy chooses π(s1) = a1; the environment returns δ(s1, a1) = s2 and reward r(s1, a1) = r2; the agent updates Q(s1, a1) ← Q(s1, a1) + Δ; then π(s2) = a2, δ(s2, a2) = s3, r(s2, a2) = r3, and so on. The agent balances exploration and exploitation to maximize reward. Reference: Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press 1998.
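A minimal sketch of that interaction loop as tabular Q-learning (the environment interface, learning rate, discount, and epsilon are illustrative assumptions; the thesis itself uses the batch RL-SVR learner described on a later slide):

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One episode of tabular Q-learning: act, observe delta(s, a) and r(s, a), update Q."""
    state = env.reset()                    # hypothetical environment interface
    done = False
    while not done:
        # Exploration vs. exploitation: epsilon-greedy action choice.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)   # delta(s, a) and r(s, a)
        # Q(s, a) <- Q(s, a) + Delta, where Delta is the temporal-difference update.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q

# Usage: Q = defaultdict(float); q_learning_episode(my_env, Q, my_actions)
```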
4
[Figure: performance vs. training. Transfer can give a higher start, a higher slope, and a higher asymptote.]
5
RoboCup soccer tasks: 2-on-1 BreakAway, 3-on-2 BreakAway, 3-on-2 KeepAway, 3-on-2 MoveDownfield. Defenders are hand-coded; a single agent learns. Q-functions are approximated linearly over state features: Qa(s) = w1·f1 + w2·f2 + w3·f3 + …
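A small sketch of that linear Q-function form (the feature values and weights below are made up for illustration):

```python
def linear_q_value(weights, features):
    """Q_a(s) = w1*f1 + w2*f2 + ... : dot product of learned weights and state features."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical example: three state features and their learned weights for one action.
features = [0.8, 12.5, 30.0]     # f1, f2, f3 computed from the current state
weights  = [0.4, -0.02, 0.01]    # w1, w2, w3 learned for action a
q_pass = linear_q_value(weights, features)
```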
6
Starting-point methods
Imitation methods
Hierarchical methods
Alteration methods
New RL algorithms
7
Relational representation: grounded actions pass(t1) and pass(t2) generalize to pass(Teammate), and Opponent 1 and Opponent 2 generalize to a variable Opponent, as in IF feature(Opponent) THEN …
8
Advice transfer: advice taking, inductive logic programming; skill-transfer algorithm (ECML 2006; ECML 2005)
Macro transfer: macro-operators, demonstration; macro-transfer algorithm (ILP 2007)
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros; MLN Q-function transfer algorithm (AAAI workshop 2008), MLN policy-transfer algorithm (ILP 2009)
9
Advice transfer: advice taking, inductive logic programming, skill-transfer algorithm
Macro transfer: macro-operators, demonstration, macro-transfer algorithm
Markov Logic Network transfer: Markov Logic Networks, MLNs in macros, MLN Q-function transfer algorithm, MLN policy-transfer algorithm
10
IF these conditions hold THEN pass is the best action
11
Try what worked in a previous task!
12
Batch Reinforcement Learning via Support Vector Regression (RL-SVR): the agent alternates between collecting batches of experience (Batch 1, Batch 2, …) and computing Q-functions. It finds Q-functions (one per action) that minimize ModelSize + C × DataMisfit.
13
Batch Reinforcement Learning with Advice (KBKR): the same batch loop, but the Q-functions minimize ModelSize + C × DataMisfit + µ × AdviceMisfit. Because advice only adds a weighted penalty rather than a hard constraint, the method is robust to negative transfer.
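A rough illustration of that objective (not the actual KBKR formulation; this toy function only shows how model size, data misfit, and advice misfit are traded off, with C and mu as placeholder constants):

```python
import numpy as np

def kbkr_style_objective(w, X, y, advice_terms, C=1.0, mu=0.5):
    """Illustrative objective: ModelSize + C * DataMisfit + mu * AdviceMisfit.

    w            : weight vector of a linear Q-function model
    X, y         : state features (2-D array) and target Q-values from the batch
    advice_terms : list of (features, bound) pairs meaning "Q(features) should be >= bound"
    """
    model_size = np.sum(np.abs(w))                       # 1-norm of the weights
    data_misfit = np.sum(np.abs(X @ w - y))              # how badly we fit the batch data
    advice_misfit = sum(max(0.0, bound - feats @ w)      # penalty only when advice is violated
                        for feats, bound in advice_terms)
    return model_size + C * data_misfit + mu * advice_misfit
```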
14
ILP searches over candidate clauses such as:
IF [ ] THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 15 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate)
IF distance(Teammate) ≤ 5 THEN pass(Teammate)
IF distance(Teammate) ≤ 10 THEN pass(Teammate)
…
Clauses are scored with F(β) = (1 + β²) × Precision × Recall / ((β² × Precision) + Recall).
Reference: De Raedt, Logical and Relational Learning, Springer 2008.
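The F(β) measure as a small function, for concreteness (the precision, recall, and β values in the example are made up):

```python
def f_beta(precision, recall, beta):
    """F(beta) = (1 + beta^2) * P * R / (beta^2 * P + R); larger beta favors recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0

# Example: a clause with precision 0.9 and recall 0.6.
print(f_beta(0.9, 0.6, beta=1))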
15
Skill-transfer algorithm: run ILP on source-task data to learn skill rules such as IF distance(Teammate) ≤ 5 AND angle(Teammate, Opponent) ≥ 30 THEN pass(Teammate), then give them to the target-task learner via advice taking.
16
Skill transfer from 3-on-2 MoveDownfield to 4-on-3 MoveDownfield. Example transferred rule:
IF distance(me, Teammate) ≥ 15
AND distance(me, Teammate) ≤ 27
AND distance(Teammate, rightEdge) ≤ 10
AND angle(Teammate, me, Opponent) ≥ 24
AND distance(me, Opponent) ≥ 4
THEN pass(Teammate)
17
Skill transfer from several tasks to 3-on-2 BreakAway Torrey et al. ECML 2006
18
Outline: advice transfer, macro transfer, Markov Logic Network transfer.
19
A macro is a sequence of abstract actions, pass(Teammate), move(Direction), shoot(goalRight), shoot(goalLeft), with rules that decide when to take each action and how to ground its variables:
IF [ … ] THEN pass(Teammate)
IF [ … ] THEN move(ahead)
IF [ … ] THEN shoot(goalRight)
IF [ … ] THEN shoot(goalLeft)
IF [ … ] THEN pass(Teammate)
IF [ … ] THEN move(left)
IF [ … ] THEN shoot(goalRight)
IF [ … ] THEN shoot(goalRight)
20
Demonstration: the policy transferred from the source task is used directly at the start of target-task training, after which standard learning continues. There is no more protection against negative transfer, but the best-case scenario could be very good.
21
Macro-transfer algorithm: learn a macro from source-task games with ILP, then use it by demonstration in the target task.
22
Learning structures. Positive examples: BreakAway games that score. Negative examples: BreakAway games that didn't score. ILP learns clauses such as:
IF actionTaken(Game, StateA, pass(Teammate), StateB)
AND actionTaken(Game, StateB, move(Direction), StateC)
AND actionTaken(Game, StateC, shoot(goalRight), StateD)
AND actionTaken(Game, StateD, shoot(goalLeft), StateE)
THEN isaGoodGame(Game)
23
Learning rules for arcs. Positive examples: states in good games that took the arc. Negative examples: states in good games that could have taken the arc but didn't. ILP learns rules for each node and arc, e.g. for the transition from pass(Teammate) to shoot(goalRight):
IF [ … ] THEN enter(State)
IF [ … ] THEN loop(State, Teammate)
24
Selecting and scoring rules. Candidate rules are considered in order of decreasing precision (Rule 1: precision = 1.0, Rule 2: precision = 0.99, Rule 3: precision = 0.96, …), and a rule is added to the ruleset only if it increases the F(10) score of the ruleset. Rule score = (# games that follow the rule and are good) / (# games that follow the rule).
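A sketch of that greedy selection loop (Rule.matches, Rule.precision, and Example.is_positive are hypothetical stand-ins, and the F(10) computation below is a simplified stand-in for the thesis's actual scoring):

```python
def select_rules(candidate_rules, examples, beta=10):
    """Greedily keep a rule only if it raises the F(beta) score of the whole ruleset."""
    def ruleset_f(rules):
        # A ruleset "fires" on an example if any of its rules matches.
        tp = sum(1 for ex in examples if ex.is_positive and any(r.matches(ex) for r in rules))
        fp = sum(1 for ex in examples if not ex.is_positive and any(r.matches(ex) for r in rules))
        fn = sum(1 for ex in examples if ex.is_positive and not any(r.matches(ex) for r in rules))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta ** 2 * precision + recall
        return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    ruleset, best = [], 0.0
    for rule in sorted(candidate_rules, key=lambda r: r.precision, reverse=True):
        score = ruleset_f(ruleset + [rule])
        if score > best:            # keep the rule only if F(10) improves
            ruleset.append(rule)
            best = score
    return ruleset
```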
25
[Figure: macro learned from 2-on-1 BreakAway for transfer to 3-on-2 BreakAway; a chain of nodes including pass(Teammate), move(ahead), move(right), move(left), move(away), shoot(goalLeft), shoot(goalRight), and shoot(GoalPart).]
26
Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. ILP 2007
27
Macro self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial performance 1%, single macro 32%, multiple macros 43%.
28
Outline: advice transfer, macro transfer, Markov Logic Network transfer.
29
A Markov Logic Network has formulas (F), such as evidence1(X) AND query(X) and evidence2(X) AND query(X), each with a weight (W), e.g. w0 = 1.1, w1 = 0.9. With ni(world) defined as the number of true groundings of the i-th formula in a world, the MLN assigns each world (a truth assignment to groundings query(x1), query(x2), … together with the evidence) a probability that grows with the weighted count of satisfied formulas. Reference: Richardson and Domingos, Markov Logic Networks, Machine Learning 2006.
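The standard form of that probability is P(world) ∝ exp(Σi wi · ni(world)). A minimal sketch of computing it over an explicit set of worlds (the grounding counts below are made-up values, not from the talk):

```python
import math

def world_probabilities(weights, counts_per_world):
    """P(world) proportional to exp(sum_i w_i * n_i(world)), normalized over all worlds."""
    scores = [math.exp(sum(w * n for w, n in zip(weights, counts)))
              for counts in counts_per_world]
    z = sum(scores)                 # partition function
    return [s / z for s in scores]

# Hypothetical example with the slide's weights and two possible worlds:
weights = [1.1, 0.9]
counts_per_world = [
    [2, 1],   # n_0, n_1: true groundings of each formula in world A
    [0, 0],   # world B: no formulas are true
]
print(world_probabilities(weights, counts_per_world))
```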
30
From ILP rules (IF [ … ] THEN …) to an MLN: Alchemy performs weight learning, producing weights such as w0 = 1.1. Reference: http://alchemy.cs.washington.edu
31
Example: the rules
IF distance(Teammate, goal) < 12 THEN pass(Teammate)
IF angle(Teammate, defender) > 30 THEN pass(Teammate)
match teammate t1 with score 0.92 and t2 with score 0.88; the pass(Teammate) MLN gives P(t1) = 0.35 and P(t2) = 0.65.
32
[Figure: ground Markov network connecting pass(t1) and pass(t2) to the evidence literals distance(t1, goal) < 12, distance(t2, goal) < 12, angle(t1, defender) > 30, and angle(t2, defender) > 30, via the formulas pass(Teammate) AND distance(Teammate, goal) < 12 and pass(Teammate) AND angle(Teammate, defender) > 30.]
33
Macro transfer from 2-on-1 BreakAway to 3-on-2 BreakAway
34
Macro self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial performance 1%, regular macro 32%, macro with MLN 43%.
35
MLN Q-function transfer: from source-task data, ILP and Alchemy learn one MLN per action that maps a state to a Q-value; these MLN Q-functions are then used by demonstration in the target task.
36
An MLN Q-function discretizes Q-values into bins (0 ≤ Qa < 0.2, 0.2 ≤ Qa < 0.4, 0.4 ≤ Qa < 0.6, …) and, for a given state, produces a probability distribution over bin numbers.
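One plausible way to recover a scalar Q-value from such a bin distribution is an expectation over bin midpoints; this is a sketch, not necessarily the thesis's exact estimator:

```python
def q_from_bins(bin_probs, bin_edges):
    """Expected Q-value from a probability distribution over Q-value bins."""
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]
    return sum(p * m for p, m in zip(bin_probs, midpoints))

# Hypothetical example with three bins matching the slide's ranges.
print(q_from_bins([0.1, 0.6, 0.3], [0.0, 0.2, 0.4, 0.6]))
```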
37
MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway. Example rules:
IF distance(me, GoalPart) ≥ 42
AND distance(me, Teammate) ≥ 39
THEN pass(Teammate) falls into [0, 0.11]

IF angle(topRight, goalCenter, me) ≤ 42
AND angle(topRight, goalCenter, me) ≥ 55
AND angle(goalLeft, me, goalie) ≥ 20
AND angle(goalCenter, me, goalie) ≤ 30
THEN pass(Teammate) falls into [0.11, 0.27]

IF distance(Teammate, goalCenter) ≤ 9
AND angle(topRight, goalCenter, me) ≤ 85
THEN pass(Teammate) falls into [0.27, 0.43]
38
MLN Q-function transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. AAAI workshop 2008
39
MLN policy transfer: from source-task data, ILP and Alchemy learn a single MLN (formulas F, weights W) that maps a state to a probability for each action; this MLN policy is then used by demonstration in the target task.
40
For each state, the MLN gives a probability to each action, e.g. move(ahead), pass(Teammate), shoot(goalLeft), …; the policy takes the highest-probability action.
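A trivial sketch of that decision rule (the action probabilities are made-up values):

```python
def mln_policy(action_probs):
    """Policy = highest-probability action from the MLN's per-action probabilities."""
    return max(action_probs, key=action_probs.get)

# Hypothetical per-action probabilities for one state.
print(mln_policy({"move(ahead)": 0.2, "pass(Teammate)": 0.5, "shoot(goalLeft)": 0.3}))
```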
41
MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway. Example rules:
IF angle(topRight, goalCenter, me) ≤ 70
AND timeLeft ≥ 98
AND distance(me, Teammate) ≥ 3
THEN pass(Teammate)

IF distance(me, GoalPart) ≥ 36
AND distance(me, Teammate) ≥ 12
AND timeLeft ≥ 91
AND angle(topRight, goalCenter, me) ≤ 80
THEN pass(Teammate)

IF distance(me, GoalPart) ≥ 27
AND angle(topRight, goalCenter, me) ≤ 75
AND distance(me, Teammate) ≥ 9
AND angle(Teammate, me, goalie) ≥ 25
THEN pass(Teammate)
42
MLN policy transfer from 2-on-1 BreakAway to 3-on-2 BreakAway Torrey et al. ILP 2009
43
MLN self-transfer in 2-on-1 BreakAway (probability of goal vs. training games): asymptote 56%, initial performance 1%, MLN Q-function 59%, MLN policy 65%.
44
Contributions and publications: skill-transfer algorithm via advice taking and ILP (ECML 2006; ECML 2005); macro-transfer algorithm via macro-operators and demonstration (ILP 2007); MLN Q-function transfer algorithm (AAAI workshop 2008) and MLN policy-transfer algorithm (ILP 2009).
45
Starting-point: Taylor et al. 2005, value-function transfer
Imitation: Fernandez and Veloso 2006, policy reuse
Hierarchical: Mehta et al. 2008, MaxQ transfer
Alteration: Walsh et al. 2006, aggregate states
New algorithms: Sharma et al. 2007, case-based RL
46
Transfer can improve reinforcement learning in both initial performance and learning speed.
Advice transfer: low initial performance, steep learning curves, robust to negative transfer.
Macro transfer and MLN transfer: high initial performance, shallow learning curves, vulnerable to negative transfer.
47
In both close-transfer and distant-transfer scenarios, the methods show the same relative ordering: Multiple Macro and MLN Policy perform comparably, at or above Single Macro and MLN Q-Function, which in turn perform at or above Skill Transfer.
48
Future work: transfer from multiple source tasks (Task S1, …) into a target Task T.
49
Future work: theoretical results on the source/target relationship. How high can the initial performance be? How quickly can the target-task learner improve? How many episodes are "saved" through transfer?
50
Future work: joint learning and inference in macros, using a single search with combined rule/weight learning over nodes such as pass(Teammate) and move(Direction).
51
Future work: refinement of transferred knowledge. For macros: revising rule scores, relearning rules, relearning structure. For MLNs: revising weights and relearning rules, e.g. turning a too-specific or too-general clause into a better clause (Mihalkova et al. 2007).
52
Future work: relational reinforcement learning, either Q-learning with an MLN Q-function or policy search with MLN policies or macros. A difficulty is that MLN Q-functions lose too much information by reducing Q-values to a probability distribution over bin numbers.
53
General challenges in RL transfer: diverse tasks, complex testbeds, automated mapping, and protection against negative transfer.
54
Advisor: Jude Shavlik. Collaborators: Trevor Walker and Richard Maclin. Committee: David Page, Mark Craven, Jerry Zhu, Michael Coen. Thanks also to the UW Machine Learning Group. Grants: DARPA HR0011-04-1-0007, NRL N00173-06-1-G002, DARPA FA8650-06-C-7606.
56
Starting-point methods: with no transfer, the target task begins with an all-zero Q-table; with transfer, the initial Q-table is filled from the source task (e.g. entries such as 2548, 9172, 5914) and then refined by target-task training.
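A minimal sketch of that starting-point initialization (the identity mapping between source and target state-action pairs is an assumption; real starting-point methods need an inter-task mapping):

```python
from collections import defaultdict

def init_target_q_table(source_q, mapping=None):
    """Starting-point transfer: seed the target Q-table with source values instead of zeros."""
    target_q = defaultdict(float)          # unmapped entries default to 0
    for (state, action), value in source_q.items():
        key = mapping(state, action) if mapping else (state, action)
        target_q[key] = value
    return target_q

# Usage: target_q = init_target_q_table(source_q); then continue Q-learning on the target task.
```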
57
Imitation methods: the source-task policy is used during parts of target-task training, and the agent learns from following it.
58
Hierarchical methods: low-level skills such as Run and Kick compose into behaviors such as Pass and Shoot, which in turn compose into the full Soccer task.
59
Alteration methods: modify Task S itself, replacing the original states, actions, or rewards with new ones.
60
Source to target via advice taking, with advice of the form: IF Q(pass(Teammate)) > Q(other) THEN pass(Teammate).
61
[Flowchart: deciding training examples for pass(X). Tests include: was the action pass(X)? was the outcome caught(X)? was pass(X) good? was pass(X) clearly best? was some action good? was pass(X) clearly bad? Depending on the answers, the state becomes a positive example for pass(X), a negative example for pass(X), or is rejected.]
62
Exact inference: with only two relevant worlds, x1 (where pass(t1) is true) and x0 (where pass(t1) is false), and the formulas pass(t1) AND angle(t1, defender) > 30 and pass(t1) AND distance(t1, goal) < 12, note that when pass(t1) is false no formulas are true.
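With only these two worlds, the standard MLN semantics gives P(pass(t1) is true) = exp(w1·n1 + w2·n2) / (exp(w1·n1 + w2·n2) + exp(0)). A tiny sketch with illustrative (not learned) weights:

```python
import math

def prob_query_true(weights, counts_if_true):
    """Exact inference over two worlds: the query true vs. false.

    When the query is false, no formulas are true, so that world's score is exp(0) = 1.
    """
    score_true = math.exp(sum(w * n for w, n in zip(weights, counts_if_true)))
    return score_true / (score_true + 1.0)

# Hypothetical example: both formulas have one true grounding when pass(t1) is true.
print(prob_query_true([1.1, 0.9], [1, 1]))
```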
63
Exact Inference