Transfer in Reinforcement Learning via Markov Logic Networks
Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker
University of Wisconsin-Madison, USA
Possible Benefits of Transfer in RL
- Learning curves in the target task
[Figure: target-task performance vs. training, with transfer and without transfer]
The RoboCup Domain
- 2-on-1 BreakAway
- 3-on-2 BreakAway
Reinforcement Learning
- An agent interacts with an environment: it observes a state, takes an action, and receives a reward.
- States are described by features:
    distance(me, teammate1) = 15
    distance(me, opponent1) = 5
    angle(opponent1, me, teammate1) = 30
    ...
- Actions are: Move, Pass, Shoot
- Rewards are: +1 for scoring, 0 otherwise
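As an aside, here is a minimal sketch of the kind of learning loop this state/action/reward interface implies, using tabular Q-learning with epsilon-greedy exploration. The action names match the slide, but the table representation, hyperparameters, and helper functions are illustrative assumptions, not the learner used in this work.

```python
import random
from collections import defaultdict

# Tabular Q-learning sketch over a BreakAway-style interface.
# Discretized state keys, hyperparameters, and helpers are illustrative only.
ACTIONS = ["move", "pass", "shoot"]
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

q_table = defaultdict(float)            # (state, action) -> Q-value estimate

def choose_action(state):
    """Epsilon-greedy selection over the three BreakAway actions."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def q_update(state, action, reward, next_state):
    """One-step Q-learning backup: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])
```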
Our Previous Methods
- Skill transfer
    - Learn a rule for when to take each action
    - Use rules as advice
- Macro transfer
    - Learn a relational multi-step action plan
    - Use macro to demonstrate
Transfer via Markov Logic Networks
[Diagram: source-task learner → learn → source-task Q-function and data → analyze → MLN Q-function → demonstrate → target-task learner]
Markov Logic Networks
- A Markov network models a joint distribution
- A Markov Logic Network (MLN) combines probability with logic
- Template: a set of first-order formulas with weights
- Each grounded predicate in a formula becomes a node
- Predicates in a grounded formula are connected by arcs
- Probability of a world: P(world) = (1/Z) exp(Σ_i W_i N_i), where W_i is the weight of formula i and N_i is its number of true groundings
[Diagram: example Markov network over nodes X, Y, Z, A, B]
(Richardson and Domingos, Machine Learning 2006)
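To make the world-probability formula concrete, a toy sketch: a two-atom MLN with hypothetical formulas and weights, where all worlds are enumerated to compute the partition function Z and then P(world) = (1/Z) exp(Σ_i W_i N_i).

```python
import itertools
import math

# Toy MLN over two ground atoms A and B with two hypothetical formulas:
#   f1: A          with weight 1.5
#   f2: A -> B     with weight 0.8
WEIGHTS = [1.5, 0.8]

def formula_counts(a, b):
    """Return [N_1, N_2]: how many groundings of each formula are true in world (a, b)."""
    return [int(a), int(a <= b)]   # for booleans, a <= b is the implication a -> b

def unnormalized(world):
    return math.exp(sum(w * n for w, n in zip(WEIGHTS, formula_counts(*world))))

# The partition function Z sums the unnormalized score over all 2^2 worlds.
Z = sum(unnormalized(world) for world in itertools.product([False, True], repeat=2))

for world in itertools.product([False, True], repeat=2):
    print(world, unnormalized(world) / Z)   # P(world) = (1/Z) exp(sum_i W_i N_i)
```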
MLN Q-function
- Formula 1: IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0)
    W_1 = 0.75, N_1 = 1 (one teammate)
- Formula 2: IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0)
    W_2 = 1.33, N_2 = 3 (three goal parts)
- Probability that Q ∈ (0.8, 1.0):
    exp(W_1 N_1 + W_2 N_2) / (1 + exp(W_1 N_1 + W_2 N_2))
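Plugging the slide's example numbers into that expression, as a quick check:

```python
import math

# Example values from the slide: W1 = 0.75, N1 = 1; W2 = 1.33, N2 = 3.
w = [0.75, 1.33]
n = [1, 3]

activation = sum(wi * ni for wi, ni in zip(w, n))          # 0.75*1 + 1.33*3 = 4.74
p_bin = math.exp(activation) / (1 + math.exp(activation))  # logistic of the activation
print(round(p_bin, 3))  # ~0.991: the MLN is very confident Q falls in (0.8, 1.0)
```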
Grounded Markov Network
[Diagram: the node Q ∈ (0.8, 1.0) connected to the grounded predicates distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalRight) < 10, angle(me, goalie, goalRight) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45]
Learning an MLN
- Find good Q-value bins using hierarchical clustering
- Learn rules that classify examples into bins using inductive logic programming (ILP)
- Learn weights for these formulas to produce the final MLN
Binning via Hierarchical Clustering
[Figure: histograms of example frequency vs. Q-value, illustrating how hierarchical clustering groups Q-values into bins]
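A minimal sketch of this binning step using off-the-shelf agglomerative clustering from SciPy; the Q-values, linkage method, and number of bins are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical Q-values from source-task training examples.
q_values = np.array([0.05, 0.10, 0.12, 0.45, 0.50, 0.55, 0.82, 0.90, 0.95])

# Agglomerative clustering on the 1-D Q-values (complete linkage, 3 clusters).
Z = linkage(q_values.reshape(-1, 1), method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")

# Each cluster's min/max Q-value defines one bin of the MLN Q-function.
for c in sorted(set(labels)):
    members = q_values[labels == c]
    print(f"bin {c}: Q in ({members.min():.2f}, {members.max():.2f})")
```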
Classifying Into Bins via ILP
- Given examples
    - Positive: inside this Q-value bin
    - Negative: outside this Q-value bin
- The Aleph* ILP learning system finds rules that separate positive from negative
    - Builds rules one predicate at a time
    - Top-down search through the feature space
* Srinivasan, 2001
Learning Formula Weights
- Given formulas and examples
    - Same examples as for ILP
    - ILP rules as network structure
- Alchemy* finds weights that make the probability estimates accurate
    - Scaled conjugate-gradient algorithm
* Kok, Singla, Richardson, Domingos, Sumner, Poon, and Lowd
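For intuition only, a simplified sketch of weight learning for the per-bin logistic model shown earlier: each formula's true-grounding count N_i acts as a feature, and the weights are fit by gradient ascent on the conditional log-likelihood. Alchemy's actual optimizer is scaled conjugate gradient; plain gradient ascent and the toy data here are assumptions.

```python
import math

# Toy training data for one Q-value bin: each example is the vector of
# true-grounding counts [N_1, N_2] for the two formulas, plus a label saying
# whether the example's Q-value actually falls in this bin. (Invented data;
# real counts come from grounding the ILP rules in each state.)
examples = [([1, 3], 1), ([0, 1], 0), ([1, 0], 1), ([0, 0], 0)]
weights = [0.0, 0.0]
LEARNING_RATE = 0.1

def prob_in_bin(counts, weights):
    """P(Q in bin | counts) = exp(sum_i W_i N_i) / (1 + exp(sum_i W_i N_i))."""
    z = sum(w * n for w, n in zip(weights, counts))
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(500):   # gradient ascent on the conditional log-likelihood
    gradient = [0.0, 0.0]
    for counts, label in examples:
        error = label - prob_in_bin(counts, weights)
        for i, n in enumerate(counts):
            gradient[i] += error * n
    weights = [w + LEARNING_RATE * g for w, g in zip(weights, gradient)]

print([round(w, 2) for w in weights])
```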
Using an MLN Q-function
- The MLN gives a probability for each Q-value bin, e.g.:
    Q ∈ (0.8, 1.0): P_1 = 0.75
    Q ∈ (0.5, 0.8): P_2 = 0.15
    Q ∈ (0, 0.5):   P_3 = 0.10
- Q = P_1 · E[Q | bin 1] + P_2 · E[Q | bin 2] + P_3 · E[Q | bin 3]
- E[Q | bin] is the Q-value of the most similar training example in that bin
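A quick numeric sketch of this estimate, using the bin probabilities above and hypothetical values for E[Q | bin]:

```python
# Bin probabilities from the slide; the E[Q | bin] values are hypothetical
# (in the method they come from the most similar training example per bin).
bin_probs = [0.75, 0.15, 0.10]
expected_q_per_bin = [0.90, 0.65, 0.25]

q_estimate = sum(p * eq for p, eq in zip(bin_probs, expected_q_per_bin))
print(q_estimate)   # 0.75*0.90 + 0.15*0.65 + 0.10*0.25 = 0.7975
```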
Example Similarity
- E[Q | bin] = Q-value of the most similar training example in the bin
- Similarity = dot product of example vectors
- An example's vector shows which bin rules (Rule 1, Rule 2, Rule 3, ...) the example satisfies
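A small sketch of the similarity computation; the rule-satisfaction vectors and Q-values below are made up for illustration.

```python
import numpy as np

# Binary vectors: entry j is 1 if the example satisfies bin rule j.
new_example = np.array([1, 1, 0, 1])

# Hypothetical training examples in this bin: (rule-satisfaction vector, Q-value).
training = [
    (np.array([1, 0, 0, 1]), 0.82),
    (np.array([1, 1, 0, 1]), 0.91),
    (np.array([0, 1, 1, 0]), 0.85),
]

# Similarity is the dot product; E[Q | bin] is the Q of the most similar example.
similarities = [int(np.dot(new_example, vec)) for vec, _ in training]
best = max(range(len(training)), key=lambda i: similarities[i])
expected_q = training[best][1]
print(similarities, expected_q)   # [2, 3, 1] -> E[Q | bin] = 0.91
```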
Experiments
- Source task: 2-on-1 BreakAway
    - 3000 existing games from the learning curve
    - Learn MLNs from 5 separate runs
- Target task: 3-on-2 BreakAway
    - Demonstration period of 100 games
    - Continue training up to 3000 games
    - Perform 5 target runs for each source run
Discoveries
- Results can vary widely with the source-task chunk from which we transfer
- Most methods use the "final" Q-function from the last chunk
- MLN transfer performs better from chunks halfway through the learning curve
Results in 3-on-2 BreakAway
[Figure: target-task learning curves in 3-on-2 BreakAway]
Conclusions
- MLN transfer can significantly improve initial target-task performance
- Like macro transfer, it is an aggressive approach for tasks with similar strategies
- It "lifts" transferred information to first-order logic, making it more general for transfer
- Theory refinement in the target task may be viable through MLN revision
Potential Future Work
- Model screening for transfer learning
- Theory refinement in the target task
- Fully relational RL in RoboCup using MLNs as Q-function approximators
Acknowledgements
- DARPA Grant HR C-0060
- DARPA Grant FA C-7606
Thank You