Relational Macros for Transfer in Reinforcement Learning
Lisa Torrey, Jude Shavlik, Trevor Walker, University of Wisconsin-Madison, USA
Richard Maclin, University of Minnesota-Duluth, USA

Transfer Learning Scenario
Agent learns Task A.
Agent encounters related Task B.
Agent recalls relevant knowledge from Task A.
Agent uses this knowledge to learn Task B quickly.

Goals of Transfer Learning
Learning curves in the target task: performance versus training time, comparing learning with transfer against learning without transfer.

Reinforcement Learning
Take an action, observe the resulting world state (described by a set of features), and receive a reward.
Use the rewards to estimate the Q-values of actions in states.
Policy: choose the action with the highest Q-value in the current state.
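As a concrete illustration of this loop, here is a minimal tabular Q-learning sketch. The environment interface (`env_step`) and the tabular Q representation are assumptions made for illustration; the slides describe states by features rather than a table.

```python
import random
from collections import defaultdict

def q_learning_step(q, state, actions, env_step, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Epsilon-greedy policy: usually take the action with the highest Q-value.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q[(state, a)])
    # Take the action, observe the resulting state and reward.
    next_state, reward = env_step(state, action)
    # Use the reward to update the Q-value estimate for (state, action).
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return next_state

q_values = defaultdict(float)  # Q-value estimates, defaulting to 0
```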

The RoboCup Domain
2-on-1 BreakAway, 3-on-2 BreakAway, 4-on-3 BreakAway

Transfer in Reinforcement Learning
Related work:
Model reuse (Taylor & Stone 2005)
Policy reuse (Fernandez & Veloso 2006)
Option transfer (Perkins & Precup 1999)
Relational RL (Driessens et al. 2006)
Our previous work:
Policy transfer (Torrey et al. 2005): copy the Q-function
Skill transfer (Torrey et al. 2006): learn rules that describe when to take individual actions
Now we learn a strategy instead of individual skills.

Representing a Multi-step Strategy
A relational macro is a finite-state machine.
Nodes represent internal states of the agent in which limited independent policies apply.
Conditions for transitions and actions are in first-order logic, e.g. hold ← true, or pass(Teammate) ← isOpen(Teammate); other conditions in the diagram include isClose(Opponent) and allOpponentsFar.
Really these are rule sets, not just single rules.
The learning agent jumps between players as the macro executes.
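To make the finite-state-machine view concrete, the sketch below shows one way a relational macro could be represented in code: an ordered list of nodes, each carrying rule sets for its action and its outgoing transition. The class, predicate names, and dictionary-based state are illustrative assumptions, not the authors' implementation.

```python
class MacroNode:
    """One internal state of the agent, with rule sets (not single rules)."""
    def __init__(self, name, action_rules, transition_rules):
        self.name = name
        self.action_rules = action_rules          # rules choosing an action in this node
        self.transition_rules = transition_rules  # rules for leaving this node

# Example rule sets in the spirit of the slide's conditions (hypothetical predicates).
hold_node = MacroNode(
    "hold",
    action_rules=[lambda state: True],                        # hold <- true
    transition_rules=[lambda state: state["isClose(Opponent)"]],
)
pass_node = MacroNode(
    "pass",
    action_rules=[lambda state: state["isOpen(Teammate)"]],   # pass(Teammate) <- isOpen(Teammate)
    transition_rules=[lambda state: state["allOpponentsFar"]],
)
macro = [hold_node, pass_node]  # the macro is the ordered sequence of nodes
```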

Our Proposed Method
Learn a relational macro that describes a successful strategy in the source task.
Execute the macro in the target task to demonstrate the successful strategy.
Continue learning the target task with standard RL after the demonstration.

Learning a Relational Macro
We use ILP to learn macros.
Aleph: top-down search in a bottom clause, with heuristic and randomized search, maximizing the F1 score.
We learn a macro in two phases: first the action sequence (node structure), then the rule sets for actions and transitions.
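The F1 score that guides the rule search is the harmonic mean of precision and recall over covered examples; a small sketch of the standard formula (not Aleph-specific code) follows.

```python
def f1_score(true_pos, false_pos, false_neg):
    # Precision: fraction of examples covered by the rule that are actually positive.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: fraction of positive examples that the rule covers.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```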

Learning Macro Structure
Objective: find an action pattern that separates good and bad games.
macroSequence(Game) ←
    actionTaken(Game, StateA, move, ahead, StateB),
    actionTaken(Game, StateB, pass, _, StateC),
    actionTaken(Game, StateC, shoot, _, gameEnd).
Macro nodes: move(ahead) → pass(Teammate) → shoot(GoalPart)
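Read informally, the clause above says a game matches the learned structure when its action trace contains move(ahead), then a pass, then a shot that ends the game. The sketch below checks this under an assumed trace format (a list of (action, argument) pairs); it is an illustration, not the authors' matching code.

```python
def matches_macro_sequence(trace):
    """trace: the game's actions in order, e.g. [("move", "ahead"), ("pass", "a1"), ("shoot", "goalRight")]."""
    if len(trace) < 3:
        return False
    (a1, arg1), (a2, _), (a3, _) = trace[-3:]
    # move(ahead), then a pass, then a shot, with the shot ending the game.
    return (a1, arg1) == ("move", "ahead") and a2 == "pass" and a3 == "shoot"
```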

Learning Macro Conditions
Objective: describe when transitions and actions should be taken.
For the transition from move to pass:
transition(State) ← feature(State, distance(Teammate, goal)) < 15.
For the policy in the pass node:
action(State, pass(Teammate)) ← feature(State, angle(Teammate, me, Opponent)) > 30.
Macro nodes: move(ahead) → pass(Teammate) → shoot(GoalPart)
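Read as executable tests over a state's features, the two rules above look roughly like the following sketch. The feature names and thresholds come from the slide; the dictionary-based state representation is an assumption for illustration.

```python
def move_to_pass_transition(state):
    # transition(State) <- feature(State, distance(Teammate, goal)) < 15
    return state["distance(Teammate, goal)"] < 15

def pass_action(state):
    # action(State, pass(Teammate)) <- feature(State, angle(Teammate, me, Opponent)) > 30
    return state["angle(Teammate, me, Opponent)"] > 30
```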

Examples for Actions
Game 1: move(ahead), pass(a1), shoot(goalRight)
Game 2: move(ahead), pass(a2), shoot(goalLeft)
Game 3: move(right), pass(a1)
Game 4: move(ahead), pass(a1), shoot(goalRight)
Each game is labeled scoring or non-scoring; games where the pass was taken in the macro's pass node supply positive examples when they scored and negative examples when they did not.
Macro nodes: move(ahead) → pass(Teammate) → shoot(GoalPart)

Examples for Transitions
Game 1: move(ahead), pass(a1), shoot(goalRight)
Game 2: move(ahead), move(ahead), shoot(goalLeft)
Game 3: move(ahead), pass(a1), shoot(goalRight)
Each game is labeled scoring or non-scoring; games that took the move-to-pass transition supply positive examples when they scored and negative examples when they did not.
Macro nodes: move(ahead) → pass(Teammate) → shoot(GoalPart)
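A rough sketch of how the positive and negative ILP examples in these two slides might be assembled from labeled games is shown below; the data structures and the exact labeling policy are assumptions made for illustration.

```python
def split_examples(games, took_step):
    """games: list of (trace, scored) pairs; took_step(trace) is True when the game
    reached and took the macro step (action or transition) currently being learned."""
    positives, negatives = [], []
    for trace, scored in games:
        if not took_step(trace):
            continue
        # Scoring games that took the step are positive; non-scoring ones are negative.
        (positives if scored else negatives).append(trace)
    return positives, negatives
```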

Transferring a Macro
Demonstration:
Execute the macro strategy to get Q-value estimates.
Infer low Q-values for actions not taken by the macro.
Compute an initial Q-function with these examples.
Continue learning with standard RL.
Advantage: potential for a large immediate jump in performance.
Disadvantage: risk that the agent will blindly follow an inappropriate strategy.
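One way to picture the demonstration step is sketched below: each state visited while executing the macro yields a Q-value example for the action the macro took (estimated from the demonstration) and inferred low values for the actions it did not take; an initial Q-function is then fit to these examples. The interface and constants are assumptions, not the authors' exact procedure.

```python
def demonstration_examples(run_macro_episode, n_games=100, low_q=0.0):
    """run_macro_episode() plays one game with the macro and yields
    (state, chosen_action, available_actions, estimated_return) tuples (assumed interface)."""
    examples = []  # (state, action, target Q-value)
    for _ in range(n_games):
        for state, chosen, available, ret in run_macro_episode():
            examples.append((state, chosen, ret))        # Q estimate from the demonstration
            for a in available:
                if a != chosen:
                    examples.append((state, a, low_q))   # inferred low Q for untaken actions
    return examples  # fit the initial Q-function to these, then continue with standard RL
```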

Experiments
Source task: 2-on-1 BreakAway, using 3000 existing games from the learning curve; learn macros from 5 separate runs.
Target tasks: 3-on-2 and 4-on-3 BreakAway, with a demonstration period of 100 games, continued training up to 3000 games, and 5 target runs for each source run.

2-on-1 BreakAway Macro
Learned node sequence: move(Direction) → pass(Teammate) → shoot(goalRight) → shoot(goalLeft).
Notes on the learned macros: in one source run one of these nodes was absent, and the ordering of the nodes varied across runs; the earlier shot is apparently a leading pass; the learning agent jumped players at the pass.

Results: 2-on-1 to 3-on-2

Results: 2-on-1 to 4-on-3

Conclusions
This transfer method can significantly improve initial target-task performance.
It can handle new elements being added to the target task, but not new objectives.
It is an aggressive approach that is a good choice for tasks with similar strategies.

Future Work
Alternative ways to apply relational macros in the target task: keep the initial benefits while alleviating the risks when tasks differ more.
Alternative ways to make decisions about steps within macros, e.g. statistical relational learning techniques.

Acknowledgements
DARPA Grant HR
DARPA IPTO contract FA C-7606
Thank You

Rule Scores
Each transition and action has a set of rules, one or more of which may fire.
If multiple rules fire, we obey the one with the highest score.
The score of a rule is the probability that following it leads to a successful game:
Score = (# source-task games that followed the rule and scored) / (# source-task games that followed the rule)
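The score formula translates directly into code; the game and rule representations below are assumptions made for illustration.

```python
def rule_score(rule, source_games):
    """source_games: list of (trace, scored) pairs; rule(trace) is True when the
    game followed the rule. Returns the fraction of rule-following games that scored."""
    followed = [scored for trace, scored in source_games if rule(trace)]
    if not followed:
        return 0.0
    return sum(followed) / len(followed)
```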