Multi-Agent Shared Hierarchy Reinforcement Learning
Neville Mehta, Prasad Tadepalli
School of Electrical Engineering and Computer Science, Oregon State University

2 Highlights
- Sharing value functions
- Coordination
- A framework to express sharing & coordination with hierarchies
- RTS domain

3 Previous Work
- MAXQ, Options, ALisp
- Coordination in the hierarchical setting (Makar, Mahadevan)
- Sharing flat value functions (Tan)
- Concurrent reinforcement learning for multiple effectors (Marthi, Russell, …)

4 Outline
- Average Reward Learning
- RTS domain
- Hierarchical ARL
- MASH framework
- Experimental results
- Conclusion & future work

5 SMDP
- A Semi-Markov Decision Process (SMDP) extends an MDP by allowing temporally extended actions:
  - States S
  - Actions A
  - Transition function P(s', N | s, a)
  - Reward function R(s' | s, a)
  - Time function T(s' | s, a)
- Given an SMDP, the gain of an agent in state s following policy $\pi$ is
  $\rho^{\pi}(s) = \lim_{N \to \infty} \frac{E\left(\sum_{i=0}^{N} r_i\right)}{E\left(\sum_{i=0}^{N} t_i\right)}$
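A minimal sketch (not from the paper) of estimating the gain defined above from a sampled trajectory of (reward, duration) pairs; the function and variable names are illustrative assumptions.

```python
from typing import List, Tuple

def estimate_gain(trajectory: List[Tuple[float, float]]) -> float:
    """Empirical gain estimate: total reward divided by total elapsed time.

    trajectory is a list of (reward, duration) pairs produced by some policy pi.
    """
    total_reward = sum(r for r, _ in trajectory)
    total_time = sum(t for _, t in trajectory)
    return total_reward / total_time if total_time > 0 else 0.0

# Example: three temporally extended actions with rewards 0, 0, 100 and durations 4, 3, 1
print(estimate_gain([(0.0, 4.0), (0.0, 3.0), (100.0, 1.0)]))  # 12.5 reward per time step
```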

6 Average Reward Learning
- Taking action a in state s yields
  - an immediate reward r(s, a)
  - an action duration t(s, a)
- Average-adjusted reward: $r(s, a) - \rho\, t(s, a)$
- Average-adjusted value of a policy $\pi$:
  $h^{\pi}(s_0) = E\left[(r(s_0, a) - \rho\, t(s_0, a)) + (r(s_1, a) - \rho\, t(s_1, a)) + \cdots\right]$
  $\Rightarrow h^{\pi}(s_0) = E\left[r(s_0, a) - \rho\, t(s_0, a)\right] + h^{\pi}(s_1)$
- The optimal policy $\pi^*$ maximizes the right-hand side and achieves the optimal gain: $\rho^{\pi^*} \geq \rho^{\pi}$
[Figure: state sequence s_0, s_1, ..., s_n with parent-task and child-task transitions, each labeled with its average-adjusted reward $r(s_0, a_i) - \rho\, t(s_0, a_i)$]
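The slide does not spell out an update rule, so the following is a hedged, R-learning-style sketch of a tabular average-adjusted value update consistent with the equations above; the names h, rho, alpha, and beta are assumptions, not the authors' code.

```python
from collections import defaultdict

h = defaultdict(float)    # average-adjusted value h(s)
rho = 0.0                 # running gain estimate
alpha, beta = 0.1, 0.01   # learning rates for h and rho

def update(s, r, t, s_next):
    """One transition: reward r over duration t from state s to s_next."""
    global rho
    # Average-adjusted temporal-difference error from h(s0) = E[r - rho*t] + h(s1)
    delta = r - rho * t + h[s_next] - h[s]
    h[s] += alpha * delta
    rho += beta * delta   # drift the gain estimate toward the observed reward rate
```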

7 RTS Domain
- Grid-world domain
- Multiple peasants mine resources (wood, gold) to replenish the home stock
- Peasants must avoid collisions with one another
- Attack the enemy's base

8 RTS Domain Task Hierarchy
[Figure: MAXQ task hierarchy over the RTS domain with tasks Root, Harvest(l), Deposit, Goto(k), East, South, North, West, Pick, Put, Offense(e), Idle, and Attack; the legend distinguishes primitive tasks from composite tasks]
- MAXQ task hierarchy
  - The original SMDP is split into sub-SMDPs (subtasks)
  - Solving the Root task solves the entire SMDP
- Each subtask M_i is defined by
  - a state abstraction B_i
  - actions A_i
  - a termination (goal) predicate G_i
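As a concrete illustration of the subtask definition above, here is a minimal sketch of how a MAXQ subtask M_i (state abstraction B_i, child actions A_i, termination predicate G_i) might be represented; the Subtask class and the Harvest example fields are hypothetical, not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Subtask:
    name: str
    abstraction: Callable[[Any], Any]    # B_i: maps the world state to an abstract state
    children: List[str]                  # A_i: names of child subtasks or primitive actions
    terminated: Callable[[Any], bool]    # G_i: termination (goal) predicate over world states

# Example (hypothetical state keys): Harvest terminates once the peasant is carrying a resource.
harvest = Subtask(
    name="Harvest",
    abstraction=lambda s: (s["peasant_loc"], s["carrying"]),
    children=["Goto", "Pick"],
    terminated=lambda s: bool(s["carrying"]),
)
```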

9 Hierarchical Average Reward Learning
- Value function decomposition for a recursively gain-optimal policy in Hierarchical H learning
- If the state abstractions are sound, $h_a(B_a(s)) = h_a(s)$
- Root task = Bellman equation
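A tiny illustrative lookup, building on the hypothetical Subtask sketch above: when the abstraction B_a is sound, each subtask can store its value function over abstract states only, since $h_a(B_a(s)) = h_a(s)$. This is a sketch of the bookkeeping, not the paper's learning algorithm.

```python
def subtask_value(subtask, h_tables, s):
    """Look up h_a(s) through the subtask's state abstraction.

    h_tables maps subtask name -> dict from abstract state to value.
    """
    abstract_state = subtask.abstraction(s)        # B_a(s)
    return h_tables[subtask.name][abstract_state]  # h_a(B_a(s)), equal to h_a(s) if B_a is sound
```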

10 Hierarchical Average Reward Learning
- No pseudo-rewards
- No completion function
- Scheduling is a learned behavior

11 Hierarchical Average Reward Learning
- Sharing requires coordination
- Coordination is part of the state, not the action (Mahadevan)
- No need for each subtask to see the reward
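One way to picture "coordination in the state, not the action": a coordinated subtask's abstraction folds in a coarse feature of the other agents, so no joint-action learning is needed. This is a hypothetical sketch of such a feature, not the paper's feature set.

```python
def coordinated_abstraction(s, agent_id):
    """Abstract state that includes a coarse coordination feature about the other peasants."""
    my_loc = s["peasant_locs"][agent_id]
    # Coordination feature: is any other peasant in an adjacent cell (potential collision)?
    others_adjacent = any(
        abs(loc[0] - my_loc[0]) + abs(loc[1] - my_loc[1]) == 1
        for i, loc in enumerate(s["peasant_locs"]) if i != agent_id
    )
    return (my_loc, s["carrying"][agent_id], others_adjacent)
```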

12 Single Hierarchical Agent
[Figure: one agent's task hierarchy, with the active execution path Root → Harvest(W1) → Goto(W1) → North highlighted]

13 Simple Multi-Agent Setup
[Figure: each agent has its own copy of the task hierarchy; one agent's active path is Root → Offense(E1) → Attack, another's is Root → Harvest(W1) → Goto(W1) → North]

14 MASH Setup
[Figure: a single shared task hierarchy executed by all agents; one agent's active path is Root → Offense(E1) → Attack, another's is Root → Harvest(W1) → Goto(W1) → North]
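A hedged sketch of the value-function sharing in the MASH setup: rather than each agent keeping its own tables, all agents read and write one shared table per subtask (and one shared gain estimate), so their experience pools. The data structures and names are assumptions for illustration only.

```python
from collections import defaultdict

shared_h = defaultdict(lambda: defaultdict(float))  # shared_h[subtask_name][abstract_state]
shared_rho = 0.0                                     # single gain estimate shared by all agents
alpha, beta = 0.1, 0.01

def shared_update(subtask_name, abstract_s, r, t, abstract_s_next):
    """Any agent executing this subtask writes into the same table."""
    global shared_rho
    table = shared_h[subtask_name]
    delta = r - shared_rho * t + table[abstract_s_next] - table[abstract_s]
    table[abstract_s] += alpha * delta
    shared_rho += beta * delta
```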

15 Experimental Results
- 2 agents in a 15 × 15 grid; Pr(resource regeneration) = 5%; Pr(enemy) = 1%; rewards = (-1, 100, -5, 50); 30 runs
- 4 agents in a 25 × 25 grid; Pr(resource regeneration) = 7.5%; Pr(enemy) = 1%; rewards = (0, 100, -5, 50); 30 runs
- The separate-agents-with-coordination configuration could not be run for 4 agents on the 25 × 25 grid

16 Experimental Results

17 Experimental Results (2)

18 Conclusion
- Sharing value functions
- Coordination
- A framework to express sharing & coordination with hierarchies

19 Future Work
- Non-Markovian & non-stationary settings
- Learning the task hierarchy
  - Task-subtask relationships
  - State abstractions
  - Termination conditions
- Combining the MASH framework with factored action models
- Recognizing opportunities for sharing & coordination

20 Current Work
- Marthi & Russell features