Multi-Agent Shared Hierarchy Reinforcement Learning
Neville Mehta, Prasad Tadepalli
School of Electrical Engineering and Computer Science, Oregon State University
2 Highlights
Sharing value functions
Coordination
Framework to express sharing & coordination with hierarchies
RTS domain
3 Previous Work
MAXQ, Options, ALisp
Coordination in the hierarchical setting (Makar, Mahadevan)
Sharing flat value functions (Tan)
Concurrent reinforcement learning for multiple effectors (Marthi, Russell, …)
4 Outline
Average Reward Learning
RTS domain
Hierarchical ARL
MASH framework
Experimental results
Conclusion & future work
5 SMDP
A Semi-Markov Decision Process (SMDP) extends MDPs by allowing temporally extended actions:
–States S
–Actions A
–Transition function P(s', N | s, a)
–Reward function R(s' | s, a)
–Time function T(s' | s, a)
Given an SMDP, an agent in state s following policy π has gain
\rho^{\pi}(s) = \lim_{N \to \infty} \frac{E\left[\sum_{i=0}^{N} r_i\right]}{E\left[\sum_{i=0}^{N} t_i\right]}
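Not on the original slide: a minimal Python sketch of estimating the gain of a fixed policy from a sampled SMDP trajectory of (reward, duration) pairs. The function name and the list-of-pairs trajectory format are assumptions chosen only for illustration.

```python
def estimate_gain(trajectory):
    """Estimate the gain (average reward per unit time) of a policy from a
    sampled SMDP trajectory.

    trajectory: list of (reward, duration) pairs, one per action taken,
    where duration is the (possibly multi-step) time the action took.
    """
    total_reward = sum(r for r, t in trajectory)
    total_time = sum(t for r, t in trajectory)
    return total_reward / total_time

# Example: three temporally extended actions.
print(estimate_gain([(4.0, 2), (-1.0, 1), (10.0, 5)]))  # 13 / 8 = 1.625
```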
6 Average Reward Learning
Taking action a in state s:
–Immediate reward r(s, a)
–Action duration t(s, a)
Average-adjusted reward = r(s, a) - \rho^{\pi} t(s, a)
h^{\pi}(s_0) = E\left[(r(s_0, a) - \rho t(s_0, a)) + (r(s_1, a) - \rho t(s_1, a)) + \cdots\right]
\Rightarrow h^{\pi}(s_0) = E\left[r(s_0, a) - \rho t(s_0, a)\right] + h^{\pi}(s_1)
The optimal policy π* maximizes the RHS and leads to the optimal gain: \rho^{\pi^*} \ge \rho^{\pi}
[Figure: chain of states s_0, s_1, ..., s_n annotated with average-adjusted rewards r(s_0, a_i) - \rho t(s_0, a_i), plus a parent task / child task diagram]
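Not on the original slide: a tabular, R-learning-style sketch included to make the average-adjusted reward r - ρt concrete. It is not the authors' H-learning algorithm; the class, the gain-update heuristic, and all parameter names are assumptions.

```python
from collections import defaultdict

class AverageAdjustedLearner:
    """Minimal tabular sketch of average-adjusted value learning for an SMDP.

    h(s, a) estimates the average-adjusted value of taking action a in state s;
    rho is a running estimate of the gain (reward per unit time).
    """

    def __init__(self, actions, alpha=0.1, beta=0.01):
        self.h = defaultdict(float)   # (state, action) -> value
        self.rho = 0.0                # gain estimate
        self.actions = actions
        self.alpha, self.beta = alpha, beta

    def greedy_value(self, s):
        return max(self.h[(s, a)] for a in self.actions)

    def update(self, s, a, r, t, s_next):
        # Average-adjusted reward for this transition: r - rho * t.
        adjusted = r - self.rho * t
        target = adjusted + self.greedy_value(s_next)
        self.h[(s, a)] += self.alpha * (target - self.h[(s, a)])
        # One simple heuristic: nudge the gain estimate toward the observed
        # reward rate when the executed action was greedy.
        if self.h[(s, a)] >= self.greedy_value(s):
            self.rho += self.beta * (r / t - self.rho)

# One simulated SMDP transition: "Pick" in state (2, 0) takes 3 time steps,
# yields reward 10, and leads to state (2, 1).
learner = AverageAdjustedLearner(actions=["North", "South", "East", "West", "Pick", "Put"])
learner.update((2, 0), "Pick", r=10.0, t=3, s_next=(2, 1))
print(learner.h[((2, 0), "Pick")], learner.rho)
```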
7 RTS Domain
Grid-world domain
Multiple peasants mine resources (wood, gold) to replenish the home stock
Avoid collisions with one another
Attack the enemy's base
8 RTS Domain Task Hierarchy
[Figure: MAXQ task hierarchy with Root at the top; subtasks include Harvest(l), Deposit, Goto(k), Offense(e), and Idle; primitive actions include East, South, North, West, Pick, Put, and Attack; legend distinguishes primitive and composite tasks]
MAXQ task hierarchy
–Original SMDP is split into sub-SMDPs (subtasks)
–Solving the Root task solves the entire SMDP
Each subtask M_i is defined by (see the sketch after this slide)
–State abstraction B_i
–Actions A_i
–Termination (goal) predicate G_i
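Not on the original slide: a hypothetical Python encoding of a subtask M_i as the triple (B_i, A_i, G_i), with an illustrative Goto(k) instance. The field names and the particular abstraction are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Subtask:
    """Hypothetical encoding of a MAXQ-style subtask M_i."""
    name: str
    abstraction: Callable[[dict], tuple]  # B_i: full state -> abstracted state
    actions: List[str]                    # A_i: child subtasks / primitive actions
    terminated: Callable[[dict], bool]    # G_i: termination (goal) predicate

def make_goto(k):
    # Illustrative Goto(k): sees only the agent's offset from the target cell k,
    # chooses among the four movement primitives, and terminates on arrival.
    return Subtask(
        name=f"Goto({k})",
        abstraction=lambda s: (s["pos"][0] - k[0], s["pos"][1] - k[1]),
        actions=["North", "South", "East", "West"],
        terminated=lambda s: s["pos"] == k,
    )

goto = make_goto((2, 3))
print(goto.abstraction({"pos": (5, 3)}))  # (3, 0)
print(goto.terminated({"pos": (2, 3)}))   # True
```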
9 Hierarchical Average Reward Learning
Value function decomposition for a recursively gain-optimal policy in hierarchical H-learning
If the state abstractions are sound, h_a(B_a(s)) = h_a(s)
[Figure: Bellman equation for the Root task]
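The Root-task Bellman equation on this slide is not recoverable from the scraped text. As a hedged placeholder, one common form of a hierarchical average-reward decomposition (an assumption based on related hierarchical H-learning work, not necessarily the authors' exact equation) is:

```latex
% Sketch only: h_a is the value (bias) of child task a; P_a(s' | s) is the
% probability that child a, started in state s, terminates in state s'.
% For a primitive action a: h_a(s) = r(s, a) - \rho \, t(s, a).
\[
  h_i(s) \;=\; \max_{a \in A_i} \Big[\, h_a(s) \;+\; \sum_{s'} P_a(s' \mid s)\, h_i(s') \,\Big]
\]
```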
10 Hierarchical Average Reward Learning
No pseudo-rewards
No completion function
Scheduling is a learned behavior
11 Hierarchical Average Reward Learning
Sharing requires coordination
Coordination is part of the state, not the action (Mahadevan); see the sketch after this slide
No need for each subtask to see the reward
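Not on the original slide: an illustrative sketch of coordination through the state rather than the action. Each agent's abstracted state includes coarse features of nearby agents (here, occupied adjacent cells), so collision avoidance can be learned with ordinary single-agent action selection. The encoding is an assumption, not the paper's abstraction.

```python
def coordination_state(own_pos, other_positions, radius=1):
    """State-based coordination: instead of selecting joint actions, each
    agent's abstracted state carries coarse features about nearby agents,
    here the relative offsets of agents within `radius` cells.
    """
    x, y = own_pos
    occupied = tuple(
        sorted((ox - x, oy - y) for ox, oy in other_positions
               if abs(ox - x) <= radius and abs(oy - y) <= radius)
    )
    return (own_pos, occupied)

# Example: a peasant at (3, 3) with another peasant directly to the east.
print(coordination_state((3, 3), [(4, 3), (10, 2)]))  # ((3, 3), ((1, 0),))
```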
12 Single Hierarchical Agent
[Figure: a single agent executing the task hierarchy; the current task stack Root → Harvest(W1) → Goto(W1) → North is shown alongside the full hierarchy of Root, Harvest(l), Deposit, Goto(k), East/South/North/West, Pick, Put, Offense(e), Idle, Attack]
13 Simple Multi-Agent Setup
[Figure: each agent has its own copy of the full task hierarchy; one agent's task stack is Root → Offense(E1) → Attack, another's is Root → Harvest(W1) → Goto(W1) → North]
14 MASH Setup
[Figure: a single shared task hierarchy; the agents' task stacks (Root → Offense(E1) → Attack and Root → Harvest(W1) → Goto(W1) → North) execute against the shared hierarchy]
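Not on the original slide: a minimal sketch of the value-function sharing that distinguishes the MASH setup from the simple multi-agent setup above: all agents executing a subtask read from and write to one shared table per subtask, so experience gathered by any agent improves every agent's policy for that subtask. Class and method names are assumptions.

```python
from collections import defaultdict

class SharedSubtaskValues:
    """One shared value table per subtask, used by all agents."""

    def __init__(self):
        # subtask name -> {(abstract_state, action): value}
        self.tables = defaultdict(lambda: defaultdict(float))

    def value(self, subtask, state, action):
        return self.tables[subtask][(state, action)]

    def update(self, subtask, state, action, target, alpha=0.1):
        table = self.tables[subtask]
        table[(state, action)] += alpha * (target - table[(state, action)])

# Two agents executing Goto(W1) update the same shared table:
shared = SharedSubtaskValues()
shared.update("Goto(W1)", (2, 0), "North", target=1.0)   # agent 1's experience
shared.update("Goto(W1)", (2, 0), "North", target=1.0)   # agent 2's experience
print(shared.value("Goto(W1)", (2, 0), "North"))          # 0.19
```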
15 Experimental Results
2 agents in a 15 × 15 grid: Pr(Resource Regeneration) = 5%, Pr(Enemy) = 1%, Rewards = (-1, 100, -5, 50), 30 runs
4 agents in a 25 × 25 grid: Pr(Resource Regeneration) = 7.5%, Pr(Enemy) = 1%, Rewards = (0, 100, -5, 50), 30 runs
The separate-agents coordination configuration could not be run for 4 agents in the 25 × 25 grid
16 Experimental Results
17 Experimental Results (2)
18 Conclusion
Sharing value functions
Coordination
Framework to express sharing & coordination with hierarchies
19 Future Work
Non-Markovian & non-stationary settings
Learning the task hierarchy
–Task-subtask relationships
–State abstractions
–Termination conditions
Combining the MASH framework with factored action models
Recognizing opportunities for sharing & coordination
20 Current Work
Marthi, Russell features