1
Mutually-guided Multi-agent Learning
Raghav Aras, Alain Dutech, François Charpillet (MAIA), June 2004
2
Outline
A review of some multiagent Q-learning approaches
Our approach for multiagent learning in a stochastic game
Some preliminary results
3
Multiagent Q-Learning (1)
Q-Learning (single-agent learning):
  Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [R_t + γ max_a Q(s_{t+1}, a)]
  Known to converge to optimal values
minimax-Q (for zero-sum, 2-player games):
  V_1(s) ← max_{P_1 ∈ Π(A_1)} min_{a_2 ∈ A_2} Σ_{a_1 ∈ A_1} P_1(a_1) Q_1(s, (a_1, a_2))
  Known to converge to optimal values
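For reference, a minimal Python sketch of the single-agent tabular Q-learning update above; the dictionary-based Q-table, the ε-greedy selection and the hyperparameter values are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # illustrative values

Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    """Epsilon-greedy selection over the available actions."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    """Q(s_t, a_t) <- (1 - alpha) Q(s_t, a_t) + alpha [R_t + gamma max_a Q(s_t+1, a)]."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
```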
4
Multiagent Q-Learning (2)
Nash-Q learning (for n-agent, general-sum SGs):
  Q_i(s, a_1, .., a_n) ← (1 − α) Q_i(s, a_1, .., a_n) + α [R_i + γ NashQ_i(s′)]
  where NashQ_i(s′) = Q_i(s′, π_1(s′) π_2(s′) … π_n(s′))
  Converges under strict conditions (existence, uniqueness of Nash equilibria)
5
Drawbacks of Nash-Q learning
Coordination in choice of Nash equilibrium
Observability of all actions and all rewards
Space complexity (each agent): n |S| |A|^n
6
The problem that we treat…
n-agent SG <S, A_1..A_n, R_1..R_n, P>:
  R_i: S × (A_1 × … × A_n) → ℝ
  P: S × (A_1 × … × A_n) × S → {0,1} (deterministic)
  G_i ⊆ S (agent i's set of equally good goal states)
  G = G_1 ∩ G_2 ∩ … ∩ G_n, |G| ≥ 1 (at least one common goal state)
  An agent's payoff is the same in all its goal states
  Agents' payoffs may be different in the common goal state
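A sketch of how this class of games could be written down as a data structure; the field names and the choice to encode the deterministic transition P as a successor-state function are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

State = int
JointAction = Tuple[int, ...]            # one basic action per agent

@dataclass
class StochasticGame:
    """n-agent SG <S, A_1..A_n, R_1..R_n, P> with deterministic transitions."""
    states: Set[State]
    action_sets: List[List[int]]                          # A_i for each agent i
    rewards: List[Callable[[State, JointAction], float]]  # R_i(s, (a_1..a_n))
    successor: Callable[[State, JointAction], State]      # deterministic P
    goal_sets: List[Set[State]]                           # G_i: equally good goal states

    def common_goals(self) -> Set[State]:
        """G = G_1 ∩ ... ∩ G_n, assumed non-empty (at least one common goal)."""
        common = set(self.states)
        for g in self.goal_sets:
            common &= g
        return common
```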
7
Our Interest
A more realistic assumption for SGs: actions and rewards of other agents are hidden
Investigating « independent » learning in SGs (leading also to scalability)
Using communication to forge cooperation
A single-agent learning algorithm giving maximum payoff to a maximum number of agents
8
Communication in Our Approach
Agents send and receive 'ping messages'
Sending a message is an action
A ping message …is an (n−1)-sized array of 0s and 1s …has no content
[Figure: 5-agent example. Agent 1 sends a ping; Agent 2 receives an array indexed by agents 1, 3, 4, 5; Agent 3 receives one indexed by agents 1, 2, 4, 5]
9
Communication-based Q-Values
Agent state = <game state, message received>
Agent action = <basic action, message to send>
2^(n−1) possible messages; M_i is agent i's message set
State set = S × M_i
Action set = A_i × M_i
Size of Q-value set: |S × M_i| × |A_i × M_i|
Agent policy π_i: S × M_i → A_i × M_i
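A small sketch of the message set and of the per-agent Q-table size that follows from the augmented state and action sets; enumerating messages as (n−1)-bit tuples is an assumption consistent with the ping-message definition above.

```python
from itertools import product

def message_set(n_agents):
    """M_i: all ping messages, i.e. (n-1)-sized arrays of 0s and 1s."""
    return list(product((0, 1), repeat=n_agents - 1))

def q_table_size(n_game_states, n_basic_actions, n_agents):
    """|S x M_i| * |A_i x M_i| Q-values per agent."""
    m = len(message_set(n_agents))        # 2^(n-1) possible messages
    return (n_game_states * m) * (n_basic_actions * m)

# Example: 3 agents, 1000 game states, 3 basic actions per agent
print(q_table_size(1000, 3, 3))           # (1000*4) * (3*4) = 48000
```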
10
What do we envisage the messages doing?
Alert others of proximity to a goal state
Discover the common goal state
Enforce preference for the common goal state
Main principle of our algorithm:
Play safe by inverting actual rewards
Create artificial rewards based on messages
11
The Q-comm Learning algorithm
Agent i's initial state: σ_i ← <S, ∅> (empty message)
Loop (each agent):
  Select b_i = <a_i, mess_send> (Boltzmann or ε-greedy)
  Execute b_i, observe reward R_i
  σ′_i ← <S′, mess_recd> (next state)
  RM_i ← (R_i · mess_send) + (R_i · mess_recd)
  Invert reward: R_i ← −1 · R_i
  Q_i(σ_i, b_i) ← (1 − α) Q_i(σ_i, b_i) + α [R_i + RM_i + γ max_b Q_i(σ′_i, b)]
  σ_i ← σ′_i, S ← S′
Until S ∈ G (a goal state)
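A minimal Python sketch of one Q-comm step for a single agent, under stated assumptions: the environment interface `env.step` (returning the next game state, the actual reward and the message received) is hypothetical, and R_i · mess is read as the reward times the number of 1-bits in the message, a detail the slide leaves implicit.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1    # illustrative hyperparameters

Q = defaultdict(float)                    # Q[(agent_state, agent_action)] -> value

def select(agent_state, agent_actions):
    """Epsilon-greedy over <basic action, message to send> pairs
    (the slide also allows Boltzmann selection)."""
    if random.random() < EPSILON:
        return random.choice(agent_actions)
    return max(agent_actions, key=lambda b: Q[(agent_state, b)])

def q_comm_step(env, agent_state, agent_actions):
    """One Q-comm update for agent i.

    agent_state  = (game state, message received)
    agent_action = (basic action, message to send)
    env.step is an assumed interface; messages are (n-1)-bit tuples.
    """
    action = select(agent_state, agent_actions)
    basic_action, mess_send = action
    next_game_state, reward, mess_recd = env.step(basic_action, mess_send)

    # Artificial message-based reward RM_i = (R_i . mess_send) + (R_i . mess_recd),
    # read here as reward scaled by the number of 1-bits in each message (assumption).
    rm = reward * sum(mess_send) + reward * sum(mess_recd)

    # Play safe: invert the actual reward.
    reward = -1 * reward

    next_state = (next_game_state, mess_recd)
    best_next = max(Q[(next_state, b)] for b in agent_actions)
    Q[(agent_state, action)] = (1 - ALPHA) * Q[(agent_state, action)] \
        + ALPHA * (reward + rm + GAMMA * best_next)
    return next_state
```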
12
Test Problem: Find the Winning Number (FWN)
An n-digit array (number), e.g. 9 3 5 8, controlled by n agents
Each agent controls a digit
Actions: +1, −1, 0
G_i: list of « winning » numbers for agent i (unknown)
Each number in G_i gives an equal payoff to agent i
G = G_1 ∩ G_2 ∩ … ∩ G_n contains a common « winning » number
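A sketch of the FWN game dynamics as a tiny environment; the modulo-10 digit arithmetic, the payoff value of 1.0 and the termination condition are assumptions, and communication is left out to keep the focus on the game itself.

```python
import random

class FWNEnv:
    """Find the Winning Number: an n-digit number, one digit per agent."""

    def __init__(self, goal_sets, n_digits):
        self.goal_sets = goal_sets                       # e.g. [{2, 16, 119}, ...]
        self.digits = [random.randint(0, 9) for _ in range(n_digits)]

    def number(self):
        """The n-digit array read as an integer (the game state)."""
        return int("".join(str(d) for d in self.digits))

    def step(self, joint_action):
        """joint_action: one of +1, -1, 0 per agent, applied to its own digit."""
        for i, a in enumerate(joint_action):
            self.digits[i] = (self.digits[i] + a) % 10   # modulo-10 digits: an assumption
        num = self.number()
        # Equal payoff for agent i in any of its winning numbers G_i (value 1.0 is illustrative).
        rewards = [1.0 if num in g else 0.0 for g in self.goal_sets]
        # Terminating on the common winning number is an assumption; the slides only
        # require stopping in a goal state.
        done = num in set.intersection(*self.goal_sets)
        return num, rewards, done

# Example: the 3-agent instance from the results slides
env = FWNEnv([{2, 16, 119}, {68, 102, 119}, {37, 86, 119}], n_digits=3)
```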
13
Results (1): 3-agent FWN, G_1 = {2, 16, 119}, G_2 = {68, 102, 119}, G_3 = {37, 86, 119}
14
Results (2): 3-agent FWN, G_1 = {2, 16, 119}, G_2 = {68, 102, 119}, G_3 = {37, 86, 119}
15
Results (3): Multiple Common Goals
Agents select one common goal
16
Results (4): 4-agent FWN
Not all agents satisfied!
17
Summary of Results: Empirically, Q-comm learning finds the common goal
Works with multiple common goals
Agents coordinate equilibrium choice
Works with up to 3 agents
Doesn't always work for 4 or more agents
18
Future work:
Increase scalability by localising communication
Investigate how it can work for n ≥ 4
Analyse convergence
19
Thank you! Your questions…