Mutually-guided Multi-agent Learning

Presentation transcript:

Mutually-guided Multi-agent Learning. Raghav Aras, Alain Dutech, François Charpillet (MAIA). June 2004.

Outline: a review of some multiagent Q-learning approaches; our approach for multiagent learning in a stochastic game; some preliminary results.

Multiagent Q-Learning (1). Q-Learning (single-agent learning): Q(st, at) ← (1 - α) Q(st, at) + α [Rt + γ maxa Q(st+1, a)]. Known to converge to optimal values. minimax-Q (for zero-sum, 2-player games): V1(s) ← maxP1 ∈ Π(A1) mina2 ∈ A2 Σa1 ∈ A1 P1(a1) Q1(s, (a1, a2)). Known to converge to optimal values.
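To make the tabular single-agent update concrete, here is a minimal sketch; the hyperparameter values and the names (alpha, gamma, q_update) are illustrative choices, not taken from the presentation.

```python
from collections import defaultdict

alpha = 0.1   # learning rate (the alpha in the slide's formula)
gamma = 0.9   # discount factor (the gamma in the slide's formula)
Q = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```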

Multiagent Q-Learning (2). Nash-Q learning (for n-agent, general-sum SGs): Qi(s, a1,..,an) ← (1 - α) Qi(s, a1,..,an) + α [Ri + γ NashQi(s')], where NashQi(s') = Qi(s', π1(s') π2(s') … πn(s')). Converges under strict conditions (existence, uniqueness of Nash equilibria).
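For illustration only, a simplified sketch of one Nash-Q step, restricted to two players and pure-strategy stage-game equilibria; this restriction and all names are my simplifications, whereas the actual algorithm handles mixed equilibria and n agents.

```python
import itertools
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = [defaultdict(float), defaultdict(float)]   # Q[i][(s, joint_action)] for agents 0 and 1

def pure_nash_values(s, actions0, actions1):
    """Return (v0, v1) of the first pure Nash equilibrium of the stage game at s."""
    for a0, a1 in itertools.product(actions0, actions1):
        best0 = all(Q[0][(s, (a0, a1))] >= Q[0][(s, (b, a1))] for b in actions0)
        best1 = all(Q[1][(s, (a0, a1))] >= Q[1][(s, (a0, b))] for b in actions1)
        if best0 and best1:
            return Q[0][(s, (a0, a1))], Q[1][(s, (a0, a1))]
    return 0.0, 0.0   # fallback when no pure equilibrium exists

def nash_q_update(s, joint_a, rewards, s_next, actions0, actions1):
    """Update both agents' Q-values toward reward plus discounted Nash value of s_next."""
    nash_vals = pure_nash_values(s_next, actions0, actions1)
    for i in (0, 1):
        Q[i][(s, joint_a)] = (1 - alpha) * Q[i][(s, joint_a)] + alpha * (
            rewards[i] + gamma * nash_vals[i]
        )
```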

Drawbacks of NashQ learning: coordination in the choice of Nash equilibrium; observability of all actions and all rewards; space complexity (each agent): n · |S| · |A|^n.

The problem that we treat… n-agent SG <S, A1..An, R1..Rn, P>: Ri: S x (A1 x … x An) → ℝ. P: S x (A1 x … x An) x S → {0,1} (deterministic). Γi ⊆ S (set of equally good goal states). Γ = Γ1 ∩ Γ2 ∩ … ∩ Γn, |Γ| ≥ 1 (at least one common goal state). An agent's payoff is the same in all its goal states. Agents' payoffs may be different in the common goal state.
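A minimal sketch of how this class of deterministic, goal-based stochastic games could be represented, assuming finite, enumerable sets; the field and method names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Set, Tuple

State = int
JointAction = Tuple[int, ...]          # one action index per agent

@dataclass
class GoalSG:
    states: Set[State]
    actions: Sequence[Sequence[int]]   # actions[i] = action set of agent i
    rewards: Sequence[Callable[[State, JointAction], float]]   # R_i
    transition: Callable[[State, JointAction], State]          # deterministic P
    goals: Sequence[FrozenSet[State]]  # Gamma_i: equally good goal states of agent i

    def common_goals(self) -> FrozenSet[State]:
        """Gamma = intersection of all agents' goal sets (assumed non-empty)."""
        common = self.goals[0]
        for g in self.goals[1:]:
            common &= g
        return common
```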

Our Interest. A more realistic assumption for SGs (actions and rewards of other agents hidden). Investigating « independent » learning in SGs (leading also to scalability). Using communication to forge cooperation. A single-agent learning algorithm giving maximum payoff to a maximum number of agents.

Communication in Our Approach. Agents send and receive 'ping messages'. Sending a message is an action. A ping message is an (n-1)-sized array of 0s and 1s and has no content. (Figure: example with 5 agents in which agent 1 sends a ping message that agents 2 and 3 receive.)

Communication based Q-Values. Agent state = <game state, message received>. Agent action = <basic action, message to send>. 2^(n-1) possible messages; Mi, agent i's message set. State set = S x Mi. Action set = Ai x Mi. Size of Q-value set: |S x Mi| x |Ai x Mi|. Agent policy πi: S x Mi → Ai x Mi.
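A small sketch of how the augmented state and action spaces could be enumerated, assuming ping messages are encoded as (n-1)-bit tuples, one bit per other agent; the function names are illustrative.

```python
from itertools import product

def message_set(n_agents: int):
    """All 2^(n-1) ping messages an agent can send: one bit per other agent."""
    return list(product((0, 1), repeat=n_agents - 1))

def augmented_spaces(game_states, basic_actions, n_agents):
    """Agent state = <game state, message received>; action = <basic action, message to send>."""
    msgs = message_set(n_agents)
    states = [(s, m) for s in game_states for m in msgs]
    actions = [(a, m) for a in basic_actions for m in msgs]
    return states, actions

# Example: 3 agents, game state = a 3-digit number, basic actions +1/-1/0.
states, actions = augmented_spaces(range(1000), (+1, -1, 0), n_agents=3)
# len(states) == 1000 * 4, len(actions) == 3 * 4  (2^(3-1) = 4 messages)
```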

What do we envisage the messages doing? Alert others of proximity to a goal state; discover the common goal state; enforce preference for the common goal state. Main principle of our algorithm: play safe by inverting actual rewards; create artificial rewards based on messages.

The Q-comm Learning algorithm
Agent i initial state: σi ← <S, ∅>
Loop (each agent):
  Select βi = <ai, mess_send> (Boltzmann, ε-greedy)
  Execute βi; observe reward Ri and next state σ'i = <S', mess_recd>
  RMi ← (Ri · mess_send) + (Ri · mess_recd)
  Invert reward: Ri ← -1 · Ri
  Qi(σi, βi) ← (1 - α) Qi(σi, βi) + α [Ri + RMi + γ max Qi(σ'i, ·)]
  σi ← σ'i, S ← S'
Until S ∈ Γ (a goal state)
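Below is a minimal sketch of one Q-comm step for a single agent, under the assumption that the message terms in RMi reduce to 0/1 indicators of whether a ping was sent or received; that reading, the hyperparameter values, and all names are mine, not the authors'.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)          # maps (agent_state, agent_action) -> value

def select_action(agent_state, actions):
    """Epsilon-greedy over augmented actions <basic action, message to send>."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(agent_state, u)])

def q_comm_update(sigma, u, r, sent, received, sigma_next, actions):
    """One update: invert the real reward, add the artificial message-based reward."""
    rm = r * sent + r * received            # artificial reward from messages
    r = -1.0 * r                            # play safe: invert the actual reward
    best_next = max(Q[(sigma_next, u2)] for u2 in actions)
    Q[(sigma, u)] = (1 - alpha) * Q[(sigma, u)] + alpha * (r + rm + gamma * best_next)
```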

Test Problem: Find the Winning Number (FWN). An n-digit array (number) controlled by n agents, e.g. 9 3 5 8. Each agent controls a digit. Actions: +1, -1, 0. Γi: list of « winning » numbers for agent i (unknown). Each num ∈ Γi gives equal payoff to agent i. Γ = Γ1 ∩ Γ2 ∩ … ∩ Γn. Γ contains a common « winning » number.
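A rough sketch of the FWN environment under some added assumptions (digits wrap around modulo 10, and any winning number pays a reward of 1); neither detail is stated on the slides.

```python
from typing import List, Sequence, Set

class FindWinningNumber:
    ACTIONS = (+1, -1, 0)

    def __init__(self, n_agents: int, winning_sets: Sequence[Set[int]]):
        self.n = n_agents
        self.winning_sets = winning_sets       # Gamma_i for each agent
        self.digits = [0] * n_agents           # each agent controls one digit

    def number(self) -> int:
        """The joint game state: the n digits read as one number."""
        return int("".join(str(d) for d in self.digits))

    def step(self, joint_action: Sequence[int]) -> List[float]:
        """Apply each agent's +1/-1/0 move and return per-agent rewards."""
        for i, a in enumerate(joint_action):
            self.digits[i] = (self.digits[i] + a) % 10
        num = self.number()
        return [1.0 if num in g else 0.0 for g in self.winning_sets]

# Example from the results slides: 3 agents with common winning number 119.
env = FindWinningNumber(3, [{2, 16, 119}, {68, 102, 119}, {37, 86, 119}])
```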

Results (1): 3 agent FWN. Γ1 = {2, 16, 119}, Γ2 = {68, 102, 119}, Γ3 = {37, 86, 119}.

Results (2): 3 agent FWN. Γ1 = {2, 16, 119}, Γ2 = {68, 102, 119}, Γ3 = {37, 86, 119}.

Results (3): Multiple Common Goals. Agents select one common goal.

Results (4): 4 agent FWN. Not all agents satisfied!

Summary of Results: Empirically, Q-comm learning finds the common goal. Works with multiple common goals. Agents coordinate equilibrium choice. Works with up to 3 agents. Doesn't always work for 4 or more agents.

Future work: Increase scalability by localising communication. Investigate how it can work for n ≥ 4. Analyse convergence.

Thank you! Your questions…