Mutually-guided Multi-agent Learning

Presentation transcript:

Mutually-guided Multi-agent Learning. Raghav Aras, Alain Dutech, François Charpillet (MAIA). June 2004.

Outline: a review of some multiagent Q-learning approaches; our approach for multiagent learning in a stochastic game; some preliminary results.

Multiagent Q-Learning (1). Q-Learning (single-agent learning): Q(st, at) ← (1 - α) Q(st, at) + α [Rt + γ maxa Q(st+1, a)]. Known to converge to optimal values. minimax-Q (for zero-sum, 2-player games): V1(s) ← maxP1 ∈ Π(A1) mina2 ∈ A2 Σa1 ∈ A1 P1(a1) Q1(s, (a1, a2)). Known to converge to optimal values.
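To make the tabular single-agent update concrete, here is a minimal sketch; the hyperparameter values and the names (alpha, gamma, q_update) are illustrative choices, not taken from the presentation.

```python
from collections import defaultdict

alpha = 0.1   # learning rate (the alpha in the slide's formula)
gamma = 0.9   # discount factor (the gamma in the slide's formula)
Q = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```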

Multiagent Q-Learning (2). Nash-Q learning (for n-agent, general-sum SGs): Qi(s, a1,..,an) ← (1 - α) Qi(s, a1,..,an) + α [Ri + γ NashQi(s')], where NashQi(s') = Qi(s', π1(s') π2(s') … πn(s')). Converges under strict conditions (existence, uniqueness of Nash equilibria).
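For illustration only, a simplified sketch of one Nash-Q step, restricted to two players and pure-strategy stage-game equilibria; this restriction and all names are my simplifications, whereas the actual algorithm handles mixed equilibria and n agents.

```python
import itertools
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = [defaultdict(float), defaultdict(float)]   # Q[i][(s, joint_action)] for agents 0 and 1

def pure_nash_values(s, actions0, actions1):
    """Return (v0, v1) of the first pure Nash equilibrium of the stage game at s."""
    for a0, a1 in itertools.product(actions0, actions1):
        best0 = all(Q[0][(s, (a0, a1))] >= Q[0][(s, (b, a1))] for b in actions0)
        best1 = all(Q[1][(s, (a0, a1))] >= Q[1][(s, (a0, b))] for b in actions1)
        if best0 and best1:
            return Q[0][(s, (a0, a1))], Q[1][(s, (a0, a1))]
    return 0.0, 0.0   # fallback when no pure equilibrium exists

def nash_q_update(s, joint_a, rewards, s_next, actions0, actions1):
    """Update both agents' Q-values toward reward plus discounted Nash value of s_next."""
    nash_vals = pure_nash_values(s_next, actions0, actions1)
    for i in (0, 1):
        Q[i][(s, joint_a)] = (1 - alpha) * Q[i][(s, joint_a)] + alpha * (
            rewards[i] + gamma * nash_vals[i]
        )
```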

Drawbacks of NashQ learning: coordination in the choice of Nash equilibrium; observability of all actions and all rewards; space complexity (each agent): n · |S| · |A|^n.

The problem that we treat… n-agent SG <S, A1..An, R1..Rn, P>: Ri: S x (A1 x … x An) → ℝ. P: S x (A1 x … x An) x S → {0,1} (deterministic). Γi ⊆ S (set of equally good goal states). Γ = Γ1 ∩ Γ2 ∩ … ∩ Γn, |Γ| ≥ 1 (at least one common goal state). An agent's payoff is the same in all its goal states. Agents' payoffs may be different in the common goal state.
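A minimal sketch of how this class of deterministic, goal-based stochastic games could be represented, assuming finite, enumerable sets; the field and method names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Set, Tuple

State = int
JointAction = Tuple[int, ...]          # one action index per agent

@dataclass
class GoalSG:
    states: Set[State]
    actions: Sequence[Sequence[int]]   # actions[i] = action set of agent i
    rewards: Sequence[Callable[[State, JointAction], float]]   # R_i
    transition: Callable[[State, JointAction], State]          # deterministic P
    goals: Sequence[FrozenSet[State]]  # Gamma_i: equally good goal states of agent i

    def common_goals(self) -> FrozenSet[State]:
        """Gamma = intersection of all agents' goal sets (assumed non-empty)."""
        common = self.goals[0]
        for g in self.goals[1:]:
            common &= g
        return common
```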

Our Interest. A more realistic assumption for SGs (actions and rewards of other agents hidden). Investigating « independent » learning in SGs (leading also to scalability). Using communication to forge cooperation. A single-agent learning algorithm giving maximum payoff to a maximum number of agents.

Communication in Our Approach. Agents send and receive 'ping messages'. Sending a message is an action. A ping message is an (n-1)-sized array of 0s and 1s and has no content. (Figure: example with 5 agents in which agent 1 sends a ping message that agents 2 and 3 receive.)

Communication based Q-Values. Agent state = <game state, message received>. Agent action = <basic action, message to send>. 2^(n-1) possible messages; Mi, agent i's message set. State set = S x Mi. Action set = Ai x Mi. Size of Q-value set: |S x Mi| x |Ai x Mi|. Agent policy πi: S x Mi → Ai x Mi.
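A small sketch of how the augmented state and action spaces could be enumerated, assuming ping messages are encoded as (n-1)-bit tuples, one bit per other agent; the function names are illustrative.

```python
from itertools import product

def message_set(n_agents: int):
    """All 2^(n-1) ping messages an agent can send: one bit per other agent."""
    return list(product((0, 1), repeat=n_agents - 1))

def augmented_spaces(game_states, basic_actions, n_agents):
    """Agent state = <game state, message received>; action = <basic action, message to send>."""
    msgs = message_set(n_agents)
    states = [(s, m) for s in game_states for m in msgs]
    actions = [(a, m) for a in basic_actions for m in msgs]
    return states, actions

# Example: 3 agents, game state = a 3-digit number, basic actions +1/-1/0.
states, actions = augmented_spaces(range(1000), (+1, -1, 0), n_agents=3)
# len(states) == 1000 * 4, len(actions) == 3 * 4  (2^(3-1) = 4 messages)
```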

What do we envisage the messages doing? Alert others of proximity to a goal state; discover the common goal state; enforce preference for the common goal state. Main principle of our algorithm: play safe by inverting actual rewards; create artificial rewards based on messages.

The Q-comm Learning algorithm
Agent i initial state: σi ← <S, ∅>
Loop (each agent):
  Select βi = <ai, mess_send> (Boltzmann, ε-greedy)
  Execute βi; observe reward Ri and next state σ'i = <S', mess_recd>
  RMi ← (Ri · mess_send) + (Ri · mess_recd)
  Invert reward: Ri ← -1 · Ri
  Qi(σi, βi) ← (1 - α) Qi(σi, βi) + α [Ri + RMi + γ max Qi(σ'i, ·)]
  σi ← σ'i, S ← S'
Until S ∈ Γ (a goal state)
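Below is a minimal sketch of one Q-comm step for a single agent, under the assumption that the message terms in RMi reduce to 0/1 indicators of whether a ping was sent or received; that reading, the hyperparameter values, and all names are mine, not the authors'.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)          # maps (agent_state, agent_action) -> value

def select_action(agent_state, actions):
    """Epsilon-greedy over augmented actions <basic action, message to send>."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(agent_state, u)])

def q_comm_update(sigma, u, r, sent, received, sigma_next, actions):
    """One update: invert the real reward, add the artificial message-based reward."""
    rm = r * sent + r * received            # artificial reward from messages
    r = -1.0 * r                            # play safe: invert the actual reward
    best_next = max(Q[(sigma_next, u2)] for u2 in actions)
    Q[(sigma, u)] = (1 - alpha) * Q[(sigma, u)] + alpha * (r + rm + gamma * best_next)
```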

Test Problem: Find the Winning Number (FWN). An n-digit array (number) controlled by n agents, e.g. 9 3 5 8. Each agent controls a digit. Actions: +1, -1, 0. Γi: list of « winning » numbers for agent i (unknown). Each num ∈ Γi gives equal payoff to agent i. Γ = Γ1 ∩ Γ2 ∩ … ∩ Γn. Γ contains a common « winning » number.
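A rough sketch of the FWN environment under some added assumptions (digits wrap around modulo 10, and any winning number pays a reward of 1); neither detail is stated on the slides.

```python
from typing import List, Sequence, Set

class FindWinningNumber:
    ACTIONS = (+1, -1, 0)

    def __init__(self, n_agents: int, winning_sets: Sequence[Set[int]]):
        self.n = n_agents
        self.winning_sets = winning_sets       # Gamma_i for each agent
        self.digits = [0] * n_agents           # each agent controls one digit

    def number(self) -> int:
        """The joint game state: the n digits read as one number."""
        return int("".join(str(d) for d in self.digits))

    def step(self, joint_action: Sequence[int]) -> List[float]:
        """Apply each agent's +1/-1/0 move and return per-agent rewards."""
        for i, a in enumerate(joint_action):
            self.digits[i] = (self.digits[i] + a) % 10
        num = self.number()
        return [1.0 if num in g else 0.0 for g in self.winning_sets]

# Example from the results slides: 3 agents with common winning number 119.
env = FindWinningNumber(3, [{2, 16, 119}, {68, 102, 119}, {37, 86, 119}])
```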

Results (1): 3 agent FWN. Γ1 = {2, 16, 119}, Γ2 = {68, 102, 119}, Γ3 = {37, 86, 119}.

Results (2): 3 agent FWN. Γ1 = {2, 16, 119}, Γ2 = {68, 102, 119}, Γ3 = {37, 86, 119}.

Results (3): Multiple Common Goals. Agents select one common goal.

Results (4): 4 agent FWN. Not all agents satisfied!

Summary of Results: Empirically, Q-comm learning finds the common goal. Works with multiple common goals. Agents coordinate equilibrium choice. Works with up to 3 agents. Doesn't always work for 4 or more agents.

Future work: Increase scalability by localising communication. Investigate how it can work for n ≥ 4. Analyse convergence.

Thank you! Your questions…