AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents. Vincent Conitzer.

Presentation transcript:

AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents
Vincent Conitzer and Tuomas Sandholm
Computer Science Department, Carnegie Mellon University

Learning in games
Two aspects of learning in games:
– Learning the game (or aspects of the game) itself
– Learning how the opponent is behaving
Many previous algorithms have interleaved these two aspects. This paper focuses solely on learning with respect to the opponent:
– It assumes that the game is known
– It assumes that an equilibrium can be computed

The setting
There are N players, each with their own set of possible actions.
There is a known stage game (matrix game) which the players play repeatedly:
– A mapping from action vectors to payoff vectors
Each round, the players decide on a distribution over their actions to play from (a mixed strategy).
The players have a long-term learning strategy.
– Special case: a stationary strategy (play from the same distribution every time)
A 2-player, 3-action stage game (row payoff, column payoff):
    1,2   2,1   6,0
    2,1   1,2   7,0
    0,6   0,7   5,5
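As a concrete illustration of this setting, here is a minimal sketch (the names and structure are my own, not from the paper) that encodes the stage game above as a pair of payoff matrices and samples one round of play from two mixed strategies:

```python
import numpy as np

# Payoffs of the 2-player, 3-action stage game above: entry [i, j] gives
# the row player's and column player's payoff for action pair (i, j).
payoffs_row = np.array([[1, 2, 6],
                        [2, 1, 7],
                        [0, 0, 5]])
payoffs_col = np.array([[2, 1, 0],
                        [1, 2, 0],
                        [6, 7, 5]])

rng = np.random.default_rng(0)

def play_round(sigma_row, sigma_col):
    """Sample one round: each player draws an action from their mixed strategy."""
    a = rng.choice(3, p=sigma_row)
    b = rng.choice(3, p=sigma_col)
    return (a, b), (payoffs_row[a, b], payoffs_col[a, b])

# Example: both players mix 50/50 over their first two actions.
print(play_round([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))
```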

How should a stage game be played?
Nash equilibrium:
– Every agent has a mixed strategy (a distribution over actions)
– Each agent’s mixed strategy is a best response to the others’
– Makes sense for infinitely rational agents
But: against a (less clever) opponent with a fixed mixed strategy, we could do better.
[Figure: in the stage game above, the unique Nash equilibrium has each player mixing 50%/50%/0% over their actions. Against a suboptimal opponent who plays 51%/49%/0% instead, the best response is a pure strategy (100% on a single action).]
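A minimal sketch of this observation, reusing the row player's payoff matrix from above (the function name is an assumption, not the paper's notation): against a fixed opponent mix, a pure best response can be read off by maximizing expected payoff.

```python
import numpy as np

payoffs_row = np.array([[1, 2, 6],
                        [2, 1, 7],
                        [0, 0, 5]])

def best_response(opponent_mix, my_payoffs=payoffs_row):
    """Pure-strategy best response of the row player to a fixed opponent mix."""
    expected = my_payoffs @ np.asarray(opponent_mix)  # expected payoff of each row action
    return int(np.argmax(expected))

# Against the equilibrium mix (50/50 on the first two columns) the row player's
# two equilibrium actions are tied; against a slightly suboptimal 51/49 mix,
# one pure action becomes strictly best.
print(best_response([0.5, 0.5, 0.0]))    # 0 (tied with action 1)
print(best_response([0.51, 0.49, 0.0]))  # 1 (strict best response)
```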

Objective: two properties
Our algorithm is designed to achieve two properties:
– Against opponents that (eventually) play from a stationary distribution, eventually play the best response (maximum exploitation against a fixed strategy; compare opponent modeling)
– In self-play (playing against other agents using the algorithm), the agents should converge to a Nash equilibrium

Closely related prior results
Regret matching [Hart & Mas-Colell 2000]:
– Regrets go to zero (which implies eventual best-responding to fixed strategies)
– But: in self-play, convergence only to correlated equilibrium (a relaxed version of Nash equilibrium)
WoLF-IGA [Bowling & Veloso 2002] achieves both properties, but only under the assumptions that:
– The game is two-player, two-action
– Players can observe each other’s mixed strategies (not just the played actions)
– Infinitesimally small step sizes can be used
AWESOME achieves both properties without any of these assumptions.

Introducing AWESOME
AWESOME stands for Adapt When Everybody is Stationary, Otherwise Move to Equilibrium.
The basic idea:
– Detect whether the other players are playing stationary strategies
– If so, try to play the best response
– Otherwise, restart completely and go back to the equilibrium strategy

AWESOME’s null hypotheses
AWESOME starts with the null hypothesis that everyone is playing the (precomputed) equilibrium.
If this is rejected, AWESOME switches to another null hypothesis: that all players are playing stationary strategies.
If this is rejected, AWESOME restarts completely.
The current hypothesis is evaluated every epoch:
– Epoch = a certain number of rounds
– Reject the equilibrium hypothesis if the actual distribution of actions is too far from the equilibrium
– Reject the stationarity hypothesis if the actual distribution changes too much between epochs
We will discuss how to reject hypotheses later.
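A hedged sketch of what these per-epoch tests might look like (the function names and the use of a max-norm distance are my own assumptions; the epsilon thresholds come from the schedule discussed later):

```python
import numpy as np

def empirical_distribution(actions, n_actions):
    """Fraction of the epoch's rounds in which each action was played."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    return counts / len(actions)

def reject_equilibrium(actions, eq_strategy, eps_e):
    """Reject if some action's frequency is more than eps_e from its equilibrium probability."""
    emp = empirical_distribution(actions, len(eq_strategy))
    return bool(np.max(np.abs(emp - np.asarray(eq_strategy))) > eps_e)

def reject_stationarity(prev_actions, curr_actions, n_actions, eps_s):
    """Reject if the empirical distribution moved more than eps_s between consecutive epochs."""
    prev = empirical_distribution(prev_actions, n_actions)
    curr = empirical_distribution(curr_actions, n_actions)
    return bool(np.max(np.abs(curr - prev)) > eps_s)
```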

What does AWESOME play?
While the equilibrium hypothesis is maintained, AWESOME plays its own equilibrium strategy.
– The point of the equilibrium hypothesis is that we do not move away from the (possibly mixed-strategy) equilibrium just because AWESOME starts playing (pure-strategy) best responses.
When the equilibrium hypothesis is rejected, AWESOME picks a random action to play.
Then, if another action appears to be (significantly) better against what the others were playing in the last epoch, AWESOME switches to that action.
– The significance threshold is necessary to prevent AWESOME from jumping around between equivalent actions, which could cause restarts.
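A small sketch of that switching rule (the name and signature are illustrative assumptions): the current action is kept unless some other action improves on it by more than a margin against the opponents' last-epoch empirical distribution.

```python
import numpy as np

def maybe_switch_action(current_action, my_payoffs, opponent_empirical, margin):
    """Switch actions only if some action beats the current one by more than
    `margin` against the opponents' last-epoch empirical distribution."""
    expected = np.asarray(my_payoffs) @ np.asarray(opponent_empirical)
    best = int(np.argmax(expected))
    if expected[best] > expected[current_action] + margin:
        return best
    return current_action
```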

A naïve approach
Say we were to apply the same test of the hypothesis every epoch:
– Same number of rounds every epoch
– If the observed distribution of actions deviates by more than epsilon from the hypothesized distribution, reject it
[Figure, per epoch: the distribution of the fraction of times action 1 is played, the hypothesized distribution, the bounds for accepting the hypothesis, and the probability of acceptance given that the hypothesis is true.]

Two problems with the naïve approach
Even if the hypothesis is true, there is a constant probability of rejecting it each epoch:
– By fluke, the actual distribution may look nothing like the hypothesized one, and over infinitely many epochs such a fluke eventually occurs
How do we distinguish the hypothesized distribution from one within epsilon of it?
– E.g., if another player plays almost the equilibrium strategy, we want to best-respond (presumably a pure strategy), not play the mixed equilibrium strategy

Solution
Let the epoch length increase while the test gets stronger (the observed distribution must get closer to the hypothesized distribution).
From one epoch to the next, the acceptable margin decreases, but the distribution of the observed frequencies becomes much narrower (more rounds per epoch), so the chance of acceptance actually increases.
If the chance of rejection decreases fast enough, then with nonzero probability we will never reject!

Final proof details
We define what constitutes a valid schedule for changing the epsilons (one for each hypothesis) and the number of rounds per epoch.
– The number of rounds must increase fast enough to give a nonzero probability of never restarting (if the hypothesis is true)
Chebyshev’s inequality then allows us to bound the probability of a restart in a given epoch.
This allows us to prove the paper’s main results:
Theorem. AWESOME (with a valid schedule) converges to a best response against (eventually) stationary opponents.
Theorem. AWESOME (with a valid schedule) converges to a Nash equilibrium in self-play.
– Interestingly, it is not always the pre-computed equilibrium!
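As a rough illustration (not the paper's exact statement): for an action with true probability p, empirical frequency \hat{p}_e over the N_e rounds of epoch e, and tolerance \epsilon_e, Chebyshev's inequality bounds the per-epoch rejection probability, and one way for a schedule to be "valid" in this sense is to keep these bounds summable to less than 1 (c is a constant absorbing the number of players and actions):

```latex
\Pr\!\left[\,\bigl|\hat{p}_e - p\bigr| > \epsilon_e\,\right]
  \;\le\; \frac{p(1-p)}{N_e\,\epsilon_e^{2}}
  \;\le\; \frac{1}{4\,N_e\,\epsilon_e^{2}},
\qquad\text{and it suffices that}\qquad
\sum_{e \ge 1} \frac{c}{4\,N_e\,\epsilon_e^{2}} \;<\; 1 ,
```

so that, if the hypothesis is true, the probability that no epoch ever triggers a (spurious) restart is strictly positive.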

The algorithm
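A hedged, high-level sketch of the control loop described on the preceding slides, reusing the maybe_switch_action helper from above. All names, the max-norm tests, and the use of eps_s as the significance margin are illustrative assumptions, not the paper's pseudocode.

```python
import numpy as np

def one_hot(i, n):
    """Pure strategy: all probability mass on action i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v

def awesome_loop(my_payoffs, my_eq, opp_eq, schedule, play_epoch, rng):
    """Illustrative AWESOME-style control flow.
    schedule: list of (rounds, eps_e, eps_s) tuples, one per epoch;
    play_epoch(my_mix, rounds) -> array of the opponent's observed actions.
    maybe_switch_action is the significant-improvement rule sketched earlier."""
    while True:                                          # each pass is one restart
        eq_hypothesis = True                             # hypothesis 1: everyone plays the equilibrium
        prev_dist, action = None, None
        for rounds, eps_e, eps_s in schedule:
            my_mix = my_eq if eq_hypothesis else one_hot(action, len(my_eq))
            opp_actions = play_epoch(my_mix, rounds)
            dist = np.bincount(opp_actions, minlength=len(opp_eq)) / rounds
            if eq_hypothesis:
                if np.max(np.abs(dist - opp_eq)) > eps_e:
                    eq_hypothesis = False                # equilibrium hypothesis rejected
                    action = int(rng.integers(len(my_eq)))  # pick a random action to play
            else:
                if np.max(np.abs(dist - prev_dist)) > eps_s:
                    break                                # stationarity hypothesis rejected: restart
                action = maybe_switch_action(action, my_payoffs, dist, margin=eps_s)
            prev_dist = dist
```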

Summary
AWESOME is (to our knowledge) the first algorithm for learning in general repeated games that:
– Converges to a best response against (eventually) stationary opponents
– Converges to a Nash equilibrium in self-play
Basic idea: try to adapt (best-respond) when everybody appears to be playing stationary strategies, but otherwise go back to the equilibrium.
AWESOME achieves this by testing various hypotheses each epoch of rounds.
Convergence can be proved for carefully constructed schedules that simultaneously increase
– the strength of the test
– the number of rounds per epoch

Future research
Speed of convergence:
– Does AWESOME converge fast? For which schedules of increasing the number of rounds per epoch and the strength of the test?
– Can it be changed to converge faster?
Does AWESOME have additional properties?
– For example, the basic idea seems fairly “safe” in zero-sum games
– Can code be added to the algorithm skeleton to obtain other properties?
Fewer assumptions:
– Can we integrate learning the structure of the game?
Can AWESOME be simplified?
– E.g., not having to compute a Nash equilibrium
Are the proof techniques we used useful elsewhere?

Thank you for your attention!