Multiagent rational decision making: searching and learning for “good” strategies
Enrique Munoz de Cote
Dipartimento di Elettronica e Informazione

What is “good” multiagent learning?

The prescriptive non-cooperative agenda [Shoham et al. 07]
- We are interested in problems where an agent must interact in open environments populated by other agents. What is a “good” strategy in this situation? Can the monkey find a “good” strategy, or does it need to learn?
- View: a single-agent perspective of the multiagent problem.
- Environment dependent.

Multiagent Reinforcement Learning Framework
(Slide diagram: settings are organized along two axes — single agent vs. multiple agents, and known world (solving) vs. unknown world (learning). The single-agent side corresponds to MDPs and decision theory/planning; the multiagent side corresponds to matrix games and stochastic games.)

Game theory and multiagent learning: brief backgrounds
- Game theory: stochastic games; solution concepts
- Multiagent learning: solution concepts; relation to game theory

Stochastic games (SGs)
Game theory mathematically captures behaviour in strategic situations.
- SGs are good examples of how agents' behaviours depend on each other.
- Strategies represent the way agents behave.
- Strategies might change as a function of other agents' strategies.
(Grid-world figure: agents A and B, goal $.)
[backgrounds → game theory]

A computational example: the SG version of chicken [Hu & Wellman, 03]
- actions: U, D, R, L, X
- coin flip on collision
- semiwalls (50%)
- collision = -5; step cost = -1; goal = +100
- discount factor = 0.95
- both agents can reach the goal
(Grid-world figure: agents A and B, goal $.)
[backgrounds → game theory]
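A minimal sketch of the reward parameters listed above, for concreteness only; the grid layout, semiwall placement, and transition dynamics are not given in the transcript, and exactly how costs combine on a goal or collision step is an assumption here.

```python
# Sketch (not the authors' code) of the SG-chicken parameters from the slide.
from dataclasses import dataclass

@dataclass
class ChickenSGConfig:
    collision_penalty: float = -5.0   # assessed when both agents contend for a cell
    step_cost: float = -1.0
    goal_reward: float = +100.0
    gamma: float = 0.95               # discount factor
    semiwall_pass_prob: float = 0.5   # "semiwall": crossed with probability 0.5

ACTIONS = ("U", "D", "R", "L", "X")   # the five actions named on the slide

def step_reward(reached_goal: bool, collided: bool, cfg: ChickenSGConfig) -> float:
    """One agent's per-step reward under the slide's parameters (assumed additive)."""
    r = cfg.step_cost
    if collided:
        r += cfg.collision_penalty
    if reached_goal:
        r += cfg.goal_reward
    return r
```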

Strategies on the SG of chicken
Average expected reward of some strategy pairs: (88.3, 43.7); (43.7, 88.3); (66, 66); (43.7, 43.7); (38.7, 38.7); (83.6, 83.6)
(Grid-world figure: agents A and B, goal $.)
[backgrounds → game theory]

Solution concept: equilibria
Definition [equilibrium]. An n-tuple of strategies (one per agent) is an equilibrium point if no agent has an incentive to deviate unilaterally from its current strategy.
[backgrounds → game theory]
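One standard way to state that condition formally (Nash equilibrium), using notation not on the slide — u_i for agent i's expected utility and σ_{-i} for the other agents' strategies:

u_i(\sigma_i, \sigma_{-i}) \;\geq\; u_i(\sigma_i', \sigma_{-i}) \qquad \forall i,\ \forall \sigma_i'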

Equilibrium values
Average total reward at equilibrium:
- Nash: (88.3, 43.7) very imbalanced, inefficient; (43.7, 88.3) very imbalanced, inefficient; (53.6, 53.6) ½ mix, still inefficient
- Correlated: ([43.7, 88.3], [43.7, 88.3])
- Minimax: (43.7, 43.7)
- Friend: (38.7, 38.7)
Equilibria are computationally difficult to find in general.
(Grid-world figure: agents A and B, goal $.)
[backgrounds → game theory]

Repeated games
- What if agents are allowed to play multiple times?
- Strategies can be a function of the history of play, and can be randomized.
- A Nash equilibrium still exists.
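The transcript does not reproduce the slide's formulas; one standard payoff criterion for repeated games, used later in the talk under the name "average payoff", evaluates agent i's strategy by the long-run average of its stage rewards r_i^t:

u_i \;=\; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_i^t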

Computing strategies for repeated SGs
- Complete information: solve (exact or approximate solutions).
- Incomplete information: learn.
  The environment (as perceived by the agent) is not Markovian.
  Convergence is not guaranteed (exceptions: zero-sum and team games).
  Unwanted cycles and unpredicted behaviour appear.
There are algorithms for both solving and learning that use the same successive approximations to the Bellman equations to derive solution policies.

Learning equilibrium strategies in SGs
How can multiagent RL learn any of those strategies?
- Multiagent RL updates are based on the Bellman equations (just as in single-agent RL).
- A value iteration (VI) algorithm solves for the optimal Q function.
- Finding a solution via VI depends on the operator Eq{·} used to back up values.
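The slide's equations are not reproduced in the transcript; a common way to write the multiagent Bellman backup (Littman-style value iteration for Markov games, with \vec{a} the joint action and Eq{·} the equilibrium-value operator applied to the stage game at the next state) is:

Q(s, \vec{a}) \;=\; R(s, \vec{a}) + \gamma \sum_{s'} P(s' \mid s, \vec{a}) \, \mathrm{Eq}\{Q(s', \cdot)\}

Different choices of Eq{·} (max, minimax, Nash, correlated equilibrium, friend) give different multiagent VI and learning algorithms.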

Defining optimality
What is A's optimal strategy?
- the safest one
- the one that minimizes the opponent's reward
- the one that maximizes the opponent's reward
- the socially stable one
In an open environment, "optimal" is arguable and may be defined by several criteria.
(Grid-world figure: agents A and B, goal $.)

Defining optimality: our criteria
- Optimality: obtain close to the maximum utility against other best-response algorithms.
- Security: guarantee a minimum lower bound on utility.
- Simplicity: be intuitive to understand and implement.
- Adaptivity: learn how to behave optimally, and remain optimal (even if the environment changes).

Observation: reinforcement learning updates
- Q-learning converges to a best-response (BR) strategy in MDPs.
Definition [best response]. A best-response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy.
Example environment: only agents with fixed strategies.
- Observation 1: a learner's BR is optimal against fixed strategies.
- Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy.
[backgrounds → multiagent RL]
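For reference, the standard Q-learning update the slide appeals to (learning rate α, discount γ) is:

Q(s, a) \;\leftarrow\; Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

When the other agents play fixed strategies, the environment looks like an MDP from the learner's point of view, so this update converges to a best response against those strategies.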

Social Rewards
- Shaping rewards and intrinsic motivations
- Leader and follower strategies
- Open questions
Joint work with Monica Babes and Michael L. Littman.

Social rewards: hints from the brain
- We're smart, but evolution doesn't trust us to plan all that far ahead.
- Evolution programs us to want things likely to bring about what we need:
  taste -> nutrition; pleasure -> procreation; eye contact -> care; generosity -> cooperation
[Social Rewards → motivations]

Is cooperation "pleasurable"?
- An fMRI study during a repeated prisoner's dilemma showed that humans perceive "internal rewards" (activity in the brain's reward center): positive for mutual cooperation, negative for defection.
[Social Rewards → motivations]

Social rewards: a telescoping effect — guiding learners to better equilibria
- Objective: change the behavior of the learner by influencing its early experience.
  Shaping rewards [Ng et al., 99]; intrinsic motivation [Singh et al., 04]; social rewards.
[Social Rewards → snapshot]

Leader and follower reasoning [Littman and Stone, 01]
- Leaders: a leader strategy is able to guide a best-response learner. Assumption: the opponent will adapt to the leader's decisions.
- Followers: a best-response learner is a follower. Assumption: its behaviour doesn't hurt anybody.
In the example, A is a leader and B is a follower.
(Grid-world figure: agents A and B, goal $.)
[Social Rewards → introduction]

Leader strategies
- Assumption: the opponent is playing a best response.
Matrix game of chicken (rows: agent A, the leader; columns: agent B, the follower; payoffs are A, B):
            B: center   B: wall
A: center   -10, -10    1, -1
A: wall     -1, 1       0, 0
Leader fixed strategies:
- BR_B(wall) = center, so R_A(wall, center) = -1
- BR_B(center) = wall, so R_A(center, wall) = 1
[Social Rewards → introduction]
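A tiny illustrative sketch (not from the talk) of the leader's reasoning on the matrix above: for each fixed leader action, compute the follower's best response and the leader's resulting payoff.

```python
# Leader reasoning on the chicken matrix; payoffs are (agent A, agent B).
payoff = {
    ("center", "center"): (-10, -10),
    ("center", "wall"):   (1, -1),
    ("wall",   "center"): (-1, 1),
    ("wall",   "wall"):   (0, 0),
}
actions = ("center", "wall")

def follower_best_response(a_action):
    # B picks the column maximizing its own payoff against A's fixed action.
    return max(actions, key=lambda b: payoff[(a_action, b)][1])

for a in actions:
    b = follower_best_response(a)
    print(f"A plays {a!r}: BR_B = {b!r}, R_A = {payoff[(a, b)][0]}")
# A plays 'center': BR_B = 'wall',   R_A = 1   -> the leader commits to 'center'
# A plays 'wall':   BR_B = 'center', R_A = -1
```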

Leader mutual-advantage strategies
- Easy to state: compute the convex hull of payoffs.
- Easy to compute: compute attack and defence strategies; compute a mutual-advantage strategy; use the attack strategy as a threat against deviations.
The SG version of the prisoner's dilemma [Munoz de Cote and Littman, 2008]: the one-shot Nash outcome vs. the mutual-advantage Nash of the repeated game.
(Grid-world figure: agents A and B, goals $A and $B.)
[Social Rewards → introduction]

How can a learner also be a leader?
- We influence the best-response learner's early experience with special shaping rewards called "social rewards".
- The learner starts out as a leader.
- If the opponent is not a BR follower, the social shaping is washed away.
[Social Rewards → methodology]

Shaping based on potentials
- Idea: each state is assigned a potential Φ(s) [Ng et al., 1999].
- On each transition, the reward is augmented with the (discounted) difference in potential.
[Social Rewards → methodology]
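The slide's formula is not reproduced in the transcript; the standard potential-based shaping term from Ng et al. (1999), for a transition from s to s', is:

F(s, s') = \gamma \, \Phi(s') - \Phi(s), \qquad \tilde{r}(s, a, s') = r(s, a, s') + F(s, s')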

The Q+shaping algorithm
- Compute attack and defence strategies.
- Compute a mutual-advantage strategy: for repeated matrix games use the [Littman and Stone, 2003] algorithm; for repeated stochastic games use the [Munoz de Cote and Littman, 2008] algorithm.
- Compute the state values (potentials) for the mutual-advantage strategy.
- Initialize the Q-table with the potential-based function F(s, s'). (A sketch of this step follows below.)
Using the attack strategy as a threat against deviations teaches BR learners better mutual-advantage strategies.
Theorem [Wiewiora 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values.
Q+shaping's main objective is to lead or follow, as appropriate.
[Social Rewards → algorithm]
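A minimal sketch of the initialization step, assuming a tabular learner; `states`, `actions`, and `potential` (a hypothetical dict mapping each state to Φ(s), e.g. the state's value under the mutual-advantage strategy) are placeholders. It relies on Wiewiora's equivalence quoted on the slide: initializing Q(s, a) = Φ(s) has the same effect as adding the shaping term F(s, s') = γΦ(s') − Φ(s) during learning.

```python
# Sketch only: initialize a tabular Q-function from potentials (Wiewiora, 2003).
def init_q_with_potentials(states, actions, potential):
    """Return a Q-table with Q(s, a) = Phi(s) for every action a."""
    return {(s, a): potential[s] for s in states for a in actions}

# A best-response learner started from this Q-table behaves, early on, as if it
# were receiving the shaping rewards F(s, s') = gamma * Phi(s') - Phi(s); if the
# opponent is not a best-response follower, ordinary Q-learning updates wash the
# shaping away.
```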

A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
Joint work with Michael L. Littman.

Main result
Concretely, we address the following computational problem: given a repeated stochastic game, return in polynomial time a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game (specifically, one whose payoffs match the egalitarian point).
(Figure: convex hull of the average payoffs in the (v1, v2) plane, with the egalitarian line.)
[Repeated SG Nash algorithm → result]

How? (the short-story version)
- Compute minimax (security) strategies: solve two linear programming problems.
- The algorithm then searches for the point P with the highest egalitarian value.
(Figure: convex hull of a hypothetical SG in the (v1, v2) plane, the egalitarian line, and point P.)
[Repeated SG Nash algorithm → result]

How? (the search for point P) — folkEgal(U1, U2, ε)
- Compute R = friend_1, L = friend_2 and the attack_1, attack_2 strategies.
- Find the egalitarian point and its policy:
  if R is left of the egalitarian line: P = R
  else if L is right of the egalitarian line: P = L
  else: P = egalSearch(R, L, T)
(Figure: convex hulls of hypothetical SGs illustrating the cases P = R, P = L, and the search case.)
[Repeated SG Nash algorithm → result]
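A heavily hedged sketch of the egalSearch step — a guess at its structure, not the algorithm from [Munoz de Cote and Littman, 2008]. Assumptions: `evaluate(w)` is a hypothetical routine returning the average payoff pair (v1, v2) of the joint policy that places weight w on player 1's preferred extreme point R and weight 1 − w on player 2's preferred extreme point L, and the egalitarian line (after subtracting security values) is v1 = v2.

```python
def egal_search(evaluate, T):
    """Binary search, T iterations, for a balanced point on the R-L frontier."""
    lo, hi = 0.0, 1.0
    w, v = 0.5, evaluate(0.5)
    for _ in range(T):
        w = (lo + hi) / 2.0
        v = evaluate(w)
        if v[0] > v[1]:
            hi = w          # too favourable to player 1: shift weight toward L
        else:
            lo = w          # too favourable to player 2: shift weight toward R
    return w, v

# Toy example with a linear frontier between R = (88.3, 43.7) and L = (43.7, 88.3),
# the chicken payoffs from earlier in the talk:
R, L = (88.3, 43.7), (43.7, 88.3)
evaluate = lambda w: (w * R[0] + (1 - w) * L[0], w * R[1] + (1 - w) * L[1])
print(egal_search(evaluate, T=20))   # converges toward the balanced point (66, 66)
```

The number of search iterations T is what the complexity slide below refers to: the overall algorithm is polynomial as long as T is polynomially bounded.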

Complexity
- The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time; the algorithm is polynomial iff T is bounded by a polynomial.
- Result: running time polynomial in the discount-factor term 1 / (1 − γ) and in the approximation factor 1 / ε.
[Repeated SG Nash algorithm → result]

SG version of the PD game
(Grid-world figure: agents A and B, goals $A and $B.)
Algorithm — payoffs (Agents A, B) — outcome:
- security-VI — 46.5 — mutual defection
- friend-VI — 46 — mutual defection
- CE-VI — 46.5 — mutual defection
- folkEgal — 88.8 — mutual cooperation with threat of defection
[Repeated SG Nash algorithm → experiments]

Compromise game
(Grid-world figure: agents A and B, goals $A and $B.)
Algorithm — payoffs (Agents A, B) — outcome:
- security-VI — 0, 0 — attacker blocking goal
- friend-VI — -20 — mutual defection
- CE-VI — suboptimal waiting strategy
- folkEgal — 78.7 — mutual cooperation (w = 0.5) with threat of defection
[Repeated SG Nash algorithm → experiments]

Asymmetric game
(Grid-world figure: agents A and B, goals $A and $B.)
Algorithm — payoffs (Agents A, B) — outcome:
- security-VI — 0, 0 — attacker blocking goal
- friend-VI — -200 — mutual defection
- CE-VI — 32.1 — suboptimal mutual cooperation
- folkEgal — 37.2 — mutual cooperation with threat of defection
[Repeated SG Nash algorithm → experiments]

Thanks for your attention!