Convergence, Targeted Optimality, and Safety in Multiagent Learning

Presentation transcript:

Convergence, Targeted Optimality, and Safety in Multiagent Learning
Doran Chakraborty and Peter Stone
Learning Agents Research Group, University of Texas at Austin

Multiple Autonomous Agents
- Non-stationarity in the environment
- Overlap of spheres of influence
- Ignoring other agents and treating everything else as the environment can be sub-optimal
- Each agent needs to learn the behavior of the other agents in its sphere of influence
MULTIAGENT LEARNING

Multiagent Learning from a Game Theoretic Perspective
- Agents are involved in a repeated matrix game: an N-player, N-action matrix game
- On each time step, each agent sees only the joint action, and hence the payoffs for every agent
- Is there any way for an agent to ensure certain payoffs (if not the best possible) against unknown opponents?
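This interaction protocol is easy to picture as code. The sketch below (my own illustration, not from the slides) simulates a repeated two-player, two-action matrix game in which both agents observe the joint action, and hence every agent's payoff, after each step; the payoff table and the random strategies are placeholder assumptions.

```python
import random

# Hypothetical payoff table for a 2-player, 2-action repeated game:
# payoffs[(a0, a1)] = (payoff to player 0, payoff to player 1).
payoffs = {
    (0, 0): (1, 2), (0, 1): (0, 0),
    (1, 0): (0, 0), (1, 1): (2, 1),
}

def play_repeated_game(strategy0, strategy1, steps=5):
    """Each strategy maps the observed history of joint actions to an action."""
    history = []
    for t in range(steps):
        joint = (strategy0(history), strategy1(history))
        reward = payoffs[joint]
        # Both agents observe the joint action and hence every agent's payoff.
        history.append(joint)
        print(f"t={t}: joint action={joint}, payoffs={reward}")
    return history

# Two placeholder agents that ignore the history and act uniformly at random.
play_repeated_game(lambda h: random.randint(0, 1), lambda h: random.randint(0, 1))
```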

Contributions
- First multiagent learning algorithm, called Convergence with Model Learning and Safety (CMLeS), that in an n-player, n-action repeated game achieves:
  - Converges to a Nash equilibrium with probability 1 in self-play (Convergence)
  - Against a set of memory bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability (Targeted optimality)
    - Also holds for opponents which eventually become memory bounded
    - Achieves this in the best reported time complexity
  - Against every other unknown agent, ensures the maximin payoff (Safety)

High level overview of CMLeS
- Try to coordinate on a Nash equilibrium, assuming all other agents are CMLeS agents
  - If all agents are indeed CMLeS agents: Convergence achieved
  - If the other agents are not CMLeS agents: try to model them as memory bounded with maximum memory size Kmax (play MLeS)
    - If the other agents are memory bounded with memory size at most Kmax: Targeted optimality achieved
    - If the other agents are arbitrary: Safety achieved

A motivating example: Battle of the Sexes (payoffs listed as Alice, Bob)

              Bob
              B      S
  Alice  B   1,2    0,0
         S   0,0    2,1

- 3 Nash equilibria:
  - 2 in pure strategies
  - 1 in mixed strategies (each player goes to its preferred event 2/3 of the time)
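As a quick sanity check on the mixed equilibrium (my own addition, not part of the slides), the snippet below verifies that when Alice plays B with probability 1/3 and Bob plays B with probability 2/3, each player is indifferent between B and S, which is exactly what makes this profile a mixed strategy Nash equilibrium.

```python
# Battle of the Sexes payoffs as (Alice, Bob); actions indexed 0 = B, 1 = S.
payoffs = {
    (0, 0): (1, 2), (0, 1): (0, 0),
    (1, 0): (0, 0), (1, 1): (2, 1),
}

def expected_payoff(player, own_action, opponent_mix):
    """Expected payoff of `player` for a fixed own action against a mixed opponent."""
    total = 0.0
    for opp_action, prob in enumerate(opponent_mix):
        joint = (own_action, opp_action) if player == 0 else (opp_action, own_action)
        total += prob * payoffs[joint][player]
    return total

alice_mix = [1 / 3, 2 / 3]   # P(B), P(S) for Alice
bob_mix   = [2 / 3, 1 / 3]   # P(B), P(S) for Bob

# Each player should be indifferent between B and S at the mixed equilibrium.
print([expected_payoff(0, a, bob_mix) for a in (0, 1)])    # Alice: [2/3, 2/3]
print([expected_payoff(1, a, alice_mix) for a in (0, 1)])  # Bob:   [2/3, 2/3]
```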

Assume Hypothesis H0 = Bob is a CMLeS agent.

Under this hypothesis, Alice proceeds as follows:
- Use a Nash equilibrium solver to compute a Nash strategy for Alice and Bob. Suppose the agents choose the mixed strategy Nash equilibrium: Alice plays B with probability 1/3 while Bob plays B with probability 2/3.
- For p = 0, 1, 2, ..., compute a schedule of Np and εp (for example, N0 = 100 and ε0 = 0.1).
- Play your own part of the Nash strategy for Np episodes.
- Check whether any agent deviated by more than εp from its Nash strategy. For example, if Alice played a1 31% of the time and Bob played a1 65% of the time, no agent deviated.
  - If no agent deviated, move on to the next p and repeat.
  - If some agent deviated, check for consistency by signaling.
- Signaling, with C counting the apparent deviations so far:
  - C == 0: play a1 Kmax+1 times
  - C == 1: play a1 Kmax times followed by another random action apart from a1
  - C > 1: play a1 Kmax+1 times
  - then increment C.
- If the consistency check passes, the agents play according to a fixed behavior and resume the schedule; if it fails, as happens when Bob is actually a memory bounded agent rather than a CMLeS agent, reject H0 and play MLeS.
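The deviation test above boils down to comparing empirical action frequencies against the agreed Nash mixture. The sketch below illustrates only that check (the schedule values, the sampling agents, and the function names are assumptions, and the signaling phase is omitted).

```python
import random

def play_nash_and_check(nash_mix, sample_action, epsilon_p, n_p):
    """Play a Nash mixture for n_p episodes and test whether any agent's
    empirical action frequencies deviate from the mixture by more than epsilon_p.

    nash_mix: dict agent -> list of action probabilities (the agreed equilibrium)
    sample_action: function(agent) -> observed action of that agent this episode
    """
    counts = {agent: [0] * len(mix) for agent, mix in nash_mix.items()}
    for _ in range(n_p):
        for agent in nash_mix:
            counts[agent][sample_action(agent)] += 1
    for agent, mix in nash_mix.items():
        freqs = [c / n_p for c in counts[agent]]
        if any(abs(f - p) > epsilon_p for f, p in zip(freqs, mix)):
            return agent   # this agent appears to have deviated -> consistency check
    return None            # nobody deviated -> move on to the next p

# Illustration with the Battle of the Sexes mixture (actions 0 = B, 1 = S);
# both "agents" here simply sample their own part of the mixture.
nash_mix = {"Alice": [1 / 3, 2 / 3], "Bob": [2 / 3, 1 / 3]}
sampler = lambda agent: random.choices([0, 1], weights=nash_mix[agent])[0]
print(play_nash_and_check(nash_mix, sampler, epsilon_p=0.1, n_p=100))
```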

Contributions of CMLeS
- First MAL algorithm that, in an n-player, n-action repeated game:
  - Converges to a Nash equilibrium with probability 1 in self-play (Convergence)
  - Against a set of memory bounded counterparts of memory size at most Kmax, converges to playing close to the best response with very high probability (Targeted optimality)
    - Also holds for opponents which eventually become memory bounded
    - Achieves this in the best reported time complexity
  - Against every other unknown agent, eventually ensures safety (Safety)

How to play against memory bounded opponents?
- Play against a memory bounded opponent can be modeled as a Markov Decision Process (MDP) (Chakraborty and Stone, ECML'08)
- The adversary induces the MDP, hence it is known as an Adversary Induced MDP (AIM)
- The state space of the AIM is the set of all feasible joint histories of size K
- The transition and reward functions of the AIM are determined by the opponent's strategy
- Both K and the opponent's strategy are unknown and hence need to be figured out

Adversary Induced MDP (AIM): an example
- Assume Bob is a memory bounded opponent with K = 2, and at time t the joint history, i.e. the AIM state, is (B,B)(S,S)
- Alice plays action S, so the next state has the form (S,S)(S,?), where ? is Bob's action
- For a memory of (B,B)(S,S), Bob plays S with probability 0.3 and B with probability 0.7
  - With probability 0.3 the next state is (S,S)(S,S), with reward 2 for Alice
  - With probability 0.7 the next state is (S,S)(S,B), with reward 0 for Alice
- The optimal policy for this AIM is the optimal way of playing against Bob. How to achieve it? Use MLeS
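A minimal sketch of the AIM bookkeeping, assuming the state is simply the last K joint actions and the opponent model is built from empirical counts; the class and method names are illustrative, not the paper's.

```python
from collections import defaultdict, deque

class AdversaryInducedMDP:
    """Minimal AIM sketch: the state is the last K joint actions, and the
    transition model comes from empirical counts of the opponent's responses.
    (Rewards come straight from the stage-game payoff matrix and are omitted.)"""

    def __init__(self, k):
        self.k = k
        self.history = deque(maxlen=k)                        # current AIM state
        self.counts = defaultdict(lambda: defaultdict(int))   # state -> opponent action -> count

    def state(self):
        return tuple(self.history)

    def record(self, my_action, opp_action):
        """Observe one step of play: update the opponent model, then the state."""
        if len(self.history) == self.k:
            self.counts[self.state()][opp_action] += 1
        self.history.append((my_action, opp_action))

    def opponent_distribution(self, state):
        """Estimated probability of each opponent action at a given joint history."""
        c = self.counts[state]
        total = sum(c.values())
        return {a: n / total for a, n in c.items()} if total else {}

# Example matching the slides: K = 2, joint history (B,B)(S,S), then Alice plays S.
aim = AdversaryInducedMDP(k=2)
for joint in [("B", "B"), ("S", "S"), ("S", "S"), ("S", "B")]:
    aim.record(*joint)
print(aim.state())                                            # (('S', 'S'), ('S', 'B'))
print(aim.opponent_distribution((("B", "B"), ("S", "S"))))    # {'S': 1.0} from one sample
```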

Flowchart of MLeS
- At the start of episode t, compute the best estimate of K using the Find-K algorithm; let that estimate be k
- Is k a valid value?
  - YES: run RMax assuming that the true underlying AIM is of size k
  - NO: play the safety strategy
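Read as code, the flowchart is roughly the dispatch below; find_k, run_rmax, and play_safety_strategy are hypothetical placeholders for the corresponding components, and returning None stands in for Find-K declining to commit to any memory size.

```python
def mles_episode(find_k, run_rmax, play_safety_strategy, history, k_max):
    """One MLeS episode following the flowchart above (illustrative sketch only).

    find_k, run_rmax, and play_safety_strategy are supplied by the caller and are
    hypothetical stand-ins for the corresponding components of MLeS.
    """
    k = find_k(history, k_max)              # best current estimate of Bob's memory size
    if k is not None and k <= k_max:        # Find-K committed to a valid memory size
        # Plan in the AIM over joint histories of size k using R-Max.
        return run_rmax(history, k)
    # No valid Kmax-bounded model: fall back to the maximin (safety) strategy.
    return play_safety_strategy()

# Toy usage with trivial stand-ins, just to show the control flow.
act = mles_episode(find_k=lambda h, km: None,        # pretend no model fits yet
                   run_rmax=lambda h, k: "rmax-action",
                   play_safety_strategy=lambda: "maximin-action",
                   history=[], k_max=3)
print(act)                                           # -> maximin-action
```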

Find-K algorithm: figuring out the opponent memory size
- For every candidate memory size k in {0, 1, 2, ..., Kmax}, compute Δ_k^t: the amount of information lost by not modeling Bob as a (k+1) memory sized opponent as opposed to a k memory sized opponent
- Example values at time t: Δ_0^t = 0.4, Δ_1^t = 0.3, ..., Δ_K^t = 0.05, Δ_{K+1}^t = 0.001, ..., Δ_Kmax^t = 0.01
- Each Δ_k^t is compared against a corresponding threshold σ_k^t, for example σ_0^t = 0.0001, σ_1^t = 0.0002, ..., σ_K^t = 0.002, σ_{K+1}^t = 0.02, ..., σ_Kmax^t = 0.07
- By allotting a failure probability of δ/Kmax to each of these comparisons, Find-K picks K with probability at least 1 - δ
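One plausible way to turn these quantities into a choice of memory size is sketched below. The exact statistical test used by Find-K differs in its details, so the selection rule and the numeric inputs here are illustrative assumptions only.

```python
def find_k(deltas, sigmas):
    """Illustrative Find-K style selection (not the paper's exact test).

    deltas[k]: estimated information lost by modeling the opponent with memory k
               instead of k + 1; sigmas[k]: confidence threshold for that estimate.
    Returns the smallest k beyond which no larger model shows a significant gain,
    or None if even the largest candidate still looks too lossy.
    """
    k_max = len(deltas) - 1
    for k in range(k_max + 1):
        # Accept k if every larger model adds no statistically significant information.
        if all(deltas[j] <= sigmas[j] for j in range(k, k_max + 1)):
            return k
    return None

# Toy inputs loosely shaped like the slide's example (indices 0..Kmax).
deltas = [0.4, 0.3, 0.001, 0.001, 0.001]
sigmas = [0.0001, 0.0002, 0.002, 0.02, 0.07]
print(find_k(deltas, sigmas))   # -> 2 under these made-up numbers
```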

Theoretical properties of MLeS
- Find-K needs only a polynomial number of visits to every feasible joint history of size K to find the true opponent memory size K with probability at least 1 - δ
  - Polynomial in 1/δ and Kmax
- The overall time complexity of computing an ε-best response against a memory bounded opponent is then polynomial in the number of feasible joint histories of size K, in Kmax, 1/δ, and 1/ε
- For opponents which cannot be modeled as a Kmax memory bounded opponent, MLeS converges to the safety strategy with probability 1, in the limit

Conclusion and Future Work
- A new multiagent learning algorithm, CMLeS, that achieves:
  - Convergence
  - Targeted optimality against memory bounded adversaries, in the best reported time complexity
  - Safety
- Future work:
  - What if there is a mixed population of agents?
  - How to incorporate no-regret or bounded regret?
  - Agents in graphical games