Correlated-Q Learning and Cyclic Equilibria in Markov Games (Haoqi Zhang)

Correlated-Q Learning (Greenwald and Hall, 2003)
Setting: general-sum Markov games
Goal: convergence (reach an equilibrium) and good payoffs
Means: CE-Q
Results: empirical convergence in experiments
Assumptions: observable rewards, an umpire for CE selection
Strong? Weak? What do you think?

Markov Games
State transitions depend only on the current state and the joint action
Q-values are defined over states and action vectors (one action per agent)
Deterministic actions that maximize every agent's reward do not always exist
Each agent plays the actions in an action profile with certain probabilities

Q-values
In the single-agent case, use Q-values to find the best action (argmax over actions)
In a Markov game, Nash-Q, CE-Q, etc. use the Q-values as the payoff entries of a stage game at each state and compute an equilibrium of that stage game
Each agent then plays its own part of the equilibrium, according to the equilibrium probabilities
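
A minimal sketch (not the authors' code) of this action-selection step for two agents: the Q-tables define a stage game at state s, an assumed external solver `solve_ce` returns a joint distribution over action pairs, and the umpire samples a joint action and reveals each component only to its agent.

```python
import numpy as np

def stage_game(Q1, Q2, s):
    """Payoff matrices of the stage game at state s; Q_i[s] has shape (|A1|, |A2|)."""
    return Q1[s], Q2[s]

def play(Q1, Q2, s, solve_ce, rng=np.random.default_rng()):
    A1, A2 = stage_game(Q1, Q2, s)
    sigma = solve_ce(A1, A2)          # hypothetical CE solver: joint distribution, shape (|A1|, |A2|)
    flat = rng.choice(sigma.size, p=sigma.ravel())
    a1, a2 = np.unravel_index(flat, sigma.shape)
    return a1, a2                     # the "umpire" reveals a_i only to agent i
```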

Nash equilibrium vs. correlated equilibrium
Nash equilibrium: a vector of independent probability distributions over actions; no player gains by deviating unilaterally, given that everyone else plays the equilibrium
Correlated equilibrium: a joint probability distribution over action profiles (e.g., a traffic light); no player gains by deviating unilaterally from its recommended action, given that the others follow their recommendations

Why CE?
- Easily computable with linear programming
- Can yield higher rewards than Nash equilibria
- No-regret algorithms converge to the set of CEs (Foster and Vohra)
- Actions are chosen independently, but conditioned on privately observed recommendations drawn from a commonly known joint distribution

LP to solve for CE
The CE constraints only define a feasible set; an objective is still needed to select a particular CE
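
For reference, a standard formulation of the CE constraints for a two-player stage game with payoff (Q-value) matrices Q_1, Q_2 and joint-probability variables sigma (my reconstruction, not copied from the slide):

```latex
\begin{align*}
\sum_{a_2} \sigma(a_1, a_2)\,\bigl[Q_1(a_1, a_2) - Q_1(a_1', a_2)\bigr] &\ge 0
  && \forall\, a_1, a_1' \in A_1,\\
\sum_{a_1} \sigma(a_1, a_2)\,\bigl[Q_2(a_1, a_2) - Q_2(a_1, a_2')\bigr] &\ge 0
  && \forall\, a_2, a_2' \in A_2,\\
\sum_{a_1, a_2} \sigma(a_1, a_2) = 1, \qquad \sigma(a_1, a_2) &\ge 0.
\end{align*}
```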

Multiple Equilibria
There can be many correlated equilibria (many more than Nash equilibria!)
Need a way to break ties
Tie-breaking can ensure the equilibrium value is the same (although maybe not the equilibrium policy)
4 variants (objectives sketched below):
–Utilitarian (uCE-Q): maximize the sum of the players' rewards
–Egalitarian (eCE-Q): maximize the minimum of the players' rewards
–Republican (rCE-Q): maximize the maximum of the players' rewards
–Libertarian (lCE-Q): maximize each individual player's own reward
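
A rough sketch (my notation, not the paper's code) of the four tie-breaking objectives, evaluated on a candidate joint distribution sigma (shape |A1| x |A2|) and the per-player stage-game payoffs Q1, Q2 at the current state:

```python
import numpy as np

def uCE(sigma, Q1, Q2):            # utilitarian: sum of expected rewards
    return np.sum(sigma * (Q1 + Q2))

def eCE(sigma, Q1, Q2):            # egalitarian: minimum expected reward
    return min(np.sum(sigma * Q1), np.sum(sigma * Q2))

def rCE(sigma, Q1, Q2):            # republican: maximum expected reward
    return max(np.sum(sigma * Q1), np.sum(sigma * Q2))

def lCE_for_player(sigma, Qi):     # libertarian: player i's own expected reward
    return np.sum(sigma * Qi)
```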

Experiments
3 grid games: both deterministic and nondeterministic equilibria exist; the Q-values converged over the iterations; uCE-Q, eCE-Q, and rCE-Q achieved the best scores (discount factor of 0.9)
Soccer game: zero-sum, with no deterministic equilibrium; uCE-Q (and the others) still converges

Where are we?
Some positive results, but with heavily enforced coordination
Problem: multiplicity of equilibria
Are these results useful for anything? Why should we care?

Cyclic Equilibria in Markov Games (Zinkevich, Greenwald, Littman)
Setting: general-sum Markov games
Negative result: Q-values alone are insufficient to guarantee convergence
Positive result: can often reach a cyclic equilibrium
Assumptions: offline computation (what happened to learning?)
How do we interpret these results? Why should we care?

Policy
Stationary policy: a fixed distribution over action vectors for each state
Non-stationary policy: a sequence of policies, one played at each iteration
Cyclic policy: a non-stationary policy that repeats in a cycle
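
An illustrative representation (my own sketch, not from the paper): a stationary policy maps each state to a single joint-action distribution, while a cyclic policy of period k cycles through k such stationary policies, one per iteration.

```python
import numpy as np

# stationary: pi[s] is a distribution over joint actions, shape (|A1|, |A2|)
stationary_policy = {s: np.full((2, 2), 0.25) for s in range(3)}

# cyclic with period k: at iteration t, play cycle[t % k]
cycle = [stationary_policy, stationary_policy, stationary_policy]  # period k = 3
def policy_at(t):
    return cycle[t % len(cycle)]
```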

No deterministic stationary equilibrium

NoSDE games (nasty)
A NoSDE (no stationary deterministic equilibrium) game is a turn-taking game with no deterministic stationary equilibrium policy
Every NoSDE game has a unique nondeterministic stationary equilibrium policy
Negative result: for any NoSDE game, there exists another NoSDE game (differing only in its rewards) with its own stationary equilibrium policy such that the Q-values of the two games are equal but the policies are different and the values are different
How do we interpret this?

Cyclic Equilibria
Cyclic correlated equilibrium: a cyclic policy that is a correlated equilibrium
CE condition: at every round of the cycle, playing according to the observed signal has higher value (based on the Q-values) than deviating
Value iteration can be used to derive a cyclic CE

Value Iteration (a sketch follows below)
1. Use the V's from the last iteration to update the current Q's
2. Compute a policy using the equilibrium selection rule f(Q)
3. Update the current V's using the current Q's
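
A minimal sketch of this multiagent value-iteration loop, assuming two players, a known transition model T and reward tables R1, R2 (all hypothetical names), and an equilibrium selection rule f such as uCE. Not the authors' code.

```python
import numpy as np

def value_iteration(states, T, R1, R2, f, gamma=0.9, iters=1000):
    # T[s][s2] and R_i[s] are arrays of shape (|A1|, |A2|): transition probability
    # from s to s2 under each joint action, and player i's reward at s.
    V1 = {s: 0.0 for s in states}
    V2 = {s: 0.0 for s in states}
    policies = []
    for _ in range(iters):
        Q1, Q2, pi = {}, {}, {}
        newV1, newV2 = {}, {}
        for s in states:
            # 1. Use V's from the last iteration to update the current Q's
            Q1[s] = R1[s] + gamma * sum(T[s][s2] * V1[s2] for s2 in states)
            Q2[s] = R2[s] + gamma * sum(T[s][s2] * V2[s2] for s2 in states)
            # 2. Compute the joint policy at s with the equilibrium selection rule f
            pi[s] = f(Q1[s], Q2[s])          # distribution over joint actions
            # 3. Update V's as the expected value under the selected equilibrium
            newV1[s] = float(np.sum(pi[s] * Q1[s]))
            newV2[s] = float(np.sum(pi[s] * Q2[s]))
        V1, V2 = newV1, newV2
        policies.append(pi)
    return policies, V1, V2
```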

GetCycle (sketched below)
1. Run value iteration
2. Find the earlier round whose values have minimal distance to the final round's values V_T (no more than maxCycles rounds away), where distance is the maximum difference over any state
3. Set the cyclic policy to the sequence of policies played between these two rounds
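
A rough sketch of GetCycle under the same assumptions as the value-iteration sketch above, where `values` is the list of per-iteration V tables recorded during value iteration (an assumption; the loop above would need to record them). My reconstruction, not the paper's code.

```python
def get_cycle(policies, values, max_cycles):
    last = len(values) - 1                    # index of the final round
    def dist(a, b):                           # max difference over any state
        return max(abs(a[s] - b[s]) for s in a)
    # find the earlier round (within max_cycles of the end) closest to the final round
    candidates = range(max(0, last - max_cycles), last)
    t_star = min(candidates, key=lambda t: dist(values[t], values[last]))
    # the cyclic policy is the sequence of policies played between t_star and last
    return policies[t_star:last]
```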

Facts (?)

Theorems
Theorem 2: Given selection rule uCE, every NoSDE game has a cyclic correlated equilibrium
Theorem 3: Given selection rule uCE, for any NoSDE game, value iteration does not converge to the optimal stationary policy
Theorem 4: For the game in Figure 1, no equilibrium selection rule f converges to the optimal stationary policy
Strong? Weak? Which one?

Experiments
Check convergence by tracking a distance metric across iterations
Check whether deterministic equilibria exist by enumerating every deterministic policy and running policy evaluation for 1000 iterations to estimate V and Q
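
A small sketch of the second check (same hypothetical model assumptions as the value-iteration sketch above): for a fixed deterministic joint policy `pi` mapping each state to a joint action (a1, a2), run policy evaluation for 1000 iterations to estimate V and Q.

```python
def policy_evaluation(states, T, R1, pi, gamma=0.9, iters=1000):
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Q[s] has shape (|A1|, |A2|); back up values through the transition model
        Q = {s: R1[s] + gamma * sum(T[s][s2] * V[s2] for s2 in states)
             for s in states}
        # V is the Q-value of the deterministic joint action chosen by pi
        V = {s: float(Q[s][pi[s]]) for s in states}
    return V, Q
```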

Results
Tested on turn-based games and small simultaneous-move games: a cyclic CE was reached with uCE almost always
With 10 states and 3 actions in simultaneous-move games, none of the techniques converged

What does this all mean?
How negative are the results? How do we feel about all the assumptions?
What are the positive results? Are they useful?
Why are cyclic equilibria interesting?
What about policy iteration?

The End :)