Reinforcement Learning to Play an Optimal Nash Equilibrium in Coordination Markov Games XiaoFeng Wang and Tuomas Sandholm Carnegie Mellon University

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Coordination Games
• Coordination games:
– A coordination game typically possesses multiple Nash equilibria, some of which may be Pareto dominated by others.
– Assumption: players (self-interested agents) prefer Nash equilibria to any other steady state (for example, a best-response loop).
• Objective: play a Nash equilibrium that is not Pareto dominated by any other Nash equilibrium.
• Why are coordination games important?
– Whenever an individual agent cannot achieve its goal without interacting with others, coordination problems can arise.
– Studying coordination games helps us understand how to achieve win-win outcomes in interactions and avoid getting stuck in undesirable equilibria.
– Examples: team games, battle-of-the-sexes, and minimum-effort games.

Team Games
• Team games:
– In a team game, all agents receive the same expected reward.
– Team games are the simplest form of coordination games.
• Why are team games important?
– A team game can have multiple Nash equilibria, only some of which are optimal. This captures the essential properties of a broad class of coordination games, so studying team games gives us an easy starting point without losing important generality.

Coordination Markov Games
• Markov decision process:
– Models the environment as a set of states S. A decision-maker (agent) drives the state transitions to maximize the sum of its discounted long-term payoffs.
• A coordination Markov game:
– A combination of an MDP and coordination games: a set of self-interested agents choose a joint action a ∈ A that determines the state transition, each trying to maximize its own profit. Example: team Markov games.
• Relation between Markov games and repeated stage games:
– A joint Q-function maps a state–joint-action pair (s, a) to the tuple of discounted long-term rewards the individual agents receive by taking joint action a at state s and then following a joint strategy π.
– Q(s, ·) can be viewed as a stage game in which agent i receives payoff Q_i(s, a) (a component of the tuple Q(s, a)) when joint action a is taken by all agents at state s. We call such a game a state game.
– A subgame-perfect Nash equilibrium (SPNE) of a coordination Markov game is composed of the Nash equilibria of a sequence of coordination state games.

Reinforcement Learning (RL)
• Objective of reinforcement learning:
– Find a strategy π : S → A that maximizes the agent's discounted long-term payoffs without knowledge of the environment model (reward structure and transition probabilities).
• Model-based reinforcement learning:
– Learn the reward structure and transition probabilities, then compute the Q-function from them (see the sketch below).
• Model-free reinforcement learning:
– Learn the Q-function directly.
• Learning policy:
– Interleave learning with execution of the learnt policy.
– A GLIE (greedy in the limit with infinite exploration) policy guarantees convergence to an optimal policy in a single-agent MDP.
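To make the model-based idea concrete, here is a minimal single-agent sketch (my own illustration, not code from the paper): the learner keeps transition counts and reward sums, and recomputes Q by value iteration on the estimated model.

```python
import numpy as np

class ModelBasedLearner:
    """Tabular model-based RL: estimate P(s'|s,a) and R(s,a) from samples,
    then compute Q by value iteration on the estimated model."""

    def __init__(self, n_states, n_actions, gamma=0.95):
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        self.counts = np.zeros((n_states, n_actions, n_states))  # transition counts
        self.reward_sum = np.zeros((n_states, n_actions))        # running reward sums
        self.Q = np.zeros((n_states, n_actions))

    def observe(self, s, a, r, s_next):
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a] += r

    def update_q(self, sweeps=100):
        visits = self.counts.sum(axis=2)            # N(s, a)
        safe = np.maximum(visits, 1)
        P_hat = self.counts / safe[:, :, None]      # estimated transition probabilities
        R_hat = self.reward_sum / safe              # estimated mean rewards
        for _ in range(sweeps):                     # value iteration on the model
            V = self.Q.max(axis=1)
            self.Q = R_hat + self.gamma * P_hat @ V
        return self.Q

# Usage sketch: feed observed (s, a, r, s') transitions, then read off Q.
learner = ModelBasedLearner(n_states=3, n_actions=2)
learner.observe(0, 1, 1.0, 2)
print(learner.update_q()[0])
```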

RL in a Coordination Markov Game
• Objective:
– Without knowing the game structure, each agent i tries to find an optimal individual strategy π_i : S → A_i that maximizes the sum of its discounted long-term payoffs.
• Difficulties:
– The two layers of learning (learning the game structure and learning the strategy) are interdependent in a general Markov game: on one hand, the strategy is determined from the Q-function; on the other hand, the Q-function is learnt with respect to the joint strategy the agents take.
• RL in team Markov games:
– Team Markov games simplify the learning problem: the game structure can be learnt off-policy, and coordination is learnt over the individual state games.
– In a team Markov game, combining the individual agents' optimal policies yields an optimal Nash equilibrium of the game.
– Although simple, this is trickier than it appears.

Research Issues
• How to play an optimal Nash equilibrium in an unknown team Markov game?
• How to extend the results to a more general category of coordination stage games and Markov games?

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Setting:
• Agents make decisions independently and concurrently.
• There is no communication between agents.
• Agents independently receive reward signals with the same expected values.
• The environment model is unknown.
• Agents' actions are fully observable.
• Objective: find an optimal joint policy π* : S → A_1 × … × A_n that maximizes the sum of discounted long-term rewards.

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Coordination over a known game
• A team game may have multiple optimal NE. Without coordination, agents do not know how to play.
[Payoff matrix omitted: Claus and Boutilier's stage game, row actions A0–A2, column actions B0–B2]
• Solutions:
– Lexicographic conventions (Boutilier). Problem: the mechanism designer is sometimes unable or unwilling to impose an ordering.
– Learning: each agent treats the others as nonstrategic players and best responds to the empirical distribution of their previous plays, e.g., fictitious play or adaptive play. Problem: the learning process may converge to a sub-optimal NE, usually a risk-dominant NE (see the sketch below).
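A small self-contained illustration of this failure mode, with hypothetical stag-hunt-style payoffs of my own choosing (not the Claus–Boutilier matrix): two fictitious-play learners starting from uniform beliefs lock onto the risk-dominant but Pareto-dominated equilibrium.

```python
import numpy as np

# Symmetric 2x2 coordination game, hypothetical payoffs:
# action 0 = "stag", action 1 = "hare".
# (stag, stag) -> 5 is the Pareto-optimal NE; (hare, hare) -> 3 is risk dominant.
payoff = np.array([[5.0, 0.0],
                   [3.0, 3.0]])    # row: my action, column: opponent's action

counts = [np.ones(2), np.ones(2)]  # each agent's counts of the other's past actions

for t in range(50):
    actions = []
    for i in (0, 1):
        belief = counts[i] / counts[i].sum()             # empirical opponent distribution
        actions.append(int(np.argmax(payoff @ belief)))  # myopic best response
    counts[0][actions[1]] += 1                           # agent 0 observes agent 1
    counts[1][actions[0]] += 1

print("joint play after 50 rounds:", actions)   # [1, 1]: the sub-optimal, risk-dominant NE
```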

Coordination over an unknown game
• An unknown game structure and noisy payoffs make coordination even more difficult.
– Because they independently receive noisy rewards, agents may hold different views of the game at any particular moment. In this case, even a lexicographic convention does not work.
[Figure omitted: agent A's and agent B's current estimates of the payoff matrix, actions A0–A2 × B0–B2]

Problems
• Against a known game:
– By solving the game, agents can identify all the NE but still may not know how to play.
– By myopic play (learning), agents can learn to play a consistent NE, which however may not be optimal.
• Against an unknown game:
– Agents might not be able to identify the optimal NE before the game-structure estimate fully converges.

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Optimal Adaptive Learning
• Basic ideas:
– Over a known game: eliminate the sub-optimal NE and then use myopic play (learning) to learn how to play.
– Over an unknown game: estimate the NE of the game before the game structure converges; interleave learning of coordination with learning of the game structure.
• Learning layers:
– Learning of coordination: biased adaptive play against virtual games.
– Learning of game structure: construct virtual games with an ε-bound on top of a model-based RL algorithm.

Virtual games
• A virtual game (VG) is derived from a team state game Q(s, ·) as follows:
– If a is an optimal NE in Q(s, ·), then VG(s, a) = 1; otherwise VG(s, a) = 0 (see the sketch below).
• Virtual games eliminate all the strictly sub-optimal NE of the original game. This is nontrivial when there are more than two players.
[Figure omitted: a team stage game and its virtual game, actions A0–A2 × B0–B2]
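A minimal sketch of the construction on a known game (my own example matrix): in a team state game, the joint actions attaining the maximum of Q(s, ·) are exactly the optimal NE, so the VG marks them with 1 and zeroes out everything else.

```python
import numpy as np

def virtual_game(Q_s, tol=1e-9):
    """VG(s, a) = 1 iff joint action a is optimal in the team state game Q(s, .)."""
    return (Q_s >= Q_s.max() - tol).astype(float)

# Hypothetical 2-agent team state game (rows: agent A's actions, columns: agent B's)
# with two optimal NE, (A0, B0) and (A1, B1), and one sub-optimal NE, (A2, B2).
Q_s = np.array([[10.0,  0.0, 0.0],
                [ 0.0, 10.0, 0.0],
                [ 0.0,  0.0, 5.0]])
print(virtual_game(Q_s))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 0.]]   <- the sub-optimal NE (A2, B2) has been eliminated
```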

Adaptive Play
• Adaptive play (AP):
– Each agent has a limited memory holding the m most recent plays it has observed.
– To choose an action, agent i randomly draws k samples (without replacement) from its memory to build an empirical model of the others' joint strategy. For example, if a reduced joint action profile a_{-i} (all individual actions except i's) appears K(a_{-i}) times in the samples, agent i treats the probability of that profile as K(a_{-i})/k.
– Agent i then chooses an action that best responds to this distribution (see the sketch below).
• Previous work (Peyton Young) shows that AP converges to a strict NE in any weakly acyclic game.
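A minimal 2-player sketch of one adaptive-play action choice (my own code, with a hypothetical payoff matrix and history): sample k of the m remembered joint plays, form the empirical distribution of the opponent's action, and best respond.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def adaptive_play_action(history, payoff_i, k):
    """history: list of the m most recent joint actions (a_i, a_minus_i);
    payoff_i[a_i, a_minus_i]: agent i's payoff matrix; k: sample size (k <= m)."""
    sample_idx = rng.choice(len(history), size=k, replace=False)
    opponent_counts = Counter(history[j][1] for j in sample_idx)
    n_actions = payoff_i.shape[0]
    # Expected payoff of each of my actions against the sampled empirical distribution.
    expected = np.zeros(n_actions)
    for a_minus_i, count in opponent_counts.items():
        expected += payoff_i[:, a_minus_i] * (count / k)
    best = np.flatnonzero(expected == expected.max())
    return int(rng.choice(best))    # plain AP: randomize over best responses

# Example: memory of m = 6 joint plays, sample k = 3 of them.
payoff = np.array([[10.0, 0.0], [0.0, 5.0]])
history = [(0, 0), (0, 0), (1, 1), (0, 0), (1, 1), (0, 0)]
print(adaptive_play_action(history, payoff, k=3))
```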

Weakly Acyclic Games and Biased Set
• Weakly acyclic games (WAG):
– In a weakly acyclic game, there exists a best-response path from any strategy profile to a strict NE.
– Many virtual games are WAGs.
• However, not all VGs are WAGs:
– Some VGs have only weak NE, which do not constitute absorbing states.
• Weakly acyclic games with respect to a biased set (WAGB):
– A game in which there exist best-response paths from any strategy profile to an NE in a designated set D (the biased set).
[Figure omitted: two example virtual games, actions A0–A2 × B0–B2]

Biased Adaptive Play
• Biased adaptive play (BAP):
– Similar to AP, except that an agent biases its action selection when it detects that it is playing an NE in the biased set.
• Biased rules:
– If agent i's k samples all contain the same a_{-i}, and that a_{-i} is part of at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response was B0, B sticks to B0 (see the sketch below).
• Biased adaptive play guarantees convergence to an optimal NE for any VG constructed from a team game, when the biased set contains all the optimal NE.
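A sketch of the extra check BAP adds on top of AP (my own 2-player code, under the same notation as the AP sketch above): if every sampled play shows the same opponent action and that action belongs to some NE in the biased set D, repeat the most recent best response; otherwise fall back to plain AP.

```python
def bap_choose(sampled_opponent_actions, biased_set_D, most_recent_best_response,
               plain_ap_action):
    """Biased rule on top of adaptive play (2-player sketch).
    sampled_opponent_actions: opponent actions appearing in agent i's k samples;
    biased_set_D: joint actions (a_i, a_minus_i) in the biased set;
    most_recent_best_response: agent i's most recent best response to that opponent action;
    plain_ap_action: the action plain AP would pick (a random best response)."""
    distinct = set(sampled_opponent_actions)
    if len(distinct) == 1:
        a_minus_i = distinct.pop()
        # Bias only if the observed opponent action is part of some NE in D.
        if any(ne[1] == a_minus_i for ne in biased_set_D):
            return most_recent_best_response    # stick with the recent best response
    return plain_ap_action                      # otherwise behave like plain AP

# Example: all 3 samples show the opponent playing 0, and (0, 0) is in D.
print(bap_choose([0, 0, 0], biased_set_D={(0, 0), (1, 1)},
                 most_recent_best_response=0, plain_ap_action=1))   # -> 0
```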

Construct VG over an unknown game
• Basic ideas:
– Use a slowly decreasing bound (the ε-bound) to find all optimal NE. Specifically, at state s and time t, a joint action a is ε-optimal for the state game if Q_t(s, a) + ε_t ≥ max_{a'} Q_t(s, a'). A virtual game VG_t is constructed over these ε-optimal joint actions. If lim_{t→∞} ε_t = 0 and ε_t decreases more slowly than the Q-function converges, then VG_t converges to VG.
– The construction of the ε-bound depends on the RL algorithm used to learn the game structure. For a model-based reinforcement learning algorithm, we prove that the bound ε_t = N^(b−0.5), for any 0 < b < 0.5, meets this condition, where N is the minimal number of samples collected up to time t (see the sketch below).
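A sketch of building VG_t from a noisy Q estimate (my own code and example numbers): ε_t = N^(b−0.5), and any joint action within ε_t of the state's current maximum is treated as optimal.

```python
import numpy as np

def epsilon_bound(N, b=0.25):
    """epsilon_t = N**(b - 0.5) for 0 < b < 0.5, where N is the minimal number
    of samples collected so far (N >= 1); decays to 0, but more slowly than 1/sqrt(N)."""
    return float(N) ** (b - 0.5)

def virtual_game_t(Q_s_t, N):
    """VG_t(s, a) = 1 for every joint action a that is epsilon-optimal at state s."""
    eps = epsilon_bound(max(N, 1))
    return (Q_s_t + eps >= Q_s_t.max()).astype(float)

# Hypothetical noisy estimate of a team state game after N = 400 samples.
Q_s_t = np.array([[9.8, 0.3, 0.1],
                  [0.2, 9.9, 0.0],
                  [0.1, 0.4, 5.1]])
print(epsilon_bound(400))          # 400**(-0.25) ~ 0.224
print(virtual_game_t(Q_s_t, 400))  # keeps (A0,B0) and (A1,B1); excludes (A2,B2)
```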

The Algorithm
• Learning of coordination:
– For each state, construct VG_t from the ε-optimal joint actions.
– Following a GLIE learning policy, use BAP to choose best-response actions over VG_t with the exploitation probability.
• Learning of game structure:
– Use model-based RL to update the Q-function.
– Update the ε-bound from the minimal number of samples, and find the ε-optimal actions using that bound.

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Flowchart of the Proof
• Theorem 1: BAP converges over a WAGB.
• Theorem 3: BAP with GLIE converges over a WAGB.
• Lemma 2: nonstationary Markov chains.
• Lemma 4: any VG is a WAGB.
• Theorem 5: convergence rate of the model-based RL.
• Theorem 6: the VG can be learnt with the ε-bound w.p. 1.
• Main Theorem: OAL converges to an optimal NE w.p. 1.

Model BAP as a Markov chain
• Stationary Markov chain model:
– State: an initial state consists of the m initial joint actions the agents observed, h_0 = (a^1, a^2, …, a^m). The other states are defined inductively: the successor h' of a state h is obtained by dropping the oldest joint action and inserting the newly observed joint action at the leftmost position of the tuple.
– Absorbing states: (a, a, …, a) is an individual absorbing state if a ∈ D or a is a strict NE. All individual absorbing states are clustered into one unique absorbing state.
– Transition: the probability p_{h,h'} that state h transits to h' is positive if and only if the leftmost joint action a = (a_1, a_2, …, a_n) in h' is composed of individual actions a_i that each best respond to some set of k samples drawn from h. Since the distribution an agent uses to sample its memory does not depend on time, the transition probabilities between states do not change over time; hence the Markov chain is stationary.

Convergence over a known game
• Theorem 1: Let L(a) be the length of the shortest best-response path from joint action a to an NE in D, and let L_G = max_a L(a). If m ≥ k(L_G + 2), BAP over a WAGB converges to either an NE in D or a strict NE w.p. 1.
• Nonstationary Markov chain model:
– Under a GLIE learning policy, at any moment an agent has some probability of experimenting (exploring actions other than the estimated best response), and this exploration probability diminishes with time. We can therefore model BAP with GLIE over a WAGB as a nonstationary Markov chain with transition matrix P_t. Let P be the transition matrix of the stationary Markov chain for BAP over the same WAGB. GLIE guarantees that P_t → P as t → ∞.
• In the stationary Markov chain model there is only one absorbing state (composed of several individual absorbing states). Theorem 1 says that, given m ≥ k(L_G + 2), such a Markov chain is ergodic with a unique stationary distribution. Using nonstationary Markov chain theory, we obtain:
• Theorem 3: With m ≥ k(L_G + 2), BAP with GLIE converges to either an NE in D or a strict NE w.p. 1.

Determine the length of best-response paths
• In a team game, L_G is no more than n (the number of agents). The figure illustrates this: each box represents an individual action of an agent, and the marked boxes represent individual actions contained in an NE. Starting from a joint action whose first n' components already match an NE, the remaining n − n' agents can move the joint action to that NE by switching their individual actions one after another; each switch is a best response given that the others stick to their individual actions.
[Figure omitted: a row of n boxes, the first n' forming an NE prefix, the rest non-NE individual actions]
• Lemma 4: The VG of any team game is a WAGB w.r.t. the set of optimal NE, with L_VG ≤ n.

Learning the virtual games
• First, we assess the convergence rate of the model-based RL algorithm.
• Then, we construct the sufficient condition on the ε-bound from the convergence-rate result.

Main Theorem
• Theorem 7: In any team Markov game among n agents, if 1) m ≥ k(n + 1) and 2) the ε-bound satisfies Lemma 6, then the OAL algorithm converges to an optimal NE w.p. 1.
• General idea of the proof:
– By Lemma 6, the probability of the event E that VG_t = VG for the rest of play after time t converges to 1 as t → ∞.
– Starting from some time t', conditioned on E, the agents play BAP with GLIE over a known game, which converges to an optimal NE w.p. 1 by Theorem 3.
– Combining these two convergence processes gives the convergence result.

Example: 2-agent game
[Figure omitted: payoff matrix with actions A0–A2 × B0–B2]

Example: 3-agent game
[Figure omitted: payoff matrix with agent A's actions A1–A3 as rows and the joint actions B1C1–B3C3 of agents B and C as columns]

Example: Multiple stage games

Outline: Introduction • Settings • Coordination Difficulties • Optimal Adaptive Learning • Convergence Proof • Extension: Beyond Self-Play • Extension: Beyond Team Games • Conclusion and Future Work

Extension: general ideas
• Classic game theory tells us how to solve a game, i.e., how to identify the fixed points of introspection. It says much less about how to play a game.
• Standard ways to play a game:
– Solve the game first and play an NE strategy (strategic play). Problems: 1) with multiple NE, agents may still not know how to play; 2) it may be computationally expensive.
– Assume the others take stationary strategies and best respond to that belief (myopic play). Problem: myopic strategies may lead agents to play a sub-optimal (Pareto-dominated) NE.
• The idea generalized from OAL: partially myopic and partially strategic (PMPS) play.
– Biased action selection: strategically lead the others to play a stable strategy.
– Virtual games: compute the NE first, then eliminate the sub-optimal ones.
– Adaptive play: myopically adjust the best-response strategy w.r.t. the agent's observations.

Extension: Beyond Self-Play
• Problem:
– OAL only guarantees convergence to an optimal NE in self-play, i.e., when all players are OAL agents. Can agents find optimal coordination when only some of them play OAL? Consider the simplest case: two agents, one a JAL or IL player (Claus and Boutilier 98) and the other an OAL player.
• A straightforward way to enforce optimal coordination:
– Make one of the two players an "opinionated" leader who leads the play, while the other learns.
– If the other player is a JAL or IL player, convergence to an optimal NE is guaranteed.
– What if the other player is also a leader? More seriously, how should the leader play if it does not know the type of the other player?
[Figure omitted: leader–learner interaction on a payoff matrix with actions A0–A2 × B0–B2]

New Biased Rules
• Original biased rule:
– If agent i's k samples all contain the same a_{-i}, and that a_{-i} is part of at least one NE in D, the agent chooses its most recent best response to that strategy profile. For example, if B's samples show that A keeps playing A0 and B's most recent best response was B0, B sticks to B0.
• New biased rule:
– If agent i has multiple best-response actions w.r.t. its k samples, it chooses one that is included in an optimal NE of the VG. If there are several such choices, it chooses the one it has played most recently (see the sketch below).
• Difference between the old and new rules:
– The old rule biases action selection only when the others' joint strategy is part of an optimal NE; otherwise it randomizes over best-response actions.
– The new rule always biases the agent's action selection.
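A sketch of the new rule (my own code and interpretation of the tie-break): among the best-response actions to the sampled distribution, prefer those contained in an optimal NE of the VG, and break remaining ties by recency of the agent's own play.

```python
def new_biased_choice(best_responses, vg_optimal_ne, my_recent_actions):
    """best_responses: agent i's best-response actions to its k samples;
    vg_optimal_ne: joint actions that are optimal NE in the virtual game
    (index 0 assumed to be this agent's slot, an illustrative convention);
    my_recent_actions: agent i's own past actions, most recent last."""
    ne_actions = {ne[0] for ne in vg_optimal_ne}
    preferred = [a for a in best_responses if a in ne_actions]
    candidates = preferred if preferred else list(best_responses)
    # Tie-break: the candidate played most recently; fall back to the first one.
    for a in reversed(my_recent_actions):
        if a in candidates:
            return a
    return candidates[0]

# Example: two best responses (0 and 2), only action 0 is part of an optimal NE.
print(new_biased_choice(best_responses=[0, 2],
                        vg_optimal_ne={(0, 0), (1, 1)},
                        my_recent_actions=[2, 1, 0]))    # -> 0
```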

Example
• The new rule preserves the convergence properties in n-agent team Markov games.

Extension: Beyond Team Games
• How can the ideas of PMPS play be extended to general coordination games?
– To simplify the setting, we consider a class of coordination stage games with the following properties: the games have at least one pure-strategy NE, and the agents have compatible preferences for some of these NE over any other steady state (such as a mixed-strategy NE or a best-response loop).
– We consider two situations: perfect monitoring, where agents can observe the others' actions and payoffs, and imperfect monitoring, where agents only observe the others' actions.
– In both cases, agents may have no information about the game structure.

Perfect Monitoring
• Follows the same idea as OAL.
• Algorithm:
– Learning of coordination: compute all the NE of the estimated game and find all the NE that are ε-dominated. For example, a strategy profile (a, b) is ε-dominated by (a', b') if Q_A(a, b) < Q_A(a', b') − ε and Q_B(a, b) ≤ Q_B(a', b') + ε. Construct a VG containing all the NE that are not ε-dominated, setting the other entries of the VG to zero (without loss of generality, suppose the agents normalize their payoffs to values between zero and one). With GLIE exploration, play BAP over the VG (see the sketch below).
– Learning of game structure: observe the others' payoffs and update the sample means of the agents' expected payoffs in the game matrix. Compute an ε-bound in the same way as in OAL.
• The learning process over this class of coordination stage games is conjectured to converge w.p. 1 to an NE that is not Pareto dominated.
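A sketch of the ε-dominance filter for a 2-player game under perfect monitoring (my own code and example payoffs; I check the condition from either player's side, which is my interpretation of noise-robust Pareto dominance):

```python
import numpy as np

def eps_dominated(ne, other, QA, QB, eps):
    """(a, b) is eps-dominated by (a', b') if one player loses more than eps
    and the other does not gain more than eps (noise-robust Pareto dominance)."""
    a, b = ne
    a2, b2 = other
    worse_for_A = QA[a, b] < QA[a2, b2] - eps and QB[a, b] <= QB[a2, b2] + eps
    worse_for_B = QB[a, b] < QB[a2, b2] - eps and QA[a, b] <= QA[a2, b2] + eps
    return worse_for_A or worse_for_B

def undominated_ne(pure_ne, QA, QB, eps):
    """Keep the pure NE that are not eps-dominated by any other pure NE."""
    return [ne for ne in pure_ne
            if not any(eps_dominated(ne, other, QA, QB, eps)
                       for other in pure_ne if other != ne)]

# Hypothetical estimated payoffs (normalized to [0, 1]) with two pure NE.
QA = np.array([[1.0, 0.0], [0.0, 0.4]])
QB = np.array([[0.9, 0.0], [0.0, 0.4]])
pure_ne = [(0, 0), (1, 1)]
print(undominated_ne(pure_ne, QA, QB, eps=0.1))   # [(0, 0)]: (1, 1) is eps-dominated
```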

Imperfect Monitoring
• In general, it is difficult to eliminate sub-optimal NE without knowing the others' payoffs. Consider the simplest case: two learning agents have at least one common interest (a strategy profile that maximizes both agents' payoffs).
• For this game, agents can learn to play an optimal NE with a modified version of OAL (using new biased rules).
– Biased rules: 1) each agent randomizes its action selection whenever the payoff of its best-response actions is zero in the virtual game; 2) each agent biases its action toward its recent best response if all its k samples contain the same individual action of the other agent, more than m − k recorded joint actions have this property, and the agent has multiple best responses giving it payoff 1 w.r.t. its k samples. Otherwise, it randomly chooses a best-response action.
• For this type of coordination stage game, the learning process is conjectured to converge to an optimal NE. The result can be extended to Markov games.

Example
Bimatrix game (row player A, column player B); entries are (A's payoff, B's payoff):
         B0       B1       B2
A0     (1,0)    (0,0)    (0,1)
A1     (0,0)    (1,1)    (0,0)
A2     (0,1)    (0,0)    (1,0)

Conclusions and Future Work
• In this research, we study RL techniques that let agents play an optimal NE (one not Pareto dominated by any other NE) in coordination games when the environment model is unknown beforehand.
• We start with team games and propose the OAL algorithm, the first algorithm that guarantees convergence to an optimal NE in any team Markov game.
• We further generalize the basic ideas of OAL and propose a new approach to learning in games, called partially myopic and partially strategic (PMPS) play.
• We extend PMPS play beyond self-play and beyond team games; some of the results can be extended to Markov games.
• In future research, we will further explore applications of PMPS play in coordination games. In particular, we will study how to eliminate sub-optimal NE in imperfect-monitoring environments.