
1 Reinforcement Learning on Markov Games
Nilanjan Dasgupta
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
Machine Learning Seminar Series

2 Overview
- Markov Decision Processes (MDPs) and Markov games.
- Optimal policy search: value iteration (VI), policy iteration (PI), reinforcement learning (RL).
- Minimax-Q learning for zero-sum (ZS) games.
- Quantitative comparison of the Minimax-Q and Q-learning algorithms on ZS games.
- Limitations of Minimax-Q and development of the Nash-Q learning algorithm.
- Constraints of Nash-Q and development of Friend-or-Foe Q-learning (FFQ).
- Brief discussion of Partially Observable Stochastic Games (POSG): the POMDP analogue for multi-agent stochastic games.

3 MDP and Markov Games

MDP: a single agent operating in a fixed environment ("world").
- Represented by the tuple {S, A, T, R}, with T : S × A → p(S) and R : S × A → ℝ.
- The agent's objective is to find the optimal strategy (a mapping from states to actions) that maximizes the expected reward.

Markov game: multiple agents operating in a shared environment.
- Represented by the tuple {S, A_1, ..., A_n, T, R_1, ..., R_n}, with T : S × A_1 × ... × A_n → p(S) and R_i : S × A_1 × ... × A_n → ℝ.
- Agent i's objective is to maximize its own expected reward.

N: length of the horizon; γ: discount factor.
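The tuple notation above maps directly onto a simple container. As a minimal sketch (the class and field names are hypothetical, not from the talk), a finite two-player Markov game could be stored as follows; the MDP is the special case with a single action set and a single reward function:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int
JointAction = Tuple[Action, Action]  # (a1, a2)

@dataclass
class TwoPlayerMarkovGame:
    """Finite two-player Markov game {S, A1, A2, T, R1, R2} (illustrative names)."""
    states: List[State]
    actions_1: List[Action]                                        # A1
    actions_2: List[Action]                                        # A2
    # T[s][(a1, a2)] is a distribution over next states: {s_next: probability}
    transition: Dict[State, Dict[JointAction, Dict[State, float]]]
    # R_i[s][(a1, a2)] is player i's immediate reward for the joint action taken in s
    reward_1: Dict[State, Dict[JointAction, float]]
    reward_2: Dict[State, Dict[JointAction, float]]
    gamma: float = 0.9                                             # discount factor
```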

4 MG = {S, A_1, ..., A_n, T, R_1, ..., R_n}
- When |S| = 1 (single state), the Markov game reduces to a matrix game.
- When n = 1 (single agent), the Markov game reduces to an MDP.
- For a two-player zero-sum (ZS) game, there is effectively a single reward function, with the agents having diametrically opposite goals. For example, the two-player zero-sum matrix game below (rows: Agent's action, columns: Opponent's action, entries: (agent, opponent) payoffs):

              rock     paper    wood
    rock      0,0      1,-1     -1,1
    paper     -1,1     0,0      1,-1
    wood      1,-1     -1,1     0,0

5 Optimal Policy: Matrix Games

R_{o,a} is the reward the agent receives for taking action a while the opponent takes action o. The agent strives to maximize the expected reward while the opponent tries to minimize it. The agent's payoff matrix for the rock-paper-wood game (rows: opponent's action o, columns: agent's action a):

              rock    paper    wood
    rock       0       1       -1
    paper     -1       0        1
    wood       1      -1        0

For a strategy π to be optimal it needs to satisfy the minimax condition below, i.e., we look for the strategy for the agent that has the best "worst-case" outcome.
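The condition itself is an image on the original slide; a standard way to state it, using the slide's R_{o,a} notation and writing PD(A) for the set of probability distributions over the agent's actions, is:

```latex
\pi^{*} \;=\; \arg\max_{\pi \in PD(A)} \;\min_{o \in O} \;\sum_{a \in A} R_{o,a}\,\pi_{a},
\qquad
V \;=\; \max_{\pi \in PD(A)} \;\min_{o \in O} \;\sum_{a \in A} R_{o,a}\,\pi_{a}.
```

For rock-paper-wood this gives the uniform mixed strategy (1/3, 1/3, 1/3) with value 0.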

6 Optimal Policy: MDPs and Markov Games

There exists a host of methods, such as value iteration and policy iteration (both of which assume complete knowledge of T) and reinforcement learning (RL).

Value iteration (VI): uses dynamic programming to estimate value functions; convergence is guaranteed [Bertsekas, 1987]. The updates for an MDP and for a zero-sum Markov game are sketched below.
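The side-by-side equations on the original slide are images; as a hedged reconstruction of the standard updates (the Markov-game version follows Littman's zero-sum formulation):

```latex
\text{MDP:}\qquad
V(s) \;=\; \max_{a \in A}\Big[ R(s,a) \;+\; \gamma \sum_{s'} T(s,a,s')\,V(s') \Big]

\text{Zero-sum Markov game:}\qquad
Q(s,a,o) \;=\; R(s,a,o) \;+\; \gamma \sum_{s'} T(s,a,o,s')\,V(s'),
\qquad
V(s) \;=\; \max_{\pi \in PD(A)} \;\min_{o \in O} \;\sum_{a \in A} \pi_{a}\, Q(s,a,o)
```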

7 Optimal Policy: Markov Games

Note:
- Every MDP has at least one stationary, deterministic optimal policy.
- A Markov game may have no optimal policy that is stationary and deterministic. The reason is the agent's uncertainty about the opponent's exact move, especially when the agents move simultaneously (unlike alternating-move games such as tic-tac-toe): a deterministic strategy could be predicted and exploited, so the optimum is in general a mixed strategy.

8 Learning the Optimal Policy: Reinforcement Learning

Q-learning was first developed by Watkins in 1989 for learning an optimal MDP policy without explicit knowledge of T. The agent receives a reward r while making the transition from s to s' by taking action a; T(s, a, s') is only implicitly involved in the sampled transition.

Minimax-Q applies the same principle to the two-player ZS game, with the state values obtained via a linear program (LP), as sketched below.
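As a minimal sketch of that LP step (not the author's implementation; the function names and the use of scipy.optimize.linprog are my choices), the value V(s) = max_π min_o Σ_a π_a Q(s,a,o) and the corresponding mixed strategy can be computed from a tabular Q as follows:

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Solve V(s) = max_pi min_o sum_a pi[a] * Q_s[a, o] as a linear program.

    Q_s: (num_agent_actions x num_opponent_actions) array of Q(s, a, o).
    Returns (value, pi), where pi is the agent's mixed strategy at s.
    """
    n_a, n_o = Q_s.shape
    # Decision variables: [pi_1, ..., pi_nA, v]; maximizing v <=> minimizing -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a pi[a] * Q_s[a, o] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The strategy probabilities must sum to one.
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]   # pi in [0, 1], v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[-1], res.x[:-1]

def minimax_q_step(Q, V, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    """One Minimax-Q update after observing the transition (s, a, o, r, s_next).

    Q: dict state -> (n_a x n_o) array;  V: dict state -> float.
    """
    Q[s][a, o] = (1 - alpha) * Q[s][a, o] + alpha * (r + gamma * V[s_next])
    V[s], _ = minimax_value(Q[s])
```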

9 Performance of Minimax-Q

Software soccer game on a 4x5 grid, with 20 states and 5 actions {N, S, E, W, Stand}. Four different policies were compared:
1. Minimax-Q trained against a random opponent
2. Minimax-Q trained against a Minimax-Q opponent
3. Q-learning trained against a random opponent
4. Q-learning trained against a Q-learning opponent

10 Constraints of Minimax-Q: Nash-Q

Minimax-Q's convergence is guaranteed only for two-player zero-sum games. Nash-Q, proposed by Hu & Wellman (1998), maintains a set of approximate Q-functions, one per player, and updates them with the rule sketched below (the Minimax-Q rule it generalizes is also shown for comparison). The Nash operator in the update is the one-stage Nash-equilibrium policy of player k computed from the current estimates {Q_1, ..., Q_n}.
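The two update rules referenced above are images on the original slide; hedged reconstructions of the standard forms (my notation) are:

```latex
\text{Minimax-Q:}\qquad
Q(s,a,o) \;\leftarrow\; (1-\alpha)\,Q(s,a,o) \;+\; \alpha\big[\, r + \gamma\, V(s') \,\big],
\qquad
V(s') = \max_{\pi \in PD(A)} \min_{o'} \sum_{a'} \pi_{a'}\, Q(s',a',o')

\text{Nash-Q (player $k$):}\qquad
Q^{k}(s, a^{1},\ldots,a^{n}) \;\leftarrow\; (1-\alpha)\,Q^{k}(s, a^{1},\ldots,a^{n})
\;+\; \alpha\big[\, r^{k} + \gamma\, \mathrm{Nash}_{k}\big(s',\, Q^{1},\ldots,Q^{n}\big) \,\big]
```

Here Nash_k(s', Q^1, ..., Q^n) denotes player k's expected payoff when every player follows a one-stage Nash equilibrium of the stage game defined by the current Q estimates at s'.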

11 Single-stage Nash Equilibrium

Let's explain via a classic example, the Battle of the Sexes (rows: Chris's action, columns: Pat's action, entries: (Chris, Pat) payoffs):

              Opera   Fight
    Opera      1,2     0,0
    Fight      0,0     2,1

Check that there exist two situations in which no player can single-handedly change his or her action and increase their own payoff. NE: (Opera, Opera) and (Fight, Fight).

For an n-player normal-form game, a joint strategy (σ_1, ..., σ_n) is a Nash equilibrium iff no player can improve its payoff by deviating unilaterally (see the condition below). Two types of Nash equilibrium: coordinated and adversarial. All normal-form non-cooperative games have a Nash equilibrium, though some may be in mixed strategies.
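The formal condition is an image on the original slide; a standard statement, writing V_k for player k's expected payoff, is:

```latex
(\sigma_{1},\ldots,\sigma_{n}) \text{ is a Nash equilibrium iff, for every player } k
\text{ and every alternative strategy } \sigma'_{k},
\qquad
V_{k}(\sigma_{1},\ldots,\sigma_{k},\ldots,\sigma_{n}) \;\ge\;
V_{k}(\sigma_{1},\ldots,\sigma'_{k},\ldots,\sigma_{n}).
```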

12 Analysis of Nash-Q Learning

Why update in this way? For a 1-player game (an MDP), Nash-Q reduces to simple maximization, i.e., ordinary Q-learning. For zero-sum games, Nash-Q reduces to Minimax-Q, with guaranteed convergence.

Cons: the Nash equilibrium is in general not unique (multiple equilibrium points may exist), so convergence is not guaranteed. Nash-Q is guaranteed to work when either:
- there exists a unique coordinated equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning, or
- there exists a unique adversarial equilibrium for the entire game and for each stage game defined by the Q-functions throughout learning.

13 Relaxing the Constraints of Nash-Q: Friend-or-Foe Q

Uniqueness of the NE is relaxed in FFQ, but the algorithm needs to know the nature of each opponent: "friend" (coordinated equilibrium) or "foe" (adversarial equilibrium). A convergence guarantee exists for the Friend-or-Foe Q-learning algorithm; its value operators are sketched below.
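The operator definitions are images on the original slide; in the two-player case the standard Friend-Q and Foe-Q value operators (a hedged reconstruction following Littman's FFQ formulation) are:

```latex
\text{Friend-Q:}\qquad
V_{1}(s) \;=\; \max_{a_{1}\in A_{1},\; a_{2}\in A_{2}} Q_{1}(s, a_{1}, a_{2})

\text{Foe-Q:}\qquad
V_{1}(s) \;=\; \max_{\pi \in PD(A_{1})} \;\min_{a_{2}\in A_{2}} \;\sum_{a_{1}\in A_{1}} \pi(a_{1})\, Q_{1}(s, a_{1}, a_{2})
```

Both plug into the same temporal-difference update used by Minimax-Q, differing only in how V_1(s') is computed.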

14 Analysis of FFQ Learning

- FFQ provides RL-based strategy learning in multi-player general-sum games.
- Like Nash-Q, it should not be expected to find a Nash equilibrium unless either a coordinated or an adversarial equilibrium exists.
- Unlike Nash-Q, FFQ does not require learning Q-estimates for all the players in the game, only for the learner itself.
- FFQ's restrictions are much weaker: it does not require the NE to be unique throughout learning.
- Both FFQ and Nash-Q fail for games that have no NE (e.g., certain infinite games).

15 Partial Observability of States: POSG

The entire discussion has assumed the states to be known, even though the transition probabilities and reward functions can be learned asynchronously (RL). Partially Observable Stochastic Games (POSG) assume the underlying states are only partially observed through observations.

Stochastic games are analogous to MDPs, and so is learning via RL. A POMDP can be interpreted as an MDP over belief space, with increased complexity due to the continuous belief space (see the belief update below). But a POSG cannot be solved by transforming it into a stochastic game over belief space, since each agent's belief is potentially different. E. A. Hansen et al. propose a policy iteration approach for POSG that alleviates the scaling issue in the finite-horizon case via iterative elimination of dominated strategies (policies).
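For reference (this equation is not on the slide), the standard POMDP belief update that makes the belief-space MDP interpretation precise, with O(o | s', a) denoting the observation model, is:

```latex
b'(s') \;=\; \frac{O(o \mid s', a)\, \sum_{s} T(s, a, s')\, b(s)}{\Pr(o \mid b, a)}
```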

16 Summary

- The theories of MDPs and Markov games are closely related.
- Minimax-Q is a Q-learning scheme proposed for two-player ZS games. It is very conservative in its choice of actions, since it picks the strategy that maximizes the agent's worst-case performance.
- Nash-Q was developed for multi-player, general-sum games but converges only under strict restrictions (existence and uniqueness of the NE).
- FFQ relaxes the uniqueness restriction somewhat, but not by much.
- Most algorithms are reactive, i.e., each agent lets the others choose an equilibrium point and then learns its best response.
- Under partial observability of the states, the problem does not yet scale.

