
1 Nash Q-Learning for General-Sum Stochastic Games. Hu & Wellman. March 6th, 2006, CS286r. Presented by Ilan Lobel

2 Outline
– Stochastic Games and Markov Perfect Equilibria
– Bellman’s Operator as a Contraction Mapping
– Stochastic Approximation of a Contraction Mapping
– Application to Zero-Sum Markov Games
– Minimax-Q Learning
– Theory of Nash-Q Learning
– Empirical Testing of Nash-Q Learning

3 How do we model games that evolve over time? Stochastic Games! Current Game = State. Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)

4 Example of a Stochastic Game. The slide shows two stage games as payoff matrices (player 1 chooses rows A, B; player 2 chooses columns C, D, and in the second game also E); the first has payoffs (1,2) (3,4) / (5,6) (7,8), the second has negative payoffs such as (-1,2), (-7,8), and (-10,10). Play moves to the other state with 30% probability when (B,D) is played and with 50% probability when (A,C) or (A,D) is played. δ = 0.9

5-6 A Markov Game is a generalization of both Repeated Games (add states) and MDPs (add agents).

7 Markov Perfect Equilibrium (MPE). A strategy maps states into randomized actions: π_i : S → Δ(A). No agent has an incentive to unilaterally change her policy.
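A compact way to write the no-unilateral-deviation condition (standard notation assumed here, not taken from the slide): writing $v_i(s; \pi)$ for player $i$'s discounted value from state $s$ under the joint Markov policy $\pi$, a profile $\pi^* = (\pi_1^*, \dots, \pi_N^*)$ is an MPE if
\[ v_i(s;\, \pi_i^*, \pi_{-i}^*) \;\ge\; v_i(s;\, \pi_i, \pi_{-i}^*) \qquad \text{for all states } s,\ \text{all players } i,\ \text{and all Markov strategies } \pi_i . \]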

8 Cons & Pros of MPEs
Cons:
– Can’t implement everything described by the Folk Theorems (i.e., no trigger strategies)
Pros:
– MPEs always exist in finite Markov Games (Fink, 64)
– Easier to “search for”

9 Learning in Stochastic Games. Learning is especially important in Markov Games because MPEs are hard to compute. Do we know: – Our own payoffs? – Others’ rewards? – Transition probabilities? – Others’ strategies?

10 Learning in Stochastic Games Adapted from Reinforcement Learning: – Minimax-Q Learning (zero-sum games) – Nash-Q Learning – CE-Q Learning

11 Zero-Sum Stochastic Games. Nice properties: – All equilibria have the same value. – Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE. – There is a Bellman-type equation.

12 Bellman’s Equation in DP. Define the Bellman operator T; Bellman’s equation can then be rewritten as a fixed-point equation for T.
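The slide's equations did not survive extraction; in the single-agent case the standard forms, using the notation of slide 3 (discount δ, rewards R, transitions P), are
\[ (T V)(s) \;=\; \max_{a}\Bigl[ R(s,a) + \delta \sum_{s'} P(s' \mid s, a)\, V(s') \Bigr], \qquad \text{Bellman's equation:}\quad V^* \;=\; T V^* . \]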

13 Contraction Mapping. The Bellman operator has the contraction property; Bellman’s equation (existence and uniqueness of its solution) is a direct consequence of the contraction.
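The contraction property referred to here is, in the sup norm (standard statement, since the slide's inequality is not in the transcript),
\[ \lVert T V - T W \rVert_\infty \;\le\; \delta\, \lVert V - W \rVert_\infty \quad \text{for all } V, W, \]
so by the Banach fixed-point theorem $T$ has a unique fixed point $V^*$, i.e., Bellman's equation $V^* = T V^*$ has a unique solution.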

14 The Shapley Operator for Zero-Sum Stochastic Games. The Shapley operator is a contraction mapping (Shapley, 53). Hence, it also has a fixed point, which corresponds to an MPE.
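The operator itself is not in the transcript; the standard form replaces the max of the Bellman operator with the minimax value of a stage matrix game:
\[ (T V)(s) \;=\; \operatorname{val}\bigl[ Q_V(s) \bigr], \qquad Q_V(s)(a^1, a^2) \;=\; R(s, a^1, a^2) + \delta \sum_{s'} P(s' \mid s, a^1, a^2)\, V(s'), \]
\[ \operatorname{val}(M) \;=\; \max_{x \in \Delta(A^1)}\ \min_{y \in \Delta(A^2)}\ x^{\top} M\, y . \]
Its fixed point is the value of the zero-sum stochastic game, and the optimizing mixed actions at each state form an MPE.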

15 Value Iteration for Zero-Sum Stochastic Games. A direct consequence of the contraction: repeated application of the operator converges to its fixed point.
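A minimal sketch of this value iteration, assuming tabular arrays R[s, a1, a2] (player 1's reward) and P[s, a1, a2, s'] (transition probabilities), and solving each stage matrix game by linear programming; the function and variable names here are illustrative, not from the paper or the slides:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M (row player maximizes).

    LP: maximize v subject to sum_i x_i * M[i, j] >= v for every column j,
    with x a probability distribution over rows.
    """
    n_rows, n_cols = M.shape
    # Variables: x_1..x_n (row mixed strategy) and v (game value); linprog minimizes.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                    # maximize v  <=>  minimize -v
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])  # v - x^T M e_j <= 0 for each column j
    b_ub = np.zeros(n_cols)
    A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)  # sum_i x_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def value_iteration(R, P, delta=0.9, tol=1e-6):
    """Shapley's value iteration for a two-player zero-sum stochastic game.

    R[s, a1, a2]     : reward to player 1 (player 2 receives -R)
    P[s, a1, a2, s'] : transition probabilities
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Stage-game payoff matrix: current reward + discounted continuation value.
            M = R[s] + delta * P[s] @ V
            V_new[s] = matrix_game_value(M)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The loop terminates because each sweep applies the contraction operator, so the sup-norm error shrinks by a factor of δ per iteration.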

16 Q-Learning. Another consequence of a contraction mapping: Q-Learning converges! Q-Learning can be described as an approximation of value iteration: value iteration with noise.
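Concretely, the single-agent Q-learning update (standard form; the slide's equation, if any, is not in the transcript) is
\[ Q_{k+1}(s_k, a_k) \;=\; (1 - \alpha_k)\, Q_k(s_k, a_k) \;+\; \alpha_k \Bigl[ r_k + \delta \max_{a'} Q_k(s_{k+1}, a') \Bigr], \]
which is value iteration on Q-factors with the expectation over next states replaced by a single noisy sample.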

17 Q-Learning Convergence Q-Learning is called a Stochastic Iterative Approximation of Bellman’s operator: – Learning Rate of 1/t. – Noise is zero-mean and has bounded variance. It converges if all state-action pairs are visited infinitely often. (Neuro-Dynamic Programming – Bertsekas, Tsitsiklis)
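In symbols (my formalization of the slide's bullet points): writing the update as $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t (T Q_t + w_t)$ with noise term $w_t$, the assumptions are
\[ \alpha_t = \tfrac{1}{t}, \qquad \sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty, \qquad \mathbb{E}[\, w_t \mid \mathcal{F}_t \,] = 0, \qquad \mathbb{E}[\, w_t^2 \mid \mathcal{F}_t \,] \le C, \]
where the learning rate is counted per visit to each state-action pair.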

18 Minimax-Q Learning Algorithm for Zero-Sum Stochastic Games. Initialize your Q_0(s, a^1, a^2) for all states and actions. Update rule: see the equation below. Player 1 then chooses action u^1 in the next stage s_{k+1}.
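The update rule referenced above is, in Littman's minimax-Q form (reconstructed here because the slide's equation is not in the transcript):
\[ Q_{k+1}(s_k, a^1_k, a^2_k) \;=\; (1 - \alpha_k)\, Q_k(s_k, a^1_k, a^2_k) \;+\; \alpha_k \Bigl[ r^1_k + \delta\, \operatorname{val}\bigl[ Q_k(s_{k+1}, \cdot, \cdot) \bigr] \Bigr], \]
where $\operatorname{val}[\cdot]$ is the minimax value of the stage game at the next state, exactly as in the Shapley operator, and $u^1$ is drawn from the maximin mixed action at $s_{k+1}$ (possibly with exploration).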

19 Minimax-Q Learning. It’s a stochastic iterative approximation of the Shapley operator. It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 96)

20 Can we extend it to General-Sum Stochastic Games? Yes & No. Nash-Q Learning is such an extension. However, it has much worse computational and theoretical properties.

21 Nash-Q Learning Algorithm. Initialize Q_0^j(s, a^1, a^2) for all states and actions and for every agent j. – You must simulate everyone’s Q-factors. Update rule: see the equation below. Choose the randomized action generated by the Nash operator.
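The update rule referenced above is, in Hu & Wellman's notation (reconstructed here because the slide's equation is not in the transcript): for every agent $j$,
\[ Q^j_{k+1}(s_k, a^1_k, \dots, a^N_k) \;=\; (1 - \alpha_k)\, Q^j_k(s_k, a^1_k, \dots, a^N_k) \;+\; \alpha_k \Bigl[ r^j_k + \delta\, \mathrm{Nash}^j\bigl( Q^1_k(s_{k+1}), \dots, Q^N_k(s_{k+1}) \bigr) \Bigr], \]
where $\mathrm{Nash}^j(\cdot)$ is agent $j$'s payoff in a selected Nash equilibrium of the stage game whose payoff matrices are the current Q-factors at $s_{k+1}$.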

22 The Nash Operator and the Principle of Optimality. The Nash operator finds the Nash equilibrium of a stage game: find the Nash of the stage game with the Q-factors as your payoffs. These Q-factors decompose into the current reward plus the payoffs for the rest of the Markov game.
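In equation form (my reconstruction of the decomposition the slide annotates as "current reward" plus "payoffs for the rest of the Markov game"):
\[ Q^j_*(s, a^1, \dots, a^N) \;=\; R^j(s, a^1, \dots, a^N) \;+\; \delta \sum_{s'} P(s' \mid s, a^1, \dots, a^N)\, v^j(s', \pi^*), \]
where $v^j(s', \pi^*)$ is agent $j$'s value from state $s'$ onward under the MPE $\pi^*$.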

23 The Nash Operator. Unknown complexity even for 2 players. In comparison, the minimax operator can be solved in polynomial time (there’s a linear programming formulation). For convergence, all players must break ties in favor of the same Nash Equilibrium. Why not go model-based if computation is so expensive?
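The linear programming formulation mentioned here is the standard one for the minimax value of a matrix game $M$ (the same LP used in the value-iteration sketch above):
\[ \max_{x,\, v}\ v \quad \text{s.t.} \quad \sum_{a^1} x(a^1)\, M(a^1, a^2) \;\ge\; v \ \ \text{for every } a^2, \qquad \sum_{a^1} x(a^1) = 1, \qquad x \ge 0 . \]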

24 Convergence Results If every stage game encountered during learning has a global optimum, Nash-Q converges. If every stage game encountered during learning has a saddle point, Nash-Q converges. Both of these are VERY strong assumptions.
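For concreteness, the two conditions are roughly as follows (my paraphrase of Hu & Wellman's definitions, with $Q^j$ denoting agent $j$'s stage-game payoff): a joint strategy $\sigma$ is a global optimum if every agent gets its highest possible payoff there, i.e., $Q^j(\sigma) \ge Q^j(\sigma')$ for all $\sigma'$ and all $j$; a Nash equilibrium $\sigma$ is a saddle point if, in addition to the equilibrium condition, each agent does at least as well when the others deviate, i.e., $Q^j(\sigma^j, \hat\sigma^{-j}) \ge Q^j(\sigma)$ for all $\hat\sigma^{-j}$ and all $j$.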

25 Convergence Result Analysis. The global optimum assumption implies full cooperation between agents. The saddle point assumption implies no cooperation between agents. Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

26 Empirical Testing: The Grid-world. (Figure: Grid World 1, with some of its Nash equilibria shown.)

27 Empirical Testing: Nash Equilibria. (Figure: Grid World 2, with all Nash equilibria shown, labeled 97% and 3%.)

28 Empirical Performance. In very small and simple games, Nash-Q learning often converged even though the theory did not predict it. In particular, when all Nash Equilibria have the same value, Nash-Q did better than expected.

29 Conclusions Nash-Q is a nice step forward: – It can be used for any Markov Game. – It uses the Principle of Optimality in a smart way. But there is still a long way to go: – Convergence results are weak. – There are no computational complexity results.

