
1 A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games. Enrique Munoz de Cote, Michael L. Littman. Rutgers University.

2 Main Result
Given a repeated stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game, specifically one whose payoffs match the egalitarian point. Concretely, we address this computational problem.
[Figure: convex hull of the average payoffs in (v1, v2) space, with the egalitarian line.]

3 Framework

                      single state                multiple states
    single agent      decision theory, planning   MDPs
    multiple agents   matrix games                stochastic games

4 Stochastic Games (SG)
 A superset of MDPs and normal-form games (NFGs).
 S is the set of states.
 T is the transition function, such that the next-state probabilities T(s' | s, a1, a2) sum to 1 for every state and joint action.
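A minimal sketch, not the authors' code, of how such a two-player SG could be encoded; the field names are mine, and the constants anticipate the chicken example on the next slide:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    State = int
    JointAction = Tuple[str, str]  # one action per player, e.g. ("U", "R")

    @dataclass
    class StochasticGame:
        """Two-player stochastic game: a superset of MDPs and normal-form games."""
        states: List[State]
        actions: List[str]  # action set (shared by both players here)
        # transitions[(s, a1, a2)] -> list of (next_state, prob); probs sum to 1
        transitions: Dict[Tuple[State, str, str], List[Tuple[State, float]]]
        # rewards[(s, a1, a2)] -> (reward to player 1, reward to player 2)
        rewards: Dict[Tuple[State, str, str], Tuple[float, float]]
        discount: float = 0.95

    # Constants from the chicken example on the next slide:
    COLLISION, STEP_COST, GOAL, GAMMA = -5.0, -1.0, 100.0, 0.95
    ACTIONS = ["U", "D", "L", "R", "X"]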

5 A Computational Example: an SG version of chicken [Hu & Wellman, 03]
 Actions: U, D, R, L, X.
 Coin flip on collision.
 Semiwalls (passable with probability 50%).
 Collision = -5; step cost = -1; goal = +100; discount factor = 0.95.
 Both players can reach the goal.
[Figure: grid world with players A and B and a shared goal $.]

6 Strategies on the SG of chicken (discount factor 0.95)
Average expected rewards of representative strategy pairs:
 (88.3, 43.7);
 (43.7, 88.3);
 (66, 66);
 (43.7, 43.7);
 (38.7, 38.7);
 (83.6, 83.6).
[Figure: the grid world with players A, B and goal $.]

7 Equilibrium values
Average total reward at equilibrium:
 Nash: (88.3, 43.7) very imbalanced, inefficient; (43.7, 88.3) very imbalanced, inefficient; (53.6, 53.6) ½ mix, still inefficient.
 Correlated: ([43.7, 88.3], [43.7, 88.3]).
 Minimax: (43.7, 43.7).
 Friend: (38.7, 38.7).
Nash equilibria are computationally difficult to find in general.
[Figure: the grid world with players A, B and goal $.]

8 Repeated Games
What if players are allowed to play multiple times?
 Many more equilibrium alternatives (folk theorems).
 Equilibrium strategies can depend on past interactions and can be randomized.
 A Nash equilibrium still exists.
[Figure: convex hull of the average payoffs in (v1, v2) space.]

9 Nash equilibrium of the repeated game [folk theorems]
For any set of average payoffs that is (i) strictly enforceable and (ii) feasible, there exist equilibrium strategy profiles that achieve those payoffs.
 Mutual-advantage strategies: up and to the right of the disagreement point v = (v1, v2).
 Threats: attack strategies that punish deviations.
[Figure: feasible region in (v1, v2) space with disagreement point v.]
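In symbols (my notation, a standard folk-theorem formulation rather than one quoted from the paper), a payoff vector p is achievable by an equilibrium of the repeated game when

    \[
    p \in \mathrm{conv}\{\text{average payoff vectors}\} \ (\text{feasible})
    \quad\text{and}\quad
    p_i > v_i \ \text{for each player } i \ (\text{strictly enforceable}),
    \]

where v = (v1, v2) is the disagreement (minimax) point.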

10 Egalitarian equilibrium point
A conceptual drawback of the folk theorems: infinitely many feasible and enforceable strategies.
 Egalitarian point P: maximizes the minimum advantage of the players' rewards over the disagreement point.
 Egalitarian line: the line where both payoffs are equally high above v.
[Figure: convex hull of the average payoffs with disagreement point v, the egalitarian line, and egalitarian point P.]
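The same definitions in formulas (my notation; F stands for the feasible set, i.e., the convex hull of average payoffs):

    \[
    P = \arg\max_{p \in F} \ \min_{i \in \{1,2\}} (p_i - v_i),
    \qquad
    \text{egalitarian line} = \{\, p : p_1 - v_1 = p_2 - v_2 \,\}.
    \]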

11 How? (the short-story version)
 Compute attack and defense strategies by solving two linear programming problems (a sketch follows).
 The algorithm then searches along the egalitarian line for the point P with the highest egalitarian value.
[Figure: convex hull of a hypothetical SG with the egalitarian line and point P.]
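The slide does not spell out the two LPs; as a hedged illustration, here is the classic maximin LP for a matrix game, the single-state core of computing a defense (security) strategy. The function name and the 2x2 payoff matrix are mine; the attack strategy is the opponent's minimizing counterpart of the same program.

    import numpy as np
    from scipy.optimize import linprog

    def security_strategy(R):
        """Defense (maximin) strategy for the row player of payoff matrix R.

        Solves  max_x min_j sum_i x_i * R[i, j]  as one LP over
        variables [x_1..x_m, v], minimizing -v.
        """
        m, n = R.shape
        c = np.zeros(m + 1)
        c[-1] = -1.0                                   # minimize -v == maximize v
        A_ub = np.hstack([-R.T, np.ones((n, 1))])      # v - x.R[:, j] <= 0 for all j
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # x sums to 1
        b_eq = np.ones(1)
        bounds = [(0, 1)] * m + [(None, None)]         # v is unbounded
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:m], res.x[-1]                    # mixed strategy, security level

    # Hypothetical 2x2 example:
    strategy, value = security_strategy(np.array([[0.0, -1.0], [1.0, -5.0]]))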

12 Game representation
 Folk theorems can be interpreted computationally: matrix form [Littman & Stone, 2005]; stochastic game form [Munoz de Cote & Littman, 2008].
 Define a weighted combination value σ_w(p_π) (a reconstruction follows).
 A strategy profile π that achieves σ_w(p_π) can be found by modeling an MDP.
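The slide's formula is not reproduced in this transcript; a plausible reconstruction, consistent with the later use of MDP(0), MDP(1), and MDP(w), is the linear scalarization

    \[
    \sigma_w(p_\pi) = w\, p_\pi^1 + (1 - w)\, p_\pi^2,
    \qquad w \in [0, 1],
    \]

where p_π^i is player i's average payoff under the joint profile π.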

13 Markov Decision Processes
 We use MDPs to model the two players as a single meta-player. Return: a joint strategy profile that maximizes a weighted combination of the players' payoffs (sketched below).
 Friend solutions: (R0, π1) = MDP(1), (L0, π2) = MDP(0).
 A weighted solution: (P, π) = MDP(w).
[Figure: feasible region with disagreement point v and points R0, L0, P.]
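A hedged sketch of MDP(w): collapse the two players into one meta-player over joint actions, scalarize the reward pair by w, and run value iteration. All names are mine; the encoding matches the StochasticGame sketch above.

    def solve_mdp_w(states, joint_actions, T, R, w, gamma=0.95, iters=1000):
        """Value iteration for the meta-player MDP(w).

        T[(s, a)] -> list of (next_state, prob); R[(s, a)] -> (r1, r2),
        where a is a joint action. Maximizes w*r1 + (1 - w)*r2.
        """
        def q(s, a, V):
            r1, r2 = R[(s, a)]
            return (w * r1 + (1 - w) * r2
                    + gamma * sum(p * V[s2] for s2, p in T[(s, a)]))

        V = {s: 0.0 for s in states}
        for _ in range(iters):
            V = {s: max(q(s, a, V) for a in joint_actions) for s in states}
        policy = {s: max(joint_actions, key=lambda a: q(s, a, V)) for s in states}
        return policy, V

    # Friend solutions from the slide: MDP(1) favors player 1, MDP(0) player 2,
    # e.g. policy_R0, _ = solve_mdp_w(states, joint_actions, T, R, w=1.0)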

14 The algorithm: FolkEgal(U1, U2, ε)
 Compute attack1, attack2, defense1, defense2, and the friend points R = friend1, L = friend2.
 Find the egalitarian point and its strategy profile (the side-of-line test is sketched below):
   if R is left of the egalitarian line: P = R
   else if L is right of the egalitarian line: P = L
   else: P = EgalSearch(R, L, T)
[Figure: convex hull of a hypothetical SG with the egalitarian line and the points L, R, P.]
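The branch tests compare a payoff point against the egalitarian line through the disagreement point v; a minimal sketch of that geometric test (helper names are mine, not from the paper):

    def side_of_egalitarian_line(p, v):
        """Positive if p lies right of (below) the 45-degree egalitarian line
        through v, negative if left of (above) it, zero if on it."""
        return (p[0] - v[0]) - (p[1] - v[1])

    def pick_branch(R, L, v):
        if side_of_egalitarian_line(R, v) < 0:   # even friend1's point is left
            return "P = R"
        if side_of_egalitarian_line(L, v) > 0:   # even friend2's point is right
            return "P = L"
        return "P = EgalSearch(R, L, T)"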

15 The key subroutine: EgalSearch(L, R, T)
 Finds the intersection between X and the egalitarian line; close to a binary search.
 Input: a point L (to the left of the egalitarian line), a point R (to the right of it), and a bound T on the number of iterations.
 Return: the egalitarian point P (with accuracy ε).
 Each iteration solves an MDP(w) (sketched below).
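A hedged sketch of the search as bisection over the weight w, re-solving MDP(w) each iteration; payoffs_of would wrap the solve_mdp_w sketch above and return the payoff pair of the resulting joint policy (names and stopping rule are assumptions):

    def egal_search(payoffs_of, v, T):
        """Bisect between w = 0 (favoring player 2, point L) and w = 1
        (favoring player 1, point R), homing in on the payoff pair closest
        to the egalitarian line through the disagreement point v."""
        lo, hi = 0.0, 1.0
        p = None
        for _ in range(T):                        # T bounds the iterations
            w = (lo + hi) / 2.0
            p = payoffs_of(w)                     # solve MDP(w), get (p1, p2)
            if (p[0] - v[0]) > (p[1] - v[1]):     # p lies right of the line
                hi = w                            # shift weight toward player 2
            else:
                lo = w
        return p                                  # approximately egalitarian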

16 Complexity
 Disagreement point (to accuracy ε): computed in time polynomial in 1 / (1 − γ), 1 / ε, and U_max.
 MDPs are solved in polynomial time [Puterman, 1994].
 The algorithm is polynomial iff the iteration bound T is bounded by a polynomial.
Result: the running time is polynomial in the discount-factor horizon 1 / (1 − γ), the approximation factor 1 / ε, and the magnitude of the largest utility U_max.

17 SG version of the PD game

    Algorithm     Agent A   Agent B   Behavior
    security-VI   46.5      46.5      mutual defection
    friend-VI     46        46        mutual defection
    CE-VI         46.5      46.5      mutual defection
    folkEgal      88.8      88.8      mutual cooperation with threat of defection

[Figure: grid world with players A and B and goals $A, $B.]

18 Compromise game

    Algorithm     Agent A   Agent B   Behavior
    security-VI   0         0         attacker blocking goal
    friend-VI     -20       -20       mutual defection
    CE-VI         68.2      70.1      suboptimal waiting strategy
    folkEgal      78.7      78.7      mutual cooperation (w = 0.5) with threat of defection

[Figure: grid world with players A and B and goals $A, $B.]

19 Asymmetric game

    Algorithm     Agent A   Agent B   Behavior
    security-VI   0         0         attacker blocking goal
    friend-VI     -20       0         mutual defection
    CE-VI         32.1      32.1      suboptimal mutual cooperation
    folkEgal      37.2      37.2      mutual cooperation with threat of defection

[Figure: grid world with players A and B, one goal $B and two goals $A.]

20 Thanks for your attention! (Rutgers University)

