A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
Enrique Munoz de Cote and Michael L. Littman
Rutgers University
Main Result

Given a repeated stochastic game, return a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game (specifically, one whose payoffs match the egalitarian point) in polynomial time. Concretely, we address the following computational problem:

[Figure: convex hull of the average payoffs in (v1, v2) space, with the egalitarian line]
Framework

                 Single agent       Multiple agents
Single state     decision theory    matrix games
Multiple states  MDPs (planning)    stochastic games
Stochastic Games (SGs)

A superset of MDPs and normal-form games:
- S is the set of states
- T is the transition function
A Computational Example: the SG Version of Chicken [Hu & Wellman, 03]

- Actions: U, D, R, L, X
- Coin flip on collision
- Semiwalls (passable with probability 50%)
- Collision = -5; step cost = -1; goal = +100; discount factor = 0.95
- Both players can reach the goal

[Figure: grid world with players A and B and a shared goal $]
Strategies in the SG of Chicken

Average expected rewards (discount factor 0.95): (88.3, 43.7); (43.7, 88.3); (66, 66); (43.7, 43.7); (38.7, 38.7); (83.6, 83.6)

[Figure: grid world with players A and B and the goal $]
Equilibrium Values

Average total reward at equilibrium:
- Nash: (88.3, 43.7) very imbalanced, inefficient; (43.7, 88.3) very imbalanced, inefficient; (53.6, 53.6) ½ mix, still inefficient
- Correlated: ([43.7, 88.3], [43.7, 88.3])
- Minimax: (43.7, 43.7)
- Friend: (38.7, 38.7)

Nash equilibria are computationally difficult to find in general.

[Figure: grid world with players A and B and the goal $]
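The slide's minimax value (43.7, 43.7) comes from solving the full stochastic game; the underlying security-level idea can be illustrated on a small matrix game. A minimal sketch, restricted to pure strategies and using a hypothetical row-player payoff matrix loosely inspired by chicken (the real computation optimizes over mixed strategies via linear programming):

```python
# Sketch: pure-strategy security (maximin) level of a payoff matrix.
# The matrix below is hypothetical, not taken from the talk.

def security_level(payoffs):
    """Best payoff the row player can guarantee with a pure strategy,
    assuming the column player responds adversarially."""
    return max(min(row) for row in payoffs)

# Hypothetical row-player payoffs: rows = (dare, yield).
chicken = [[-5, 100],   # dare:  crash vs. win
           [-1,  -1]]   # yield: pay the step cost either way

print(security_level(chicken))  # -> -1 (yielding guarantees at least -1)
```

Extending this to mixed strategies turns the inner minimization into a linear program, which is why minimax values are polynomial-time computable.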
Repeated Games

What if players are allowed to play multiple times?
- Many more equilibrium alternatives (folk theorems)
- Equilibrium strategies can depend on past interactions and can be randomized
- A Nash equilibrium still exists

[Figure: convex hull of the average payoffs in (v1, v2) space]
Nash Equilibria of the Repeated Game

Folk theorems: for any set of average payoffs that is
- strictly enforceable and
- feasible,
there exists an equilibrium strategy profile that achieves those payoffs.

- Mutual advantage strategies: up and to the right of the disagreement point v = (v1, v2)
- Threats: attack strategies punish deviations

[Figure: convex hull of the average payoffs with disagreement point v]
Egalitarian Equilibrium Point

Conceptual drawback of the folk theorems: infinitely many feasible and enforceable strategies.

- Egalitarian point P: maximizes the minimum advantage of the players' rewards over the disagreement point v
- Egalitarian line: the line where both payoffs are equally high above v

[Figure: convex hull of the average payoffs, with the egalitarian line through v and the egalitarian point P]
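The egalitarian criterion can be made concrete with a small sketch: among a finite set of candidate payoff profiles, pick the one maximizing the minimum advantage over the disagreement point. The profiles and disagreement point below are the chicken-game values from the earlier slides; note the true egalitarian point lies on the convex hull and need not coincide with a listed vertex, so this only compares the given candidates:

```python
# Sketch: selecting the egalitarian point from a finite set of feasible
# average-payoff profiles (chicken-game values from the earlier slides).

def egalitarian(profiles, disagreement):
    """Profile maximizing the minimum advantage over the disagreement point."""
    d1, d2 = disagreement
    return max(profiles, key=lambda p: min(p[0] - d1, p[1] - d2))

profiles = [(88.3, 43.7), (43.7, 88.3), (66, 66), (38.7, 38.7), (83.6, 83.6)]
print(egalitarian(profiles, (43.7, 43.7)))  # -> (83.6, 83.6)
```

The imbalanced profiles score a minimum advantage of 0, so the balanced high-payoff profile wins.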
How? (the Short Story Version)

- Compute attack and defense strategies
- Solve two linear programming problems
- The algorithm searches for the point P with the highest egalitarian value

[Figure: convex hull of a hypothetical SG, with the egalitarian line and point P]
Game Representation

Folk theorems can be interpreted computationally:
- Matrix form [Littman & Stone, 2005]
- Stochastic game form [Munoz de Cote & Littman, 2008]

Define a weighted combination value σ_w(p_π), a w-weighted combination of the players' payoffs under profile π. A strategy profile π that achieves σ_w(p_π) can be found by modeling an MDP.
Markov Decision Processes

We use MDPs to model the two players as a single meta-player.
Return: a joint strategy profile that maximizes a weighted combination of the players' payoffs.
- Friend solutions: (R0, π1) = MDP(1), (L0, π2) = MDP(0)
- A weighted solution: (P, π) = MDP(w)

[Figure: convex hull with points R0, L0, P and disagreement point v]
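The meta-player construction can be sketched as value iteration on a scalarized reward: MDP(w) maximizes w·r1 + (1 − w)·r2 over joint actions, so MDP(1) recovers player 1's friend solution and MDP(0) player 2's. The tiny one-state MDP below is hypothetical; in the talk the MDP is induced by the stochastic game itself:

```python
# Sketch: solving MDP(w) for a meta-player that chooses joint actions to
# maximize w*r1 + (1-w)*r2. The one-state MDP below is hypothetical.

GAMMA = 0.95

def solve_weighted_mdp(states, joint_actions, T, R1, R2, w, iters=1000):
    """Value iteration on the scalarized reward w*R1 + (1-w)*R2.

    T[s][a] is a list of (prob, next_state); R1/R2[s][a] are the
    players' immediate rewards for joint action a in state s."""
    def q(s, a, V):
        return (w * R1[s][a] + (1 - w) * R2[s][a]
                + GAMMA * sum(p * V[s2] for p, s2 in T[s][a]))

    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q(s, a, V) for a in joint_actions) for s in states}
    policy = {s: max(joint_actions, key=lambda a: q(s, a, V)) for s in states}
    return V, policy

# Hypothetical one-state game: each joint action favors one player.
states = ["s"]
acts = ["favor1", "favor2"]
T = {"s": {a: [(1.0, "s")] for a in acts}}
R1 = {"s": {"favor1": 1.0, "favor2": 0.0}}
R2 = {"s": {"favor1": 0.0, "favor2": 1.0}}

V, policy = solve_weighted_mdp(states, acts, T, R1, R2, w=1.0)
print(policy["s"])  # -> 'favor1' (MDP(1) is player 1's friend solution)
```

Sweeping w from 0 to 1 traces out the Pareto frontier of the achievable payoff set, which is what the algorithm exploits.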
The Algorithm: FolkEgal(U1, U2, ε)

1. Compute attack1, attack2, defense1, defense2, and the friend solutions R = friend1, L = friend2
2. If R is left of the egalitarian line: P = R
3. Else if L is right of the egalitarian line: P = L
4. Else: P = EgalSearch(L, R, T)
5. Return the egalitarian point P and its strategy profile

[Figure: convex hull of a hypothetical SG showing the three cases P = R, P = L, and the search between L and R]
The Key Subroutine: EgalSearch(L, R, T)

Finds the intersection between the achievable set and the egalitarian line; close to a binary search.
Input:
- a point L (to the left of the egalitarian line)
- a point R (to the right of the egalitarian line)
- a bound T on the number of iterations
Return: the egalitarian point P (with accuracy ε)
Each iteration solves an MDP(w).
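The search can be sketched as a binary search over scalarization weights. Here `solve_mdp(w)` is a caller-supplied stand-in for solving MDP(w) (it returns the payoff pair maximizing w·p1 + (1 − w)·p2), and the weight-update rule is a simplification of the talk's subroutine; the example achievable set uses vertices from the chicken game:

```python
# Sketch of EgalSearch: binary search over weights w for the intersection
# of the achievable payoff set with the egalitarian line. solve_mdp is a
# stand-in for solving MDP(w); the update rule is a simplification.

def egal_search(solve_mdp, disagreement, T):
    """Approximate the egalitarian point in at most T iterations."""
    d1, d2 = disagreement
    lo, hi = 0.0, 1.0                # weights bracketing the egalitarian line
    point = solve_mdp(0.5)
    for _ in range(T):
        w = (lo + hi) / 2
        p1, p2 = point = solve_mdp(w)
        if p1 - d1 < p2 - d2:        # left of the line: weight player 1 more
            lo = w
        else:
            hi = w
    return point

# Hypothetical achievable vertices, taken from the chicken example.
vertices = [(88.3, 43.7), (43.7, 88.3), (83.6, 83.6)]

def solve_mdp(w):
    return max(vertices, key=lambda p: w * p[0] + (1 - w) * p[1])

print(egal_search(solve_mdp, (43.7, 43.7), T=20))  # -> (83.6, 83.6)
```

Each iteration costs one MDP solve, so a polynomial bound T on the number of iterations keeps the whole search polynomial.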
Complexity

- The disagreement point is computed to accuracy ε in time polynomial in 1/(1 − γ), 1/ε, and U_max
- MDPs are solved in polynomial time [Puterman, 1994]
- The algorithm is polynomial iff T is bounded by a polynomial

Result: the running time is polynomial in the discount-factor term 1/(1 − γ), the approximation factor 1/ε, and the magnitude of the largest utility U_max.
Experiments: SG Version of the Prisoner's Dilemma

Algorithm     Agents A, B   Behavior
security-VI   46.5          mutual defection
friend-VI     46            mutual defection
CE-VI         46.5          mutual defection
folkEgal      88.8          mutual cooperation with threat of defection

[Figure: grid world with players A and B and goals $A, $B]
Experiments: Compromise Game

Algorithm     Agents A, B   Behavior
security-VI   0, 0          attacker blocking goal
friend-VI     -20           mutual defection
CE-VI                       suboptimal waiting strategy
folkEgal      78.7          mutual cooperation (w = 0.5) with threat of defection

[Figure: grid world with players A and B and goals $A, $B]
Experiments: Asymmetric Game

Algorithm     Agents A, B   Behavior
security-VI   0, 0          attacker blocking goal
friend-VI     -200          mutual defection
CE-VI         32.1          suboptimal mutual cooperation
folkEgal      37.2          mutual cooperation with threat of defection

[Figure: grid world with players A and B and goals $A, $B, $A]
Thanks for your attention!