Derivative Action Learning in Games
Review of: J. Shamma and G. Arslan, "Dynamic Fictitious Play, Dynamic Gradient Play, and Distributed Convergence to Nash Equilibria," IEEE Transactions on Automatic Control, vol. 50, no. 3, March 2005.
Overview
The authors propose an extension of fictitious play (FP) and gradient play (GP) in which strategy adjustment is a function of both the estimated opponent strategy and its time derivative. They demonstrate that when the learning rules are well calibrated, convergence (or near-convergence) to Nash equilibria and asymptotic stability in the vicinity of equilibria can be achieved in games where static FP and GP fail to do so.
Game Setup
This paper addresses a class of two-player games in which each player selects an action a_i from a finite set at each instance of the game according to his mixed strategy p_i and receives utility U_i(p_1, p_2) equal to his expected payoff plus an entropy bonus for playing a mixed strategy. The purpose of the entropy term is not discussed by the authors, but it may be there to avoid converging to inferior local maxima in the utility function. The actual payoff depends on the combined player actions a_1 and a_2, each randomly selected according to the mixed strategies p_1 and p_2.
(Figure: the entropy function H(p_i), plotted against the probability of selecting a_1 in a two-action strategy space, rewards mixed strategies.)
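To make the setup concrete, the following sketch (my own notation, not the authors' code) assumes the entropy-augmented utility U_i(p_i, p_{-i}) = p_i' M_i p_{-i} + τ H(p_i), with M_i player i's payoff matrix and H the Shannon entropy:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p); it rewards mixed (interior) strategies."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def utility(p_i, p_other, M_i, tau):
    """Expected payoff p_i' M_i p_other plus the entropy bonus tau * H(p_i).

    M_i is player i's payoff matrix; tau > 0 weights the entropy term.
    """
    return p_i @ M_i @ p_other + tau * entropy(p_i)
```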
Empirical Estimation and Best Response
Player i's strategy p_i is, in general, mixed and lies in the simplex in m_i-dimensional space, where m_i is the number of actions available to player i and the simplex vertices correspond to the pure actions. Player i adjusts his strategy by observing his opponent's actions, forming an empirical estimate q_{-i} of the opponent's strategy, and calculating the best mixed strategy in response. The adjusted strategy then directs his next move.
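A minimal sketch of the empirical-frequency bookkeeping, assuming the standard running-average update (names and indexing are mine):

```python
import numpy as np

def update_empirical(q_prev, action_index, k, m):
    """Running average of observed opponent actions.

    q_prev: average of the first k observed actions (length-m simplex vector)
    action_index: the (k+1)-th observed action (0-based index into the m actions)
    """
    v = np.zeros(m)
    v[action_index] = 1.0                 # simplex vertex for the observed action
    return q_prev + (v - q_prev) / (k + 1)
```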
Best Response Function
The best response is defined by the authors to be the mixed strategy that maximizes this entropy-augmented expected payoff. The authors claim (without proof) that, for τ > 0, the maximizing strategy is given by the logit (softmax) function.
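Under the entropy-augmented utility above, the logit best response is the softmax of the expected payoffs scaled by 1/τ; a minimal sketch (function names are mine):

```python
import numpy as np

def logit_best_response(M_i, q_other, tau):
    """Smoothed best response beta_i(q_other) = softmax(M_i q_other / tau).

    tau > 0 keeps the response smooth; as tau -> 0 it approaches the exact
    (possibly discontinuous) best response.
    """
    z = M_i @ q_other / tau
    z -= z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```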
FP in Continuous Time
The remaining discussion of fictitious play is conducted in the continuous-time domain. This allows the authors to describe the system dynamics in terms of smooth differential equations, with player actions identified with their mixed strategies. The discrete-time dynamics are then interpreted as stochastic approximations of solutions to the continuous-time differential equations. This transformation is discussed in [Benaim, Hofbauer and Sorin 2003] and, presumably, in [Benaim and Hirsch 1996], though I have not seen the latter myself.
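Concretely, the continuous-time FP dynamics take the standard best-response form below (a reconstruction consistent with the definitions above, since the slide's own equations did not survive extraction):

```latex
\dot{q}_1 = \beta_1(q_2) - q_1, \qquad \dot{q}_2 = \beta_2(q_1) - q_2
```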
Achieving Nash Equilibrium
Nash equilibria are reached at fixed points of the Best Response function. Convergence to fixed points occurs as the empirical frequency estimates converge to the actual strategies played.
Derivative Action FP (DAFP): Idealized Case – Exact DAFP
Exact DAFP uses a directly measured first-order forecast of the opponent's strategy, in addition to the observed empirical frequency, to calculate the best response.
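In this notation, exact DAFP replaces the static argument of the best response with a first-order forecast (a reconstruction of the slide's missing equation, with λ denoting the derivative gain):

```latex
p_i = \beta_i\!\left(q_{-i} + \lambda\,\dot{q}_{-i}\right), \qquad \dot{q}_i = p_i - q_i
```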
Derivative Action FP (DAFP): Approximate DAFP
Approximate DAFP uses an estimated first-order forecast of the opponent's strategy, in addition to the observed empirical frequency, to calculate the best response.
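A sketch of how the approximate version could be simulated, assuming a first-order (washout-filter) differentiator for the derivative estimate; the filter form is my assumption and the paper's construction may differ in detail:

```python
import numpy as np

def approx_dafp_rhs(q1, q2, r1, r2, M1, M2, tau, lam, eps):
    """Right-hand side of an Approximate DAFP ODE (a sketch, not the paper's code).

    r_i is an auxiliary filter state; (q_i - r_i)/eps serves as a first-order
    estimate of dq_i/dt (washout filter with time constant eps).
    """
    def beta(M, q):                       # logit best response, as defined earlier
        z = M @ q / tau
        z -= z.max()
        e = np.exp(z)
        return e / e.sum()

    d1 = (q1 - r1) / eps                  # estimate of dq1/dt
    d2 = (q2 - r2) / eps                  # estimate of dq2/dt
    dq1 = beta(M1, q2 + lam * d2) - q1    # best response to forecast of opponent
    dq2 = beta(M2, q1 + lam * d1) - q2
    dr1, dr2 = d1, d2                     # filter dynamics: dr/dt = (q - r)/eps
    return dq1, dq2, dr1, dr2
```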
Exact DAFP, Special Case: λ = 1
System inversion – each player seeks to play the best response against the opponent's current strategy: with λ = 1 the forecast q_{-i} + dq_{-i}/dt equals the opponent's instantaneous strategy p_{-i}, since dq_{-i}/dt = p_{-i} - q_{-i} under the empirical-frequency dynamics.
Convergence with Exact DAFP in the Special Case (λ = 1)
Convergence with Noisy Exact DAFP in the Special Case (λ = 1)
Suppose that each player can measure the derivative of the opponent's empirical frequencies only to within some error (e_1, e_2). The authors prove that for any arbitrarily small ε > 0, there exists a δ > 0 such that if the measurement error (e_1, e_2) eventually remains within a δ-neighborhood of the origin, then the empirical frequencies (q_1, q_2) will eventually remain within an ε-neighborhood of a Nash equilibrium. This suggests that, if a sufficiently accurate approximation of the empirical-frequency derivative can be constructed, Approximate DAFP will converge to an arbitrary neighborhood of the Nash equilibria.
Convergence with Approximate DAFP in the Special Case (λ = 1)
Convergence with Approximate DAFP in the Special Case (λ = 1) (CONTINUED)
Simulation Demonstration: Shapley Game
Consider the 2-player 3×3 game invented by Lloyd Shapley to show that fictitious play does not converge in general.
(Figures: standard FP in discrete time (top) and continuous time (bottom).)
Simulation Demonstration: Shapley Game
(Figures: Shapley game with Approximate DAFP in continuous time, with increasing derivative gain λ: 1 (top), 10 (middle), 100 (bottom).)
Another interesting observation here is that the players enter a correlated equilibrium, and their average payoff is higher than the expected Nash payoff. For the "modified" game, where the player utility matrices are not identical, the strategies converge to theoretically unsupported values, illustrating a violation of the weak continuity requirement from the Approximate DAFP convergence result. This steady-state error can be corrected by setting the derivative gain according to the linearization/Routh-Hurwitz procedure noted earlier.
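For readers who want to reproduce the qualitative behavior, here is a self-contained sketch that Euler-integrates Approximate DAFP on a common textbook form of Shapley's game; the payoff matrices, smoothing parameter, filter, and step sizes are illustrative choices of mine, not taken from the paper:

```python
import numpy as np

# A common form of Shapley's 3x3 game (the paper may use a relabeled or
# modified variant): no pure Nash equilibrium, and standard FP cycles.
M1 = np.array([[0., 1., 0.],
               [0., 0., 1.],
               [1., 0., 0.]])
M2 = np.array([[0., 0., 1.],
               [1., 0., 0.],
               [0., 1., 0.]])

def beta(M, q, tau=0.1):
    """Logit (softmax) best response with smoothing parameter tau."""
    z = M @ q / tau
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def simulate(lam=10.0, eps=0.05, dt=0.01, T=200.0, tau=0.1):
    """Euler integration of Approximate DAFP on the Shapley game.

    lam: derivative gain; eps: differentiator time constant (both illustrative).
    Returns the final empirical frequencies (q1, q2).
    """
    q1 = np.array([0.8, 0.1, 0.1])
    q2 = np.array([0.1, 0.8, 0.1])
    r1, r2 = q1.copy(), q2.copy()        # washout-filter states
    for _ in range(int(T / dt)):
        d1, d2 = (q1 - r1) / eps, (q2 - r2) / eps   # derivative estimates
        dq1 = beta(M1, q2 + lam * d2, tau) - q1
        dq2 = beta(M2, q1 + lam * d1, tau) - q2
        q1, q2 = q1 + dt * dq1, q2 + dt * dq2
        r1, r2 = r1 + dt * d1, r2 + dt * d2
    return q1, q2

if __name__ == "__main__":
    for lam in (1.0, 10.0, 100.0):       # mirrors the increasing-gain comparison
        print(lam, simulate(lam=lam))
```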
GP Review for 2-player Games
Gradient Play: player i adjusts his strategy by taking his own empirical action frequency, adding the gradient of his utility (determined by his opponent's empirical action frequency), and projecting the result back onto the strategy simplex.
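A minimal sketch of one gradient-play adjustment, assuming a Euclidean projection onto the simplex; the step size and the exact placement of the projection are my assumptions, not copied from the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def gp_strategy(q_own, q_other, M_own, step=1.0):
    """Gradient play: move own empirical frequency along the payoff gradient
    M_own @ q_other, then project back onto the simplex."""
    return project_simplex(q_own + step * (M_own @ q_other))
```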
GP in Discrete Time
GP in Continuous Time
Achieving Nash Equilibrium with Gradient Play
Derivative Action Gradient Play
Standard GP cannot converge asymptotically to completely mixed Nash equilibria because the linearized dynamics are unstable at mixed equilibria. Exact DAGP always enables asymptotic stability at mixed equilibria with proper selection of the derivative gain. Under some conditions, Approximate DAGP also enables asymptotic stability near mixed equilibria. Approximate DAGP always ensures asymptotic stability in the vicinity of strict equilibria.
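By analogy with DAFP, derivative action enters GP through the same first-order forecast of the opponent's empirical frequency; a plausible sketch of the continuous-time dynamics (my reconstruction, with Π denoting projection onto the simplex; the paper should be consulted for the exact equations):

```latex
\dot{q}_i = \Pi\!\left[\, q_i + M_i\!\left(q_{-i} + \lambda\,\dot{q}_{-i}\right) \right] - q_i
```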
DAGP Simulation: Modified Shapley Game
Multiplayer Games
Consider the 3-player Jordan game, whose unique Nash equilibrium is completely mixed. The authors demonstrate that DAGP converges to this mixed Nash equilibrium.
Jordan Game Demonstration