
1 Learning Games Presented by: Aggelos Papazissis Alexandros Papapostolou

2 Subject
Correlated Equilibria
Learning, Mutation & Long-Run Equilibria
Rational Learning Leading to NE
Dynamic Fictitious Play & Dynamic Gradient Play

3 Computing Correlated Equilibria in Multi-Player Games (Christos Papadimitriou, Tim Roughgarden)
Nash Equilibrium (N.E.) is the standard notion of rationality in game theory.
Correlated Equilibrium (C.E.) is another, competing notion of rationality, more general than N.E.
Advantages of C.E. vs. N.E.:
1. It is guaranteed to always exist.
2. It arises from simple and natural dynamics, in a sense that the N.E. does not.
3. It can be found in polynomial time, for any number of players and strategies, by linear programming.
4. It allows optimizing (e.g., for Pareto optimality) within the set of correlated equilibria.
Complexity by game representation:
Representation | Size O(...) | Pure NE | Mixed NE | CE | Optimal CE
Normal form | n·s^n | linear | PPAD-complete | P | P
Graphical | n·s^(d+1) | NP-complete | PPAD-complete | P | NP-hard
Symmetric | | NP-complete | PPAD-complete | P | P
Anonymous | | NP-hard | | P | P
Polymatrix | n^2·s^2 | | PPAD-complete | P | NP-hard
Circuit | | NP-complete | | |
Congestion | | PLS-complete | | P | NP-hard

4 The Idea of Correlated Equilibria
While a mixed N.E. is a distribution on the strategy space that is "uncorrelated" (a product of independent distributions), a C.E. is a general distribution over strategy profiles.
Each player chooses his action according to his observation of the value of a common public signal.
A strategy assigns an action to every possible observation a player can make.
If no player would want to deviate from the recommended strategy (assuming the others obey rather than deviate), because it is the best in expectation, the distribution is called a correlated equilibrium.

5 An Example of C.E.: Dare / Chicken Out
Payoff matrix (row player's payoff listed first):
    D     C
D   0, 0  7, 2
C   2, 7  6, 6
Five distributions that are C.E.:
1. Pure N.E. (D, C): all probability on (D, C).
2. Pure N.E. (C, D): all probability on (C, D).
3. Mixed N.E.: each player plays D with probability 1/3 and C with probability 2/3, giving expected payoffs of about (4.67, 4.67).
4. "Traffic light": probability 1/2 on each of (D, C) and (C, D), giving (4.5, 4.5).
5. A third party draws one of three cards, (C, C), (D, C), (C, D), each with probability 1/3, and privately recommends to each player his own component.
Checking distribution 5: a player told D knows the other was told C, so he does not deviate. A player told C knows the other plays C with probability 1/2 and D with probability 1/2; daring gives 0(1/2) + 7(1/2) = 3.5 while chickening out gives 2(1/2) + 6(1/2) = 4, so he obeys and plays C. Nobody wants to change strategy, so this is a C.E.
Expected payoff = 7(1/3) + 2(1/3) + 6(1/3) = 5, better than the mixed N.E.
(Example by Iskander Karibzhanov.)
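The obedience conditions above are linear in the joint distribution, so an optimal (welfare-maximizing) C.E. of this game can be computed with an off-the-shelf LP solver. Below is a minimal sketch, assuming scipy is available; the variable names and setup are illustrative, not from the slides. It should report p(D,C) = p(C,D) = 1/4 and p(C,C) = 1/2, for a total payoff of 10.5 (5.25 per player).

```python
# Sketch: welfare-maximizing correlated equilibrium of the Dare/Chicken game
# via linear programming. Variables are the probabilities of the four joint
# actions; constraints say no player gains by deviating from a recommendation.
import numpy as np
from scipy.optimize import linprog

actions = ["D", "C"]
u1 = np.array([[0, 7], [2, 6]])   # row player's payoffs u1[a1][a2]
u2 = np.array([[0, 2], [7, 6]])   # column player's payoffs u2[a1][a2]

profiles = [(i, j) for i in range(2) for j in range(2)]   # joint actions = LP variables

A_ub, b_ub = [], []
# Obedience constraints for player 1: for recommended action i and deviation idev,
# sum_j p(i, j) * (u1[idev, j] - u1[i, j]) <= 0
for i in range(2):
    for idev in range(2):
        if idev != i:
            A_ub.append([u1[idev, j] - u1[i, j] if a == i else 0.0 for (a, j) in profiles])
            b_ub.append(0.0)
# Obedience constraints for player 2 (recommended column j, deviation jdev)
for j in range(2):
    for jdev in range(2):
        if jdev != j:
            A_ub.append([u2[i, jdev] - u2[i, j] if b == j else 0.0 for (i, b) in profiles])
            b_ub.append(0.0)

A_eq, b_eq = [[1.0] * 4], [1.0]                         # probabilities sum to 1
c = [-(u1[i, j] + u2[i, j]) for (i, j) in profiles]     # maximize total payoff

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
for (i, j), p in zip(profiles, res.x):
    print(f"p({actions[i]},{actions[j]}) = {p:.3f}")
print("total expected payoff:", -res.fun)
```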

6 Computing Correlated Equilibria (pmp's)
Existence proof: every game has a CE. It follows from linear programming duality: the dual (D) of the CE feasibility program (P) is shown to be infeasible.
Constructive proof: apply the ellipsoid algorithm to the dual, which has polynomially many variables and exponentially many constraints, so as to reduce the number of dual constraints to a polynomial number; that number suffices to compute CE's, and furthermore an optimal CE.
At each step of the ellipsoid algorithm we find violated convex combinations of the constraints of (D) by using Markov chain computations.
At the conclusion of the algorithm, we have a polynomial number of such combinations (the cuts of the ellipsoid algorithm) that are themselves infeasible; call this system (D').
Solving the dual of this new linear program, (P'), gives the required CE as a polynomial mixture of products (pmp's).
To optimize the CE, we reduce the dimensionality of the linear program: for each player, the strategy profiles of all opponents are divided into equivalence classes within which the player's utility depends only on his own choice.
Main result: every game has a CE that is a convex combination of polynomially many product distributions (polynomial in n players and m strategies).

7 Computing Optimal Correlated Equilibria
The algorithm provides a refinement of the polynomial CE scheme that is guaranteed to sample the strategy space of the given game G according to an optimal CE.
It can be formulated as a linear program, the dual of which can be solved in polynomial time via the ellipsoid method.
The ellipsoid method will only generate a polynomial number of dual constraints during its execution, and these constraints suffice to compute an optimal dual solution.
The primal variables (strategy profiles) that correspond to these dual constraints then suffice to solve the primal problem optimally.
Since this "reduced primal" has a polynomial number of variables and constraints, it can be solved in polynomial time, yielding an optimal CE.

8 Learning, Mutation, and Long Run Equilibria in Games
Application to 2x2 symmetric games.
Natural selection of long-run equilibria through learning, bounded rationality, and mutation.
Three hypotheses:
1. Inertia hypothesis: not all agents need to react instantaneously to their environment.
2. Myopia hypothesis: players react myopically.
3. Mutation hypothesis: some agents may change their strategies at random.

9 A Symmetric 2x2 Game
Payoff matrix (player I's strategy in rows, player II's in columns):
     s1    s2
s1   2, 2  0, 0
s2   0, 0  1, 1
Choice of computer system in a small community: ten students, each using one of the two systems s1, s2. They randomly meet each other, and when two are on the same system they collaborate; s1 is superior to s2.
Two N.E.: E1 = (s1, s1) and E2 = (s2, s2).
Path dependence: if at least 4 students are initially using system s1, then all students eventually end up using s1.
Mutation: a student leaves, and the newcomer chooses s1 with a probability given by the share of s1 users in the outside world. Thus, with mutations, the process can move between the two equilibria E1 and E2 (Darwinian adjustments).
The resulting process is a Markov chain over {E1, E2} with switching probabilities p and p'; its stationary distribution is (p'/(p+p'), p/(p+p')), which tends to (1, 0) as the mutation rate ε → 0 (assumed small).
E1, the Pareto-dominant equilibrium, is therefore the long-run equilibrium: it takes on the order of 100,000 periods to be upset, while E2 takes about 78.
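The "mutations select the Pareto-dominant equilibrium" claim can be made concrete with a small simulation. The sketch below uses an illustrative revision protocol and mutation rate of my own choosing (the paper's model is richer): each period one random student revises, best-responds to the others, and with small probability mutates. Starting from E2, the population should spend the bulk of its time at E1.

```python
# Sketch: best-response revision with rare mutations in the 10-student game
# (payoff 2 for an s1-s1 match, 1 for s2-s2, 0 for a mismatch).
import random

N = 10          # students
EPS = 0.05      # mutation probability (assumed small)
T = 200_000     # simulated periods

def best_response(others_on_s1, n_others):
    share = others_on_s1 / n_others
    # expected payoff: 2 * share for s1 vs. 1 * (1 - share) for s2
    return 1 if 2 * share >= 1 - share else 2

k = 0           # number of s1 users; start at the inferior equilibrium E2
time_at_E1 = 0
for _ in range(T):
    revising_uses_s1 = random.random() < k / N      # pick a random student to revise
    others_on_s1 = k - (1 if revising_uses_s1 else 0)
    choice = best_response(others_on_s1, N - 1)
    if random.random() < EPS:                        # mutation: pick a system at random
        choice = random.choice([1, 2])
    k += (1 if choice == 1 else 0) - (1 if revising_uses_s1 else 0)
    time_at_E1 += (k == N)

print(f"fraction of periods at E1 (everyone on s1): {time_at_E1 / T:.3f}")
```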

10 Coordination Games
Consider a coordination game with states (stages) 0 through 6 and a critical level z*.
The basin of attraction of stage 6 is {3, 4, 5, 6}; the basin of stage 0 is {0, 1, 2}.
The least-cost 6-tree requires 3 mutations, while the least-cost 0-tree requires 4 mutations.
Conclusion: stage 6, which has the larger basin of attraction, achieves the minimum cost among all states. The equilibrium with the largest basin of attraction is always selected in the long run.
Gradual escape paths (also 4 mutations in total) are less likely than immediate jumps; thus, upsetting equilibria by large jumps is a natural consequence of independent mutations.

11 Rational Learning Leads to NE
Nash equilibrium is the central concept, but for games played only once the process that leads to it is not fully understood.
Repeated interaction gives enough time to observe opponents' behavior, so a statistical learning theory leading to NE can be obtained.
Setting: repeated play among a small number of subjectively rational agents.
Shortcomings of 'myopic' behavior of a player:
1. It ignores the fact that his opponents also use a dynamic learning process.
2. He would not perform a costly experiment.
3. It ignores strategic considerations regarding the future.
Theoretic approach to overcome these flaws:
1. standard perfect-monitoring infinitely repeated game with discounting
2. fixed payoff matrix for every action combination of the group of players
3. a discount factor used to evaluate future payoffs
4. each player maximizes the present value of total expected payoff
5. players need not have any information about opponents' payoff matrices nor about their rationality

12 Rational Learning Leads to NE
Main message of the paper: if players start with a vector of subjectively rational strategies, and if their individual subjective beliefs regarding opponents' actions are compatible with the truly chosen strategies, then they must converge in finite time to play according to an ε-NE (for small ε). This means they will learn to predict the future play of the game and will play an ε-NE.
Assumptions of the model:
1. Players' objective is to maximize their long-term expected discounted payoff; learning is not a goal in itself but rather a consequence of the overall plans.
2. Learning occurs through Bayesian updating of individual prior beliefs.
3. Players are not required to have full knowledge of each other's strategies, nor known prior distributions on the parameters of the game.
4. Independence of strategies.
Learning in myopic theories may never converge, or may converge to false beliefs; dynamic approaches have the potential to overcome this difficulty.
Experimentation is evaluated according to its long-run contribution to expected utility; a player chooses randomly when and how to experiment, and optimal experimentation leads to an individually optimal solution.

13 Examples and Elaboration (1): Infinitely Repeated Prisoner's Dilemma
Stage game between PI (he) and PII (she), each choosing A (aggressive) or C (cooperate). PI's payoffs, with a > b > c > d: (A,A) = c, (A,C) = a, (C,A) = d, (C,C) = b; PII's payoffs are symmetric.
- Discount parameter λ1 (0 < λ1 < 1) is used to evaluate infinite streams of payoff.
- Knowledge of own parameters and perfect monitoring.
- Prior subjective beliefs are probability distributions over the opponent's strategies.
- Set of pure strategies g_t, t = 0, 1, 2, ..., ∞, with g_∞ the grim trigger strategy. If not triggered earlier, g_t prescribes unprovoked aggression (A) from time t on.
- PI believes that PII is likely to cooperate by playing her grim trigger strategy, but he also assigns positive probability to her stopping cooperation earlier for some reason: he assigns her strategies g_0, g_1, ..., g_∞ probabilities β = (β_0, β_1, ..., β_∞) that sum to 1 with β_t > 0.
- His best response is a strategy of the form g_T1 for some T1 = 0, 1, ..., ∞.
- PII holds similar beliefs (a vector α) about PI's strategy and chooses a strategy g_T2 as her best response.
- The game will then be played according to the two strategies (g_T1, g_T2).
- Learning to predict the future play must occur, but players' beliefs only converge to the truth as time goes to infinity.
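The learning in this example is plain Bayesian updating over the countable family g_0, g_1, ..., g_∞: observing PII cooperate at period t is inconsistent with every g_s for s ≤ t, so those strategies lose all weight and the remaining weights are renormalized. A minimal sketch, with a truncated belief vector and an illustrative prior of my own (not from the paper):

```python
# Sketch: Bayesian updating of PI's beliefs beta over PII's strategies
# g_0, ..., g_{H-1}, g_inf, where g_s plays A unprovoked from period s on
# and g_inf is the grim trigger strategy.
H = 20                                    # truncation horizon (illustrative)
beliefs = [0.02] * H + [1.0 - 0.02 * H]   # beta_0..beta_{H-1}, beta_inf; sums to 1

def update_on_cooperation(beta, t):
    """Condition on: PII still cooperated at period t (no earlier trigger)."""
    post = [0.0 if s <= t else b for s, b in enumerate(beta[:H])] + [beta[H]]
    total = sum(post)
    return [b / total for b in post]

for t in range(5):                        # PII cooperates in periods 0..4
    beliefs = update_on_cooperation(beliefs, t)
print("P(grim trigger) after 5 cooperative periods:", round(beliefs[-1], 3))
```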

14 Examples and Elaboration (2): Learning and Teaching
Two-person infinite symmetric version of the "Chicken Game": learning is not a passive state; optimizers who believe their opponents are open to learning may find it worthwhile to act as teachers.
Play is simultaneous, starting at the beginning and with perfect monitoring after every history; each period, each player chooses to yield (Y) or insist (I).
Stage-game payoffs (PI's action in rows, PII's in columns):
    Y       I
Y   0, 0    1, 2
I   2, 1   -1, -1
S_t strategy: the one that prescribes yielding for the first time at time t.
There are no dominant strategies, and there are two symmetric pure-strategy NE: (S_0, S_∞), where he yields immediately and she insists forever, and (S_∞, S_0).
Vectors α and β describe PII's and PI's beliefs, respectively.
PI may think (putting himself partially in her shoes) that she is equally likely to wait any number of the first n periods before yielding, to see if he yields first, or that she may insist forever with probability ε because (unlike him) she assigns a very large loss to ever yielding.
If the future is important enough to PI, his best response to β is to wait n periods in case she yields first and, if she does not, to yield himself at time n+1.
As long as she is willing to find out about him, he will try to convince her by his actions that he is tough.
If both players adopt such reasoning, a pair of strategies (S_T1, S_T2) will be chosen:
- If T1 = 0 and T2 > 0, there was no attempt to teach on the part of PI and the play is in some NE.
- If 0 < T1 < T2, PI failed to teach her and the play is not a NE.
- If T1 > T2, the play is again not a NE, and he wins.

15 Main Results
Theorem 1. Let f and f^i be the n-vectors of the strategies actually chosen and of the beliefs of player i, respectively. If f is absolutely continuous with respect to f^i, then for every ε > 0 and for almost every play path z there is a time T such that, for all t ≥ T, f_{z(t)} plays ε-like f^i_{z(t)}.
1. This theorem states that if the vector of strategies actually chosen is absolutely continuous with respect to player i's beliefs, then player i will learn to accurately predict the future play of the game.
2. The real probability of any future history cannot differ from the beliefs of player i by more than ε.
Theorem 2. Let f and f^1, f^2, ..., f^n be, respectively, the strategy vector actually played and the beliefs of the players. Suppose that for every player i: 1) f_i is a best response to f^i_{-i}; 2) f is absolutely continuous with respect to f^i. Then for every ε > 0 and for almost all play paths z there is a time T = T(z, ε) such that, for every t ≥ T, there exists an ε-equilibrium f^ of the repeated game satisfying that f_{z(t)} plays ε-like f^.
1. In other words, given any ε > 0, with probability 1 there will be some time T after which the players play ε-like an ε-NE.
2. If utility-maximizing players start with individual subjective beliefs, then in the long run their behavior must be essentially the same as behavior described by an ε-NE.

16 Dynamic Fictitious Play (intro)
Best-response strategy: a player jumps to the best response to the empirical frequencies (the running average) of the opponent's actions.
Continuous-time form of a repeated game in which players continually update strategies in response to observations of opponent actions, without knowledge of opponent intentions.
A fictitious player does not have the ability to be forward looking, to accurately anticipate her opponents' play, or to understand how her behavioral rule will affect her opponents' responses.
Primary objective: how interacting players could converge to a NE.
By playing the best response to the observed empirical frequencies, an optimizing player facing a fixed-strategy opponent eventually converges to its own optimal response to that opponent. If both players presume that the other is using a constant strategy, their update mechanisms become intertwined: this is FICTITIOUS PLAY (FP).
The repeated game would be in equilibrium if the empirical frequencies converged. Will repeated play converge to a NE?
Results that establish convergence of FP:
- 2-player zero-sum games
- 2-player, 2-move games
- noisy 2-player, 2-move games with a unique NE
- noisy 2-player, 2-move games with countably many NE
- 2-player games where one player has only 2 moves
Empirical frequencies need not converge:
- Shapley game (2 players, 3 moves each)
- Jordan counterexample (3 players, 2 moves each)
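Standard discrete-time FP is simple to state in code: each period every player best-responds to the running empirical frequency of the opponent's past moves. A sketch for a generic bimatrix game, using the Dare/Chicken payoffs from slide 5 as the example (initial counts and tie-breaking are illustrative choices):

```python
# Sketch: standard discrete-time fictitious play for a 2-player bimatrix game.
import numpy as np

def fictitious_play(M1, M2, T=10_000):
    """M1, M2 are indexed [own action, opponent action]; returns empirical frequencies."""
    counts1 = np.ones(M1.shape[0])        # fictitious initial action counts
    counts2 = np.ones(M2.shape[0])
    for _ in range(T):
        q1, q2 = counts1 / counts1.sum(), counts2 / counts2.sum()
        a1 = int(np.argmax(M1 @ q2))      # best response to opponent's empirical frequency
        a2 = int(np.argmax(M2 @ q1))
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Dare/Chicken game from slide 5; the game is symmetric, so both matrices coincide.
M = np.array([[0.0, 7.0], [2.0, 6.0]])
q1, q2 = fictitious_play(M, M)
# Being a 2-player, 2-move game, FP converges here: the frequencies should end up
# near the mixed NE (Dare with probability about 1/3).
print("empirical frequencies:", q1.round(3), q2.round(3))
```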

17 Dynamic Fictitious Play (FP setup)
STATIC GAME: each player P_i selects a strategy and receives a real-valued reward according to the utility function u_i(p_i, p_-i):
u_1(p_1, p_2) = p_1^T M_1 p_2 + τH(p_1)
u_2(p_2, p_1) = p_2^T M_2 p_1 + τH(p_2)
where τH(p_i) is the weighted entropy of the strategy.
u_i(p_i, p*_-i) ≤ u_i(p*_i, p*_-i);  p*_i = β_i(p*_-i) is the best response.
FP: the strategy of player P_i at time k is the optimal response to the running average of the opponent's actions: p_i(k) = β_i(q_-i(k)).
DYNAMIC FP: derivative action FP can in some cases lead to behaviors converging to NE in previously non-convergent situations.
Continuous-time FP:
q'_1(t) = β_1(q_2(t)) − q_1(t)
q'_2(t) = β_2(q_1(t)) − q_2(t)
Each player's strategy is a best response to a combination of empirical frequencies and a weighted derivative of empirical frequencies: p_i(t) = β_i(q_-i(t) + γ q'_-i(t)).
Exact derivative action FP (exact DAFP):
q'_1 = β_1(q_2 + γ q'_2) − q_1
q'_2 = β_2(q_1 + γ q'_1) − q_2
Approximate derivative action FP (approximate DAFP), with λ > 0:
q'_1 = β_1(q_2 + γλ(q_2 − r_2)) − q_1
q'_2 = β_2(q_1 + γλ(q_1 − r_1)) − q_2
Noisy derivative measurements:
q'_1 = β_1(q_2 + q'_2 + e_2) − q_1
q'_2 = β_2(q_1 + q'_1 + e_1) − q_2
With noisy measurements, the empirical frequencies converge to a neighborhood of the set of Nash equilibria (whose size depends on the accuracy of the derivative measurements).
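For τ > 0 the entropy-weighted utility above makes the best response smooth and available in closed form: maximizing p^T M q + τH(p) over the simplex yields the logit (softmax) map. A small sketch of this smoothed best response (the matrix and τ value are illustrative):

```python
# Sketch: the entropy-perturbed ("smoothed") best response beta_i used above.
# argmax_p  p^T M q + tau * H(p)  over the simplex equals softmax((M q) / tau).
import numpy as np

def smoothed_best_response(M, q, tau=0.1):
    z = (M @ q) / tau
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

M = np.array([[0.0, 7.0], [2.0, 6.0]])                   # row payoffs from slide 5
print(smoothed_best_response(M, np.array([0.5, 0.5])))   # close to the pure BR as tau -> 0
```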

18 Shapley "Fashion" Game (1)
2 players, each with 3 moves: {Red, Green, Blue}.
Player 1: the fashion leader, wants to differ from Player 2.
Player 2: the fashion follower, wants to copy Player 1.
Key assumption: players do not announce preferences.
Daily routine:
– Play game
– Observe actions
– Update strategies

19 Shapley "Fashion" Game (2)
With derivative action, the empirical frequencies approach the (unique) NE.
With approximate derivative measurements, the empirical frequencies can converge to wrong values.
As λ increases, the oscillations are progressively reduced.
Derivative action FP can be locally convergent when standard FP is not convergent.

20 Dynamic Gradient Play
Better-response strategy: a player adjusts his current strategy in a gradient direction suggested by the empirical frequencies of the opponent.
Utility function: u_i(p_i, p_-i) = p_i^T M_i p_-i.
Strategy of each player: p_i(t) = Π_Δ[q_i(t) + M_i q_-i(t)], a combination of the player's own empirical frequency and a projected gradient step using the opponent's empirical frequencies.
q'_1(t) = Π_Δ[q_1(t) + M_1 q_2(t)] − q_1(t)
q'_2(t) = Π_Δ[q_2(t) + M_2 q_1(t)] − q_2(t)
The equilibrium points of continuous-time GP are precisely the NE, but gradient-based evolution cannot converge to a completely mixed NE.
Derivative action GP (DAGP):
q'_1 = Π_Δ[q_1 + M_1(q_2 + γ q'_2)] − q_1
q'_2 = Π_Δ[q_2 + M_2(q_1 + γ q'_1)] − q_2
In the (ideal) case of exact DAGP there always exists a γ such that a completely mixed NE is locally asymptotically stable; stability under approximate DAGP may or may not be achieved.
1. Completely mixed NE: never asymptotically stable under standard GP, but derivative action can enable convergence.
2. Strict NE: approximate DAGP always results in locally stable behavior near a (strict) NE.
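The gradient-play update requires the Euclidean projection Π_Δ onto the probability simplex. Below is a sketch of that projection together with an Euler discretization of the continuous-time GP dynamics above; the step size and the example game (the coordination game of slide 9, which has strict NE) are illustrative assumptions.

```python
# Sketch: continuous-time gradient play, discretized with a small Euler step.
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def gradient_play(M1, M2, steps=20_000, dt=0.01):
    q1 = np.full(M1.shape[0], 1.0 / M1.shape[0])   # initial empirical frequencies
    q2 = np.full(M2.shape[0], 1.0 / M2.shape[0])
    for _ in range(steps):
        p1 = simplex_projection(q1 + M1 @ q2)      # projected-gradient strategy of player 1
        p2 = simplex_projection(q2 + M2 @ q1)
        q1 = q1 + dt * (p1 - q1)                   # q_i' = p_i - q_i  (Euler step)
        q2 = q2 + dt * (p2 - q2)
    return q1, q2

# Coordination game of slide 9 (matrices indexed [own action, opponent action]);
# the frequencies should approach the strict NE E1 = (s1, s1).
M = np.array([[2.0, 0.0], [0.0, 1.0]])
print(gradient_play(M, M))
```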

21 Dynamic Gradient Play
With derivative action, the empirical frequencies converge to the completely mixed NE.

22 Jordan Anti-Coordination Game
3 players, each with 2 moves: {Left, Right}.
Player 1 wants to differ from Player 2; Player 2 wants to differ from Player 3; Player 3 wants to differ from Player 1.
Players do not announce preferences.
Daily routine:
– Play game
– Observe actions
– Update strategies
Standard FP does not converge.
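The non-convergence claim is easy to probe numerically. The sketch below is my own encoding of the anti-coordination payoffs (payoff 1 for mismatching the targeted neighbour, 0 otherwise; asymmetric initial counts are used because a perfectly symmetric start is degenerate). It prints each player's empirical frequency of Left at a few checkpoints; under standard FP these snapshots should keep drifting instead of settling at the mixed NE (1/2, 1/2, 1/2).

```python
# Sketch: standard fictitious play on the 3-player Jordan anti-coordination game.
# Player i wants to choose a move different from player i+1 (mod 3).
import numpy as np

T = 30_000
counts = np.array([[3.0, 1.0],      # counts[i] = [# Left, # Right] played by player i
                   [1.0, 3.0],      # asymmetric initial beliefs (illustrative)
                   [2.0, 2.0]])
for t in range(1, T + 1):
    freqs = counts / counts.sum(axis=1, keepdims=True)
    actions = []
    for i in range(3):
        target = freqs[(i + 1) % 3]                   # the player that i wants to mismatch
        # expected payoff of Left = P(target plays Right); of Right = P(target plays Left)
        actions.append(0 if target[1] >= target[0] else 1)
    for i, a in enumerate(actions):
        counts[i, a] += 1
    if t % 5_000 == 0:
        print(t, (counts[:, 0] / counts.sum(axis=1)).round(3))
```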

