1 Learning in Games Georgios Piliouras

2 Games (i.e. multi-body interactions): interacting entities, pursuing their own goals, with no centralized control. Can we predict their behavior?

3 Games (review): n players; a set of strategies S_i for each player i; possible states (strategy profiles) S = ×_i S_i; utility functions u_i : S → R; social welfare Q : S → R. Extend to allow probabilities Δ(S_i), Δ(S), with u_i(Δ(S)) = E[u_i(S)] and Q(Δ(S)) = E[Q(S)].

4 Zero-Sum Games & Equilibria (review). Rock-Paper-Scissors payoffs (row player, column player):

              Rock      Paper     Scissors
  Rock        0, 0      -1, 1     1, -1
  Paper       1, -1     0, 0      -1, 1
  Scissors    -1, 1     1, -1     0, 0

Nash: a product of mixed strategies s.t. no player has a profitable deviating strategy. Here: each player plays each action with probability 1/3.

5 Why do we study Nash eq? Nash eq. have a simple, intuitive definition. Nash eq. are applicable to all games. In some classes of games, Nash eq. is a reasonably good predictor of rational self-interested behavior (e.g. zero-sum games). Even in general games, Nash eq. analysis seems like a natural, albeit optimistic, first step in understanding rational behavior.

6 Why is it optimistic? Nash eq. analysis presumes that agents can resolve issues regarding: Convergence: agent behavior will converge to a Nash. Coordination: if there are many Nash eq., agents can coordinate on one of them. Communication: agents are fully aware of each other's utilities/rationality. Complexity: computing a Nash can be hard even from a centralized perspective.

7 Today: Learning in Games. Agent behavior is an online learning algorithm/dynamic. Input: current state of the environment/other agents (+ history). Output: chosen (randomized) action. Analyze the evolution of systems of coupled dynamics, as a way to predict interacting agent behavior. Advantages: weaker assumptions; if the dynamic converges, it converges to a Nash equilibrium (though it may not converge). Disadvantages: harder to analyze.

8 Today: Learning in Games. Agent behavior is an online learning algorithm/dynamic. Input: current state of the environment/other agents (+ history). Output: chosen (randomized) action. Class 1: best (better) response dynamics. Class 2: no-regret dynamics (e.g. the Weighted Majority/Hedge dynamic).

9 Best Response Dynamics (BR). Start from an arbitrary state s ∈ S. Choose an arbitrary agent i. Agent i deviates to a best (better) response given the strategies of the others. Repeat. Advantages: simple, widely applicable. Disadvantages: no intelligence/learning. Does this work? (A sketch of the procedure follows below.)
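A minimal sketch of this procedure, assuming the game is given by explicit utility tables; the helper names (strategies, utility, best_response_dynamics) are illustrative, not from the slides:

```python
import random

def best_response_dynamics(strategies, utility, max_steps=10_000, seed=0):
    """strategies[i]: list of pure actions of player i.
    utility(i, profile): payoff of player i at the pure profile (a tuple).
    Returns the final profile; if the loop stopped early, it is a pure Nash equilibrium."""
    rng = random.Random(seed)
    n = len(strategies)
    profile = [rng.choice(s) for s in strategies]            # arbitrary initial state
    for _ in range(max_steps):
        # players with a strictly profitable unilateral deviation
        improvers = [i for i in range(n)
                     if any(utility(i, tuple(profile[:i] + [a] + profile[i+1:]))
                            > utility(i, tuple(profile)) for a in strategies[i])]
        if not improvers:                                     # no one can improve: pure Nash
            break
        i = rng.choice(improvers)                             # pick an arbitrary deviating agent
        profile[i] = max(strategies[i],                       # switch to a best response
                         key=lambda a: utility(i, tuple(profile[:i] + [a] + profile[i+1:])))
    return tuple(profile)
```

In potential games (next slides) this loop provably terminates at a pure Nash equilibrium; in zero-sum games like Rock-Paper-Scissors it can cycle forever.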

10 Congestion Games. n players and m resources ("edges"). Each strategy corresponds to a set of resources ("paths"). Each edge e has a cost function c_e(x) that determines its cost as a function of the number x of players using it. Cost experienced by a player = sum of the costs of the edges she uses. [Figure: a small network whose edges have cost functions x and 2x; Cost(red path) = 6, Cost(green path) = 8.]
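Restating the player-cost definition in symbols (with x_e(s) denoting the number of players whose chosen path contains edge e):

\[
c_i(s) \;=\; \sum_{e \in s_i} c_e\big(x_e(s)\big).
\]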

11 Potential Games. A potential game is a game that admits a function Φ : S → R s.t. for every s ∈ S, every agent i, and every deviation s_i' ∈ S_i, u_i(s_i', s_-i) - u_i(s) = Φ(s_i', s_-i) - Φ(s). Every congestion game is a potential game. This implies that any such game has a pure NE and that best response converges. Speed?
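The claim for congestion games is usually certified via Rosenthal's potential; the construction below is the classical one (stated in the cost convention), not reproduced on the slide itself:

\[
\Phi(s) \;=\; \sum_{e}\,\sum_{k=1}^{x_e(s)} c_e(k),
\qquad
c_i(s_i', s_{-i}) - c_i(s) \;=\; \Phi(s_i', s_{-i}) - \Phi(s).
\]

Since every better response strictly decreases Φ and the state space S is finite, best (better) response dynamics cannot cycle and must stop at a pure Nash equilibrium.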

12 BR Cycles in Zero-Sum Games. [Figure: the Rock-Paper-Scissors payoff matrix repeated six times, tracing a best-response cycle through the pure outcomes; best response keeps cycling and never reaches the mixed Nash.]

13 No Regret Learning. Regret(T) in a history of T periods: (total profit of the best fixed action in hindsight) - (total profit of the algorithm). An algorithm is characterized as "no regret" if for every input sequence the regret grows sublinearly in T. [Blackwell 56], [Hannan 57], [Fudenberg, Levine 94], ... No single action significantly outperforms the dynamic.
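In symbols, writing s^t for the outcome in period t and u_i(a, s_{-i}^t) for the profit player i would have earned by playing the fixed action a in that period:

\[
\mathrm{Regret}_i(T) \;=\; \max_{a \in S_i} \sum_{t=1}^{T} u_i\big(a, s_{-i}^{t}\big) \;-\; \sum_{t=1}^{T} u_i\big(s^{t}\big),
\qquad
\text{no regret} \;\Longleftrightarrow\; \mathrm{Regret}_i(T) = o(T) \text{ for every input sequence.}
\]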

14 No Regret Learning. No single action significantly outperforms the dynamic. Running example: each period you choose Umbrella or Sunscreen and earn profit 1 if the choice matches the weather, 0 otherwise. [Table: cumulative profit over the history - Algorithm: 3, Umbrella: 3, Sunscreen: 1.]

15 The Multiplicative Weights Algorithm, a.k.a. Hedge, a.k.a. Weighted Majority [Littlestone, Warmuth '94; Freund, Schapire '99]. Pick action s with probability proportional to (1-ε)^total(s), where total(s) denotes the cumulative cost of s over all past periods. Why is it regret minimizing? - Proof on the board.
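A minimal sketch of the update rule described above, assuming costs in [0, 1] are revealed for every action after each period; the names (n_actions, costs_at_round) are illustrative:

```python
import random

def hedge(n_actions, costs_at_round, T, eps=0.1, seed=0):
    """costs_at_round(t) -> list of n_actions costs in [0, 1] for period t.
    Returns the sequence of chosen actions and the cumulative cost of each action."""
    rng = random.Random(seed)
    total = [0.0] * n_actions                       # total(s): cumulative cost of action s
    chosen = []
    for t in range(T):
        weights = [(1 - eps) ** total[s] for s in range(n_actions)]
        z = sum(weights)
        probs = [w / z for w in weights]            # pick s with probability ∝ (1 - eps)^total(s)
        chosen.append(rng.choices(range(n_actions), weights=probs, k=1)[0])
        costs = costs_at_round(t)
        for s in range(n_actions):                  # update cumulative costs for all actions
            total[s] += costs[s]
    return chosen, total
```

With an appropriately tuned ε (roughly sqrt(ln n / T) for n actions), the standard analysis gives regret O(√(T log n)), which is sublinear in T.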

16 BREAK

17 No Regret and Equilibria Do no-regret algorithms converge to Nash equilibria in general games? Do no-regret algorithms converge to other equilibria in general games?

18 Other Equilibrium Notions (review). Rock-Paper-Scissors. Nash: a probability distribution over outcomes that is a product of mixed strategies s.t. no player has a profitable deviating strategy. Here: the uniform product distribution - choose any of the nine outcomes uniformly (prob. 1/9), i.e. each player mixes 1/3 on each action. [Figure: the RPS payoff matrix with all nine outcomes highlighted.]

19 Other Equilibrium Notions (review). Rock-Paper-Scissors. Nash: a probability distribution over outcomes, s.t. no player has a profitable deviating strategy. Coarse Correlated Equilibria (CCE):

20 Other Equilibrium Notions (review). Rock-Paper-Scissors. Coarse Correlated Equilibria (CCE): a probability distribution over outcomes (not necessarily a product of mixed strategies), s.t. no player has a profitable deviating strategy.

21 Other Equilibrium Notions (review). Rock-Paper-Scissors. Coarse Correlated Equilibria (CCE): a probability distribution over outcomes, s.t. no player has a profitable deviating strategy. Example: choose any of the six off-diagonal outcomes uniformly (prob. 1/6 each).

22 Other Equilibrium Notions (review). Rock-Paper-Scissors. Correlated Equilibria (CE): a probability distribution over outcomes, s.t. no player has a profitable deviating strategy even after conditioning on her own recommended action drawn from the distribution. Is the previous example (uniform over the six off-diagonal outcomes, prob. 1/6 each) a CE? NO.

23 Other Equilibrium Notions (review). [Figure: nested sets - Pure NE ⊆ CE ⊆ CCE.]

24 No-regret & CCE A history of no-regret algorithms is a sequence of outcomes s.t. no agent has a single deviating action that can increase her average payoff. A Coarse Correlated Equilibrium is a probability distribution over outcomes s.t. no agent has a single deviating action that can increase her expected payoff.
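In symbols, a distribution σ over outcomes S is a CCE if (this is the standard condition matching the sentence above):

\[
\mathbb{E}_{s \sim \sigma}\big[u_i(s)\big] \;\ge\; \mathbb{E}_{s \sim \sigma}\big[u_i(a, s_{-i})\big]
\qquad \text{for every player } i \text{ and every fixed action } a \in S_i.
\]

Consequently, if every player uses a no-regret algorithm, the empirical distribution of the played outcomes converges to the set of (approximate) coarse correlated equilibria.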

25 No Regret and Equilibria Do no-regret algorithms converge to Nash equilibria in general games? Do no-regret algorithms converge to other equilibria in general games? Do no-regret algorithms converge to Nash equilibria in interesting games?

26 CCE in Zero-Sum Games. In general games, CCE ⊇ conv(NE). Why? In zero-sum games, the marginals and utilities of CCE and NE agree. Why? What does this imply for no-regret algorithms?

27 BREAK 2. Can learning beat Nash equilibria by an arbitrary factor?

28 CCE in Congestion Games. Load balancing: n balls (players), n bins (machines), each machine with cost c(x) = x. Makespan: expected maximum latency over all links.

29 CCE in Congestion Games. Pure Nash: one ball per bin, so every machine has load 1. Makespan: 1. (c(x) = x.)

30 CCE in Congestion Games. Mixed Nash: each ball picks a bin uniformly at random (prob. 1/n). Makespan: Θ(log n / log log n). (c(x) = x.) [Koutsoupias, Mavronicolas, Spirakis '02], [Czumaj, Vöcking '02]

31 CCE in Congestion Games. Coarse Correlated Equilibria: makespan can be exponentially worse, Ω(√n). (c(x) = x.) [Blum, Hajiaghayi, Ligett, Roth '08]

32 No-Regret Algs in Congestion Games. Since worst-case CCE can be reproduced by worst-case no-regret algorithms, worst-case no-regret algorithms do not converge to Nash equilibria in general.

33 (Multiplicative Weights) Algorithm in (Potential) Games. x(t) is the current state of the system (a tuple of randomized strategies, one for each player). Each player tosses their coins and a specific outcome is realized. Depending on the outcome of these random events, we transition to the next state x(t+1). [Figure: the state space Δ(S), with x(t) and x(t+1) at distance O(ε) - an infinite Markov chain over an infinite (continuum) state space.]

34 (Multiplicative Weights) Algorithm in (Potential) Games. Problem 1: hard to get intuition about the problem, let alone analyze it. Let's try to come up with a "discounted" version of the problem. Ideas?

35 (Multiplicative Weights) Algorithm in (Potential) Games. Idea 1: analyze the expected motion.

36 (Multiplicative Weights) Algorithm in (Potential) Games. Idea 1: analyze the expected motion. The system evolution is now deterministic, i.e. there exists a function f s.t. E[x(t+1)] = f(x(t), ε). I wish to analyze this function (e.g. find its fixed points). [Figure: Δ(S) with x(t) and E[x(t+1)] at distance O(ε).]

37 (Multiplicative Weights) Algorithm in (Potential) Games. Problem 2: the function f is still rather complicated. Idea 2: I wish to analyze the MWA dynamics for small ε, so use a Taylor expansion to find a first-order approximation to f: f(x(t), ε) = f(x(t), 0) + ε · f'(x(t), 0) + O(ε²), where f' denotes the derivative with respect to ε.

38 (Multiplicative Weights) Algorithm in (Potential) Games. Rearranging, (f(x(t), ε) - f(x(t), 0)) / ε → f'(x(t), 0) as ε → 0. In this limit the equation specifies a vector at each point of our state space Δ(S) (i.e. a vector field). This vector field defines a system of ODEs, which we are going to analyze.

39 Deriving the ODE. Taking expectations of the MW update and differentiating w.r.t. ε at ε = 0 (the intermediate equations on the slide are not reproduced in the transcript) yields the limiting vector field: this is the replicator dynamic studied in evolutionary game theory.
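The resulting vector field is the standard (multi-population) replicator dynamic; the slide's intermediate steps are missing from the transcript, but its final form is:

\[
\dot{x}_{i s} \;=\; x_{i s}\,\Big( u_i\big(s, x_{-i}\big) \;-\; u_i(x) \Big),
\]

where x_{is} is the probability player i assigns to strategy s, u_i(s, x_{-i}) her expected utility from playing s against the others' mixed strategies, and u_i(x) her expected utility at the current profile x.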

40 Motivating Example. [Figure: a load-balancing game with two players and two machines, each machine with cost c(x) = x.]

41 Motivating Example. Each player's mixed strategy is summarized by a single number: the probability of picking machine 1. Plot the mixed strategy profile in R². [Figure: the unit square, with the pure Nash equilibria and the mixed Nash equilibrium marked.]

42 Motivating Example. Each player's mixed strategy is summarized by a single number: the probability of picking machine 1. Plot the mixed strategy profile in R². [Figure omitted in the transcript.]

43 Motivating Example. Even in the simplest case of two balls and two bins with linear utility, the replicator equation has a nonlinear form (see the worked form below).
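As a worked illustration of that claim (a sketch under the slide's setup: two players, two machines, cost c(x) = x, with p and q the probabilities that players 1 and 2 pick machine 1):

\[
\dot{p} \;=\; p\,(1-p)\,(1 - 2q),
\qquad
\dot{q} \;=\; q\,(1-q)\,(1 - 2p),
\]

since the expected cost of machine 1 to player 1 is 1+q and of machine 2 is 2-q. The right-hand sides are cubic in the state, so the dynamic is nonlinear even here; its rest points are the four corners and the fully mixed point (1/2, 1/2).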

44 The Potential Function. The congestion game has a potential function Φ [Monderer-Shapley '96]. Let Ψ = E[Φ]. A calculation (omitted in the transcript) yields that Ψ decreases along the dynamics except when every player randomizes over paths of equal expected cost; i.e. Ψ is a Lyapunov function of the dynamics. Analyzing the spectrum of the Jacobian shows that in "generic" congestion games only pure Nash equilibria are stable. [Kleinberg-Piliouras-Tardos '09]

45 Cyclic Matching Pennies (Jordan's game) [Jordan '93]. Three players, each with strategies {H, T}; player i earns profit 1 if she mismatches her opponent (player i-1), 0 otherwise. Nash equilibrium: every player mixes (1/2, 1/2). Social welfare of the NE: 3/2.

46 Cyclic Matching Pennies (Jordan's game) [Jordan '93]. Profit of 1 if you mismatch your opponent, 0 otherwise. Social welfare of the NE: 3/2. Best response cycle: (H,H,T), ...

47 Cyclic Matching Pennies (Jordan's game) [Jordan '93]. Profit of 1 if you mismatch your opponent, 0 otherwise. Social welfare of the NE: 3/2. Best response cycle: (H,H,T), (H,T,T), ...

48 Cyclic Matching Pennies (Jordan's game) [Jordan '93]. Profit of 1 if you mismatch your opponent, 0 otherwise. Social welfare of the NE: 3/2. Best response cycle: (H,H,T), (H,T,T), (H,T,H), (T,T,H), (T,H,H), (T,H,T), (H,H,T). Social welfare along the cycle: 2.

49 Cyclic Matching Pennies (Jordan's game) [Jordan '93]. Social welfare of the NE: 3/2. Best response cycle: (H,H,T), (H,T,T), (H,T,H), (T,T,H), (T,H,H), (T,H,T), (H,H,T). Social welfare along the cycle: 2. Payoff of player i (rows: player i's action, columns: player i-1's action), i ∈ {0,1,2}:

        H   T
   H    0   1
   T    1   0

50 Asymmetric Cyclic Matching Pennies [Jordan '93]. Payoff of player i (rows: player i's action, columns: player i-1's action), i ∈ {0,1,2}:

        H   T
   H    0   1
   T    M   0

Nash equilibrium: every player plays H with probability 1/(M+1) and T with probability M/(M+1). Social welfare of the NE: 3M/(M+1) < 3. Best response cycle: (H,H,T), (H,T,T), (H,T,H), (T,T,H), (T,H,H), (T,H,T), (H,H,T). Social welfare along the cycle: M+1.

51 Replicator Dynamics. [Figure: the state space - each coordinate is Pr(player i plays H); a point is a mixed strategy profile.]

52 Replicator Dynamics. [Equation on the slide: the rate of growth of action H expressed via the utility of player i - the replicator equation.] Survival of the fittest: the probability of an action increases iff it outperforms the current (mixed) strategy. A simple, well-studied dynamic with desirable properties: it is the limit of the weighted majority algorithm as the step size goes to 0 (hence, e.g., no-regret). A state is a mixed strategy profile; coordinate i is Pr(player i plays H). (A numerical sketch follows below.)
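A minimal numerical sketch (forward-Euler integration) of these dynamics on the asymmetric game above; M, the step size, the horizon and the initial conditions are illustrative choices, not values from the slides:

```python
def replicator_jordan(p, M=3.0, dt=0.01, steps=100_000):
    """p = [p0, p1, p2] with p[i] = Pr(player i plays H); player i is matched
    against player i-1 (cyclically): payoff 1 for H vs T, payoff M for T vs H."""
    trajectory = [tuple(p)]
    for _ in range(steps):
        new = p[:]
        for i in range(3):
            q = p[(i - 1) % 3]                       # opponent's probability of H
            u_H = 1.0 * (1.0 - q)                    # expected payoff of playing H
            u_T = M * q                              # expected payoff of playing T
            avg = p[i] * u_H + (1.0 - p[i]) * u_T    # current expected payoff
            new[i] = p[i] + dt * p[i] * (u_H - avg)  # replicator step for action H
        p = new
        trajectory.append(tuple(p))
    return trajectory

# Experiments 1 and 2 from the slides, qualitatively: a symmetric ("diagonal") start
# heads toward the mixed Nash (H with prob. 1/(M+1)), while an asymmetric start
# drifts toward the boundary cycle.
on_diagonal  = replicator_jordan([0.4, 0.4, 0.4])
off_diagonal = replicator_jordan([0.4, 0.5, 0.6])
```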

53 Optimal Social Welfare: SW = M+1. Payoff of player i (rows: player i's action, columns: player i-1's action), i ∈ {0,1,2}:

        H   T
   H    0   1
   T    M   0

54 Unique Nash: all players play H with probability 1/(M+1); SW = 3M/(M+1) << M+1. Payoff of player i (rows: player i's action, columns: player i-1's action), i ∈ {0,1,2}:

        H   T
   H    0   1
   T    M   0

55 Experiment 1: Starting on the diagonal we converge to Nash

56 [Figure-only slide; no text in the transcript.]

57 Experiment 2: Starting off the diagonal we converge to a 6-cycle. Proofs?

58 As M increases, the region where the SW is not a Lyapunov function increases as well.

59 Example: M=3

60 Example: M = 3. Proof structure: SW is still a Lyapunov function in the regions where SW is less than SW(Nash). [Figure: these regions shaded.]

61 Proof structure: SW is still a Lyapunov function in the regions where SW is less than SW(Nash). [Figure.]

62 Starting off the diagonal, we always escape these regions and get trapped in the area where SW > SW(Nash).

63 Proof idea: if SW > SW(Nash), then a suitable function (given on the slide but omitted in the transcript) is a Lyapunov function. This new Lyapunov function implies convergence to the boundary.

64 Step 1: Partition the space into regions where the growth rates of all strategies have fixed signs. Step 2: Establish that trajectories cycle infinitely around a restricted neighborhood of the boundary.

65 Step 3: Specialized Lyapunov functions & "stitching" arguments.

66 The Game* is ON *by game I mean projects

