Reinforcement Learning Dynamics in Social Dilemmas

Reinforcement Learning Dynamics in Social Dilemmas Luis R. Izquierdo & Segis Izquierdo

Outline of the presentation
- Motivation
  - BM reinforcement model
  - Macy and Flache’s SRE and SCE
  - Self-Reinforcing Equilibrium (SRE): Challenges
  - Self-Correcting Equilibrium (SCE): Challenges
- In-depth analysis of the dynamics of the model
- Analysis of the robustness of the model
- Conclusions

BM reinforcement model Reinforcement learners tend to repeat actions that led to satisfactory outcomes in the past, and avoid choices that resulted in unsatisfactory experiences. The propensity or probability to play an action is increased (decreased) if it leads to a satisfactory (unsatisfactory) outcome.

BM reinforcement model: The Prisoner’s Dilemma

                          Player 2
                     Cooperate   Defect
Player 1  Cooperate    3 (R)      0 (S)
          Defect       4 (T)      1 (P)

(Payoffs shown are Player 1’s; the game is symmetric.)

pC = probability to cooperate; pD = probability to defect
Aspiration threshold: A = 2
Learning rate: l = 0.5

BM reinforcement model. Figure: the player’s choice (C / D) and the partner’s choice (C / D) determine the outcome [CC, DD, CD, DC] and hence the payoff [R, P, S, T]; comparing the payoff with the aspiration threshold yields a stimulus, with −1 <= stimulus <= 1, which is used to update pa, the probability of the action just taken. Payoff scale: T = 4, R = 3, A = 2, P = 1, S = 0.

BM reinforcement model. Stimulus and resulting probability change for each action/payoff pair (A = 2):
st(D / T = 4) = 1: pD ↑↑, so pC ↓↓
st(C / R = 3) = 0.5: pC ↑
st(D / P = 1) = −0.5: pD ↓, so pC ↑
st(C / S = 0) = −1: pC ↓↓

BM reinforcement model: The Prisoner’s Dilemma (A = 2)
Change in Player 1’s probability to cooperate (pC) after cooperating:

                          Player 2
                     Cooperate   Defect
Player 1  Cooperate    pC ↑       pC ↓↓

BM reinforcement model: the update rule. The player’s choice (C / D) and the partner’s choice (C / D) determine the outcome [CC, DD, CD, DC], the payoff [R, P, S, T] and, via the aspiration threshold, the stimulus. The probability pa of the action just taken is then updated: if the outcome is satisfactory (positive stimulus), pa moves towards 1 by a proportion l·st of the remaining distance; if it is unsatisfactory (negative stimulus), pa moves towards 0 by a proportion l·|st| of the remaining distance.
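A minimal sketch of this update rule in Python, assuming the BM (Bush-Mosteller) formulation with the parameters used on these slides (payoffs T = 4, R = 3, P = 1, S = 0, aspiration A = 2, learning rate l = 0.5) and a stimulus equal to the payoff’s distance from the aspiration scaled by the largest possible such distance. Function and variable names are illustrative, not the authors’ code.

```python
# Sketch of the BM (Bush-Mosteller) update with the slides' parameters.
# PAYOFFS maps (my action, partner's action) to my payoff.
PAYOFFS = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 4, ('D', 'D'): 1}
A = 2.0                                            # aspiration threshold
L = 0.5                                            # learning rate
SUP = max(abs(u - A) for u in PAYOFFS.values())    # largest possible |payoff - A| = 2

def stimulus(payoff):
    """Normalised stimulus in [-1, 1]: positive if satisfactory, negative if not."""
    return (payoff - A) / SUP

def update_prob(p_action, payoff):
    """Move the probability of the action just taken towards 1 (if satisfied)
    or towards 0 (if dissatisfied) by a proportion l*|st| of the remaining distance."""
    st = stimulus(payoff)
    if st >= 0:
        return p_action + L * st * (1.0 - p_action)
    return p_action + L * st * p_action            # st < 0, so p_action decreases

# Example: a cooperator meets a defector (payoff S = 0, stimulus -1): pC drops from 0.5 to 0.25.
pC = 0.5
pC = update_prob(pC, PAYOFFS[('C', 'D')])
print(pC)   # 0.25
```

Each player is updated independently with this rule, using the probability of the action it actually played (pC if it cooperated, pD = 1 − pC if it defected).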

BM reinforcement model. Figure: evolution of pC over iterations n = 0, 1, 2, … with learning rate l = 0.5; each outcome moves pC up or down according to its stimulus (st = 1 for T = 4, st = 0.5 for R = 3, st = −0.5 for P = 1, st = −1 for S = 0, with A = 2).

BM reinforcement model. Figure: most likely move after each of the four outcomes (A = 2): after (C,C) both players receive R = 3; after (C,D) Player 1 receives S = 0 and Player 2 receives T = 4; after (D,C) Player 1 receives T = 4 and Player 2 receives S = 0; after (D,D) both receive P = 1.

BM reinforcement model. Figure: expected motion of the system in the (pC1, pC2) plane for the Prisoner’s Dilemma with payoffs T = 4, R = 3, P = 1, S = 0 and aspiration threshold A = 2.
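The expected-motion field sketched on this slide can be reproduced numerically: weight the one-step change in each player’s cooperation probability by the probability of each of the four outcomes. A sketch under the same assumptions as the previous one (payoffs 4 / 3 / 1 / 0, A = 2, l = 0.5); helper names are again illustrative.

```python
# Expected one-step motion E[Delta(pC1, pC2)] of the BM model in the Prisoner's Dilemma.
A, L, SUP = 2.0, 0.5, 2.0                         # aspiration, learning rate, max |payoff - A|
PAY = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4),    # (Player 1 payoff, Player 2 payoff)
       ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def new_pC(pC, action, payoff):
    """Cooperation probability after one BM update of the action just taken."""
    st = (payoff - A) / SUP
    p = pC if action == 'C' else 1.0 - pC         # probability of the action taken
    p += L * st * (1.0 - p) if st >= 0 else L * st * p
    return p if action == 'C' else 1.0 - p

def expected_motion(pC1, pC2):
    """E[Delta pC1], E[Delta pC2] at the current state."""
    d1 = d2 = 0.0
    for a1, q1 in (('C', pC1), ('D', 1.0 - pC1)):
        for a2, q2 in (('C', pC2), ('D', 1.0 - pC2)):
            u1, u2 = PAY[(a1, a2)]
            d1 += q1 * q2 * (new_pC(pC1, a1, u1) - pC1)
            d2 += q1 * q2 * (new_pC(pC2, a2, u2) - pC2)
    return d1, d2

# Evaluate the field at a few states (the arrows of the slide's figure).
for p in (0.1, 0.5, 0.9):
    print(p, expected_motion(p, p))
```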

MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99 (Suppl. 3), 7229-7236. Two cases are studied: fixed aspirations with P < A < R, and floating aspirations.

MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99 (Suppl. 3), 7229-7236:

“We identify a dynamic solution concept, stochastic collusion, based on a random walk from a self-correcting equilibrium (SCE) to a self-reinforcing equilibrium (SRE). These concepts make much more precise predictions about the possible outcomes for repeated games.”

“Rewards produce a SRE in which the equilibrium strategy is reinforced by the payoff, even if an alternative strategy has higher utility.”

e.g.: PC1 = 1, PC2 = 1 with Ai < Ri
Action / stimulus: C+C+; C+C+; C+C+; …
PC1, PC2: 1 1; 1 1; 1 1; …

BM reinforcement model. Figure: expected motion in the (pC1, pC2) plane with the cooperative SRE at (1, 1) marked (T = 4, R = 3, P = 1, S = 0, A = 2).

MACY M W and Flache A (2002) Learning Dynamics in Social Dilemmas. Proc. Natl. Acad. Sci. USA 99 (Suppl. 3), 7229-7236:

“A mix of rewards and punishments can produce a SCE in which outcomes that punish cooperation or reward defection (causing the probability of cooperation to decrease) balance outcomes that reward cooperation or punish defection (causing the probability of cooperation to increase).”

Example: if the probability of cooperation increases by 0.2 with probability 0.6 and decreases by 0.3 with probability 0.4, then E(∆ PC) = 0.6 × 0.2 − 0.4 × 0.3 = 0.

“The SRE is a black hole from which escape is impossible. In contrast, players are never permanently trapped in SCE”

BM reinforcement model. Figure: expected motion in the (pC1, pC2) plane with the SRE and the SCE marked (T = 4, R = 3, P = 1, S = 0, A = 2).

SRE: Challenges “The SRE is a black hole from which escape is impossible” “A chance sequence of fortuitous moves can lead both players into a self-reinforcing stochastic collusion. The fewer the number of coordinated moves needed to lock-in SRE, the better the chances.” However… The cooperative SRE implies PC = 1, but according to the BM model, PC = 1 can be approached, but not reached in finite time! How can you “lock-in” if the SRE cannot be reached? Floating-point errors? Can we give a precise definition of SREs?
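The “floating-point errors?” question can be probed directly: in double precision the BM increment shrinks geometrically as pC approaches 1 and eventually falls below the spacing of representable numbers near 1. Whether pC ends up exactly equal to 1.0 depends on rounding details, so the sketch below (assuming repeated rewarded cooperation with l = 0.5 and stimulus 0.5, as on the earlier slides) simply reports what happens.

```python
# Probe: does a long run of rewarded cooperation drive pC to exactly 1.0 in
# IEEE double precision, or only arbitrarily close to it?
l, st = 0.5, 0.5        # learning rate and stimulus of the (C, C) outcome with A = 2
pC = 0.5
for n in range(1, 501):
    pC = pC + l * st * (1.0 - pC)      # BM update after a satisfactory outcome
    if pC == 1.0:
        print(f"pC rounded to exactly 1.0 after {n} mutual cooperations")
        break
else:
    print(f"pC never became exactly 1.0; final gap 1 - pC = {1.0 - pC:.3e}")
```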

SCE: Challenges “The SCE obtains when the expected change of probabilities is zero and there is a positive probability of punishment as well as reward” E(∆ PC) = 0 But … Such SCEs are not always attractors of the actual dynamics.

SCE: Challenges. Such SCEs are not always attractors of the actual dynamics. Figure: a system with E(∆ PC) = 0 at the point labelled SCE.

SCE: Challenges “The SCE obtains when the expected change of probabilities is zero and there is a positive probability of punishment as well as reward” E(∆ PC) = 0 But … Such SCEs are not always attractors of the actual dynamics! Apart from describing regularities in simulated processes, can we provide a mathematical basis for the attractiveness of SCEs?

Outline of the presentation
- Motivation
- In-depth analysis of the dynamics of the model
  - Formalisation of SRE and SCE
  - Different regimes in the dynamics of the system
    - Dynamics with HIGH learning rates (i.e. fast adaptation)
    - Dynamics with LOW learning rates (i.e. slow adaptation)
  - Validity of the expected motion approximation
- Analysis of the robustness of the model
- Conclusions

A definition of SRE. SRE: an absorbing state of the system. It is not an event, and not an infinite chain of events. SREs cannot be reached in finite time, but the probability distribution of the states concentrates around them. Figure: the space of states (pC1, pC2) in the unit square, with the corners corresponding to the outcomes CC, CD, DC and DD; in the case Si < Ai < Pi, the DD corner is absorbing and hence an SRE.

A definition of SCE. SCE of a system S: an asymptotically stable critical point of the continuous time limit approximation of its expected motion. Figure: phase portrait with the SRE and the SCE marked.
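Under this definition an SCE can be located and checked numerically: find a zero of the expected-motion field (which, up to a positive time rescaling, is the drift of the continuous time limit) and verify that the Jacobian there has eigenvalues with negative real parts. A sketch for the Prisoner’s Dilemma parameterised as [ 4 , 3 , 1 , 0 | 2 | 0.5 ]², reusing the BM update logic of the earlier sketches; it assumes NumPy and SciPy are available.

```python
# Numerical SCE check: zero of the drift field plus eigenvalues of its Jacobian.
import numpy as np
from scipy.optimize import fsolve

A, L, SUP = 2.0, 0.5, 2.0
PAY = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4), ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def new_pC(pC, action, payoff):
    st = (payoff - A) / SUP
    p = pC if action == 'C' else 1.0 - pC
    p += L * st * (1.0 - p) if st >= 0 else L * st * p
    return p if action == 'C' else 1.0 - p

def drift(x):
    """Expected one-step change of (pC1, pC2); its zeros are the candidate rest points."""
    p1, p2 = x
    d = np.zeros(2)
    for a1, q1 in (('C', p1), ('D', 1.0 - p1)):
        for a2, q2 in (('C', p2), ('D', 1.0 - p2)):
            u1, u2 = PAY[(a1, a2)]
            d += q1 * q2 * np.array([new_pC(p1, a1, u1) - p1, new_pC(p2, a2, u2) - p2])
    return d

root = fsolve(drift, x0=[0.4, 0.4])                      # interior zero of the drift field
eps = 1e-6
jac = np.column_stack([(drift(root + eps * e) - drift(root - eps * e)) / (2 * eps)
                       for e in np.eye(2)])              # finite-difference Jacobian
print("candidate SCE:", root)
print("Jacobian eigenvalues:", np.linalg.eigvals(jac))   # negative real parts => asymptotically stable
```

For this parameterisation the interior zero should come out near [ 0.37 , 0.37 ], the SCE quoted for this game on a later slide.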

Figure: expected movement of the system in a Stag Hunt game parameterised as [ 3 , 4 , 1 , 0 | 0.5 | 0.5 ]². The numbered balls show the state of the system after the indicated number of iterations in a sample run. The points where E(∆ PC) = 0 are marked: two of them are both SRE and SCE, and one is not an SCE.

Three different dynamic regimes “By the ultralong run, we mean a period of time long enough for the asymptotic distribution to be a good description of the behavior of the system. The long run refers to the time span needed for the system to reach the vicinity of the first equilibrium in whose neighborhood it will linger for some time. We speak of the medium run as the time intermediate between the short run [i.e. initial conditions] and the long run, during which the adjustment to equilibrium is occurring.” (Binmore, Samuelson and Vaughan 1995, p. 10)

Figure: trajectories in the phase plane of the differential equation corresponding to the Prisoner’s Dilemma game parameterised as [ 4 , 3 , 1 , 0 | 2 | l ]², together with a sample simulation run (l = 2⁻²). The short run (initial conditions), medium run, long run and ultralong run are indicated on the figure. This system has an SCE at [ 0.37 , 0.37 ].

The ultralong run (SRE) Most BM systems –in particular all the systems studied by Macy and Flace (2002) with fixed aspirations– converge to an SRE in the ultralong run if there exits at least one SRE.

High learning rates (fast adaptation). Learning rate: l = 0.5. The ultralong run (i.e. convergence to an SRE) is quickly reached; the other dynamic regimes are not clearly observed. Figure: a sample run, with the SRE marked.

Low learning rates (slow adaptation). Learning rate: l = 0.25. The three dynamic regimes are clearly observed: medium run -> trajectories, long run -> SCE, ultralong run -> SRE (the short run corresponds to the initial conditions). Figure: a sample run with the regimes marked.

The medium run (trajectories). For sufficiently small learning rates and a number of iterations n that is not too large (n·l bounded), the medium run dynamics of the system are best characterised by the trajectories in the phase plane. Figure panels: l = 0.3, n = 10; l = 0.03, n = 100; l = 0.003, n = 1000.
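This medium-run claim is easy to check by simulation: hold n·l fixed and shrink l. A minimal sketch using the slide’s (l, n) pairs and the Prisoner’s Dilemma parameterisation of the earlier sketches; averaging the end states over several seeds would show them concentrating around the corresponding point of the phase-plane trajectory as l decreases, which is what the claim above predicts. Helper names are illustrative.

```python
# Simulate the two-player BM process for the (l, n) pairs shown on the slide (n*l = 3).
import random

A, SUP = 2.0, 2.0
PAY = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4), ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def bm_update(pC, action, payoff, l):
    """One BM update of the cooperation probability, given the action played and its payoff."""
    st = (payoff - A) / SUP
    p = pC if action == 'C' else 1.0 - pC
    p += l * st * (1.0 - p) if st >= 0 else l * st * p
    return p if action == 'C' else 1.0 - p

def simulate(l, n, p1=0.9, p2=0.9, seed=0):
    rng = random.Random(seed)
    for _ in range(n):
        a1 = 'C' if rng.random() < p1 else 'D'
        a2 = 'C' if rng.random() < p2 else 'D'
        u1, u2 = PAY[(a1, a2)]
        p1, p2 = bm_update(p1, a1, u1, l), bm_update(p2, a2, u2, l)
    return p1, p2

for l, n in [(0.3, 10), (0.03, 100), (0.003, 1000)]:
    print(f"l = {l:<6} n = {n:<5} end state = {simulate(l, n)}")
```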

The long run (SCEs). When trajectories finish in an SCE, the system will approach the SCE and spend a significant amount of time in its neighbourhood if learning rates are low enough and the number of iterations n is large enough (and finite). This regime is the long run.

The ultralong run (SREs). Most BM systems – in particular, all the systems studied by Macy and Flache (2002) with fixed aspirations – converge to an SRE in the ultralong run if there exists at least one SRE.

Figure: behaviour of the system over iterations (time) for learning rates ranging from l = 2⁻¹ to l = 2⁻⁷.

The validity of mean field approximations. The asymptotic (i.e. ultralong) behaviour of the BM model cannot be approximated using the continuous time limit version of its expected motion. Such an approximation can be valid over bounded time intervals but deteriorates as the time horizon increases.

Outline of the presentation
- Motivation
- In-depth analysis of the dynamics of the model
- Analysis of the robustness of the model
  - Model with occasional mistakes (trembling hands)
  - Renewed importance of SCEs
  - Discrimination among different SREs
- Conclusions

Model with trembling hands. Figure: the original model (l = 0.25, noise = 0) compared with the model with trembling hands (l = 0.25, noise = 0.01).
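A sketch of the trembling-hands variant: with probability noise, the action a player implements differs from the one it intended to play. Whether learning is driven by the intended or the implemented action is a modelling choice not spelled out in this transcript; the sketch below uses the implemented action and the payoff actually received, and redefines the BM machinery of the earlier sketches so that it runs on its own.

```python
# Trembling-hands variant of the two-player BM simulation (illustrative sketch).
import random

A, SUP = 2.0, 2.0
PAY = {('C', 'C'): (3, 3), ('C', 'D'): (0, 4), ('D', 'C'): (4, 0), ('D', 'D'): (1, 1)}

def bm_update(pC, action, payoff, l):
    st = (payoff - A) / SUP
    p = pC if action == 'C' else 1.0 - pC
    p += l * st * (1.0 - p) if st >= 0 else l * st * p
    return p if action == 'C' else 1.0 - p

def tremble(action, noise, rng):
    """With probability `noise`, flip the intended action."""
    return ('D' if action == 'C' else 'C') if rng.random() < noise else action

def simulate(l, n, noise, p1=0.9, p2=0.9, seed=1):
    rng = random.Random(seed)
    for _ in range(n):
        a1 = tremble('C' if rng.random() < p1 else 'D', noise, rng)
        a2 = tremble('C' if rng.random() < p2 else 'D', noise, rng)
        u1, u2 = PAY[(a1, a2)]
        # Learning uses the implemented action and the payoff actually received (an assumption).
        p1, p2 = bm_update(p1, a1, u1, l), bm_update(p2, a2, u2, l)
    return p1, p2

print("original model (noise = 0):    ", simulate(l=0.25, n=10_000, noise=0.0))
print("trembling hands (noise = 0.01):", simulate(l=0.25, n=10_000, noise=0.01))
```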

Model with trembling hands (no SREs). SREUP: SRE of the unperturbed process. The lower the noise, the higher the concentration around SREUPs.

Model with trembling hands. Importantly, not all the SREs of the unperturbed process are equally robust to noise. Figure: one representative run of the system parameterised as [ 4 , 3 , 1 , 0 | 0.5 | 0.5 ] with initial state [ 0.9 , 0.9 ] and noise εi = ε = 0.1. Without noise: Prob( SRE[1,1] ) ≈ 0.7, Prob( SRE[0,0] ) ≈ 0.3.

Model with trembling hands Not all the SREs of the unperturbed process are equally robust to noise.

Outline of the presentation
- Motivation
- In-depth analysis of the dynamics of the model
- Analysis of the robustness of the model
- Conclusions

Conclusions
- Formalisation of SRE and SCE
- In-depth analysis of the dynamics of the model
  - Strongly dependent on the speed of learning
  - Beware of mean field approximations
- Analysis of the robustness of the model
  - Results change dramatically when small quantities of noise are added
  - Not all the SREs of the unperturbed process are equally robust to noise

Reinforcement Learning Dynamics in Social Dilemmas Luis R. Izquierdo & Segis Izquierdo