Dipartimento di Elettronica e Informazione
Multiagent rational decision making: searching and learning for “good” strategies
Enrique Munoz de Cote
What is “good” multiagent learning?
The prescriptive non-cooperative agenda [Shoham et al. 07]
- We are interested in problems where an agent needs to interact in open environments populated by other agents.
- What is a “good” strategy in this situation? Can the monkey find a “good” strategy, or does it need to learn?
- View: a single-agent perspective on the multiagent problem.
- What counts as “good” is environment dependent.
Multiagent Reinforcement Learning Framework
The framework is organized along two axes:
- unknown world (learning) vs. known world (solving);
- single agent vs. multiple agents.
Single-agent models: MDPs (Decision Theory, Planning). Multiagent models: matrix games and stochastic games.
Game theory and multiagent learning: brief backgrounds
- Game theory: stochastic games; solution concepts.
- Multiagent learning: solution concepts; relation to game theory.
Stochastic games (SGs)
- SGs are good examples of how agents' behaviours depend on each other.
- Strategies represent the way agents behave.
- Strategies might change as a function of other agents' strategies.
- Game theory mathematically captures behaviour in strategic situations.
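For reference, the standard formal definition of the model used throughout (not spelled out on the slide): a stochastic game is a tuple

\[
\Gamma = \langle S, N, \{A_i\}_{i \in N}, T, \{R_i\}_{i \in N} \rangle ,
\]

where \(S\) is the set of states, \(N\) the set of agents, \(A_i\) agent \(i\)'s action set, \(T(s' \mid s, \vec{a})\) the transition function over joint actions \(\vec{a}\), and \(R_i(s, \vec{a})\) agent \(i\)'s reward function. A strategy \(\pi_i\) maps states (or histories) to distributions over \(A_i\).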
A computational example: the SG version of chicken [Hu & Wellman, 03]
(Grid world with agents A and B and a goal cell $.)
- Actions: U, D, R, L, X.
- Coin flip on collision; semiwalls are crossed with probability 50%.
- Collision = -5; step cost = -1; goal = +100; discount factor = 0.95; both agents can reach the goal.
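A tiny sketch (not from the slides) of how these constants combine into a discounted return; the reward sequence below is a made-up trajectory used only for illustration.

```python
COLLISION, STEP_COST, GOAL, GAMMA = -5, -1, 100, 0.95

def discounted_return(rewards, gamma=GAMMA):
    # Sum of gamma^t * r_t over the trajectory.
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical trajectory: -1 step costs, a collision on the second step,
# and the +100 goal reached on the fourth step.
print(discounted_return([STEP_COST, STEP_COST + COLLISION, STEP_COST, GOAL]))
```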
Strategies on the SG of chicken
Average expected rewards of different strategy pairs: (88.3, 43.7); (43.7, 88.3); (66, 66); (43.7, 43.7); (38.7, 38.7); (83.6, 83.6).
Solution concept: equilibria
Definition [equilibrium]. An n-tuple of strategies (one for each agent) is an equilibrium point if no agent has an incentive to unilaterally deviate from its current strategy.
(Comic: http://xkcd.com/182)
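In the notation introduced earlier, the Nash condition can be written as follows (standard formalization, added for completeness):

\[
\forall i \in N, \;\forall \pi_i' : \quad
U_i(\pi_1^*, \dots, \pi_i^*, \dots, \pi_n^*) \;\ge\; U_i(\pi_1^*, \dots, \pi_i', \dots, \pi_n^*),
\]

where \(U_i\) is agent \(i\)'s expected (discounted or average) payoff under a strategy profile.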
Equilibrium values
Average total reward at equilibrium:
- Nash: (88.3, 43.7) and (43.7, 88.3), very imbalanced, inefficient; (53.6, 53.6), the ½ mix, still inefficient.
- Correlated: ([43.7, 88.3], [43.7, 88.3]).
- Minimax: (43.7, 43.7).
- Friend: (38.7, 38.7).
Equilibria are computationally difficult to find in general.
Repeated games
What if agents are allowed to play multiple times?
- Strategies can be a function of history.
- Strategies can be randomized.
- A Nash equilibrium still exists.
Computing strategies for repeated SGs
- Complete information: solve (exact or approximate solutions).
- Incomplete information: learn.
  - The environment (as perceived by the agent) is not Markovian.
  - Convergence is not guaranteed (exceptions: zero-sum and team games).
  - Unwanted cycles and unpredicted behaviour appear.
There are algorithms for solving and for learning that use the same successive approximations to the Bellman equations to derive solution policies.
Learning equilibrium strategies in SGs
How can multiagent RL learn any of those strategies?
- Multiagent RL updates are based on the Bellman equations (just as in single-agent RL).
- A value iteration (VI) algorithm solves for the optimal Q function.
- Which solution VI finds depends on the operator Eq{·} used to evaluate the next state.
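The update equations themselves did not survive extraction; a standard form consistent with the slide's Eq{·} operator (the one used by minimax-Q/Nash-Q style algorithms) is:

\[
Q_i(s, \vec{a}) \;\leftarrow\; R_i(s, \vec{a}) \;+\; \gamma \sum_{s'} T(s' \mid s, \vec{a}) \, \mathrm{Eq}\big\{ Q_1(s', \cdot), \dots, Q_n(s', \cdot) \big\},
\]

where Eq{·} returns the value of the stage game at \(s'\) under the chosen solution concept (minimax, Nash, correlated, friend, ...). Substituting a different Eq{·} makes VI converge to a different one of the strategies listed on the previous slides.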
Defining optimality
What is A's optimal strategy?
- The safest one?
- The one that minimizes the opponent's reward?
- The one that maximizes the opponent's reward?
- The socially stable one?
In an open environment, an optimal strategy is arguable and may be defined by several criteria.
Defining optimality: our criteria
- Optimality: should obtain close to maximum utility against other best-response algorithms.
- Security: should guarantee a minimum lower-bound utility.
- Simplicity: should be intuitive to understand and implement.
- Adaptivity: should learn how to behave optimally, and remain optimal (even if the environment changes).
Observation: reinforcement learning updates
- Q-learning converges to a best-response (BR) strategy in MDPs.
- Definition [best response]. A best response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy.
- Example environment: only agents with fixed strategies.
- Observation 1: a learner's BR is optimal against fixed strategies.
- Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy.
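A minimal sketch (not from the talk) of both observations: a Q-learner facing an opponent with a fixed strategy is effectively solving an MDP, so it converges to a best response; changing the opponent's fixed strategy changes that best response. The payoff matrix is the chicken game from the later "Leader strategies" slide; the opponent's mixing probability is a made-up example.

```python
import random

# Chicken payoffs for the learner (agent A), indexed [a_A][a_B];
# action 0 = "center", action 1 = "wall" (see the "Leader strategies" slide).
R_A = [[-10, 1],
       [-1,  0]]

def fixed_opponent(p_center=0.8):
    # Hypothetical fixed strategy: B plays "center" with probability p_center.
    return 0 if random.random() < p_center else 1

def q_learning_best_response(episodes=20000, alpha=0.1, epsilon=0.1):
    Q = [0.0, 0.0]  # single state, two actions for A
    for _ in range(episodes):
        a = random.randrange(2) if random.random() < epsilon else (0 if Q[0] >= Q[1] else 1)
        r = R_A[a][fixed_opponent()]
        Q[a] += alpha * (r - Q[a])  # repeated one-shot game: no next-state term
    return Q  # against a mostly-"center" opponent, Q favours "wall"

print(q_learning_best_response())
```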
Social Rewards
- Shaping rewards and intrinsic motivations
- Leader and follower strategies
- Open questions
Joint work with Monica Babes and Michael L. Littman.
Social rewards: hints from the brain
We’re smart, but evolution doesn’t trust us to plan all that far ahead. Evolution programs us to want things likely to bring about what we need:
- taste -> nutrition
- pleasure -> procreation
- eye contact -> care
- generosity -> cooperation
Is cooperation “pleasurable”?
An fMRI study during the repeated prisoner’s dilemma showed that humans perceive “internal rewards” (activity in the brain’s reward center): positive for mutual cooperation, negative for defection.
Social rewards: a telescoping effect that guides learners to better equilibria
Objective: change the behavior of the learner by influencing its early experience.
- Shaping rewards [Ng et al., 99]
- Intrinsic motivation [Singh et al., 04]
- Social rewards
Leader and follower reasoning [Littman and Stone, 01]
- A leader strategy is able to guide a best-response learner. Assumption: the opponent will adapt to its decisions.
- A best-response learner is a follower. Assumption: its behaviour doesn't hurt anybody.
- In the example, A is a leader and B is a follower.
Leader strategies
Assumption: the opponent is playing a best response.

Matrix game of chicken (payoffs are listed as A, B; agent A, the leader, fixes a strategy and agent B, the follower, best-responds):

                 B: center    B: wall
  A: center      -10, -10      1, -1
  A: wall         -1,  1       0,  0

- BR_B(wall) = center, and R_A(wall, center) = -1.
- BR_B(center) = wall, and R_A(center, wall) = 1.
So the leader commits to center and obtains 1 instead of -1.
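A compact way to state that reasoning (a sketch built on the payoff matrix above, not code from the talk):

```python
# Leader reasoning in the matrix game of chicken: commit to the action whose
# follower best response gives the leader the highest payoff.
ACTIONS = ["center", "wall"]
PAYOFFS = {  # (payoff to A, payoff to B); first key = A's action, second = B's action
    ("center", "center"): (-10, -10),
    ("center", "wall"):   (1, -1),
    ("wall",   "center"): (-1, 1),
    ("wall",   "wall"):   (0, 0),
}

def follower_best_response(a_leader):
    # B best-responds to the leader's fixed action.
    return max(ACTIONS, key=lambda b: PAYOFFS[(a_leader, b)][1])

def leader_commitment():
    # A picks the fixed action that is best given B's best response.
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, follower_best_response(a))][0])

a = leader_commitment()
print(a, follower_best_response(a))  # -> center wall: the leader gets 1 instead of -1
```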
Leader mutual advantage strategies
(Figure: the SG version of the prisoner's dilemma [Munoz de Cote and Littman, 2008], marking the one-shot Nash outcome and the mutual-advantage Nash outcome of the repeated game.)
- Easy-to-say way: compute the convex hull of payoffs.
- Easy-to-compute way:
  - compute attack and defence strategies;
  - compute a mutual advantage strategy;
  - use the attack strategy as a threat against deviations.
How can a learner also be a leader?
- We influence the best-response learner's early experience with special shaping rewards called “social rewards”.
- The learner starts as a leader.
- If the opponent is not a BR follower, the social shaping is washed away.
Shaping based on potentials
- Idea: each state is assigned a potential Φ(s) [Ng et al., 1999].
- On each transition s → s', the utility is augmented with the difference in potentials: F(s, s') = γΦ(s') − Φ(s).
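A one-line sketch of that construction (assuming a known γ and a table of potentials; this is the standard Ng et al. form, not code from the talk):

```python
GAMMA = 0.95

def shaped_reward(r, s, s_next, potential):
    """Environment reward augmented with the potential difference F(s, s')."""
    return r + GAMMA * potential[s_next] - potential[s]
```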
The Q+shaping algorithm
Q+shaping's main objective is to lead or follow, as appropriate.
1. Compute attack and defence strategies.
2. Compute a mutual advantage strategy: for repeated matrix games use the [Littman and Stone, 2003] algorithm; for repeated stochastic games use the [Munoz de Cote and Littman, 2008] algorithm.
3. Compute the state values (potentials) of the mutual advantage strategy.
4. Initialize the Q-table with the potential-based function F(s, s').
- Using the attack strategy as a threat against deviations teaches BR learners better mutual advantage strategies.
- Theorem [Wiewiora 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values.
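Step 4 amounts to the following (a sketch; `states`, `actions`, and `potential` are placeholders for the quantities computed in steps 1–3):

```python
def init_q_with_potentials(states, actions, potential):
    # By the Wiewiora (2003) equivalence, initializing Q with the potentials has
    # the same effect on a Q-learner as adding the shaping reward F(s, s') online.
    return {(s, a): potential[s] for s in states for a in actions}
```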
A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
Joint work with Michael L. Littman.
Main result
Concretely, we address the following computational problem: given a repeated stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game (specifically, one whose payoffs match the egalitarian point).
(Figure: the convex hull of the average payoffs in the (v1, v2) plane, with the egalitarian line.)
How? (the short story version)
- Compute minimax (security) strategies.
- Solve two linear programming problems.
- The algorithm searches for the point P with the highest egalitarian value.
(Figure: the convex hull of a hypothetical SG, showing the egalitarian line and the point P.)
How? (the search for point P)
folkEgal(U1, U2, ε):
- Compute R = friend1, L = friend2 and the attack1, attack2 strategies.
- Find the egalitarian point and its policy:
  - if R is left of the egalitarian line: P = R;
  - else if L is right of the egalitarian line: P = L;
  - else: egalSearch(R, L, T).
(Figure: convex hulls of hypothetical SGs illustrating the cases P = R and P = L relative to the egalitarian line.)
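A schematic sketch of that case analysis (not the authors' implementation: R and L are the payoff pairs of the friend1/friend2 solutions, and `left_of_egal_line` and `egal_search` are placeholders for the paper's geometric test and bisection subroutine):

```python
def folk_egal(R, L, left_of_egal_line, egal_search, T):
    """Return the point P with the highest egalitarian value (sketch)."""
    if left_of_egal_line(R):
        return R                     # P = R
    if not left_of_egal_line(L):     # i.e. L is right of the egalitarian line
        return L                     # P = L
    return egal_search(R, L, T)      # otherwise bisect between R and L for T steps
```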
Complexity
- The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time.
- The algorithm is polynomial iff T is bounded by a polynomial.
- Result: the running time is polynomial in 1 / (1 − γ) (the discount-factor term) and in 1 / ε (the approximation factor).
SG version of the PD game

  Algorithm     Agent A / Agent B   Behaviour
  security-VI   46.5                mutual defection
  friend-VI     46                  mutual defection
  CE-VI         46.5                mutual defection
  folkEgal      88.8                mutual cooperation with threat of defection
Compromise game

  Algorithm     Agent A   Agent B   Behaviour
  security-VI   0         0         attacker blocking the goal
  friend-VI     -20       -20       mutual defection
  CE-VI         68.2      70.1      suboptimal waiting strategy
  folkEgal      78.7      78.7      mutual cooperation (w=0.5) with threat of defection
Asymmetric game

  Algorithm     Agent A   Agent B   Behaviour
  security-VI   0         0         attacker blocking the goal
  friend-VI     -20       0         mutual defection
  CE-VI         32.1      32.1      suboptimal mutual cooperation
  folkEgal      37.2      37.2      mutual cooperation with threat of defection
Thanks for your attention!