1
Software Multiagent Systems: CS543 Milind Tambe University of Southern California tambe@usc.edu
2
Dimensions of Multiagent Learning
– Ignore others' learning vs model others' learning
– Cooperative vs competitive
– Cooperative: learn to coordinate with others; learning organizational roles
– Competitive (conflicting learning goals): learning to play better against an adversary; opponent modeling
We will focus on reinforcement learning: Q-learning methods
3
Some Terminology Q-learning Model-free vs Model-based
4
Q-learning
Q-values: Q(s,a)
Related to utility values: U(s) = max_a Q(s,a)
The following equation must hold at equilibrium:
Q(i,a) = R(i) + Σ_j P(j|i,a) · max_a' Q(j,a')
Requires learning a model!
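As an illustration, a minimal sketch of one synchronous sweep of this equilibrium update over a known (learned) model; the dictionary layout for R and P and the function name are assumptions, not from the slides:

```python
# One synchronous sweep of Q(i,a) = R(i) + sum_j P(j|i,a) * max_a' Q(j,a').
# Assumes: R[i] is the reward of state i, P[(i, a)] maps successor states j
# to probabilities, and Q[(i, a)] holds the current estimates.

def q_equilibrium_sweep(Q, R, P, states, actions):
    new_Q = {}
    for i in states:
        for a in actions:
            expected_best = sum(
                prob * max(Q[(j, b)] for b in actions)
                for j, prob in P[(i, a)].items()
            )
            new_Q[(i, a)] = R[i] + expected_best
    return new_Q
```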
5
TD Q-learning
The update equation for TD Q-learning is:
Q(i,a) ← Q(i,a) + α (R(i) + max_a' Q(j,a') − Q(i,a))
What if α = 0? What if α = 1?
6
Q-Learning Agent
Q-learning-agent(e) returns an action
  e: the percept
  Q: table of action values
  N: table of state-action frequencies
  a: the last action
  I: the previous state
1. J ← STATE[e]
2. N[I,a] ← N[I,a] + 1
3. Q[I,a] ← Q[I,a] + α (R(I) + max_a' Q[J,a'] − Q[I,a])
4. I ← J
5. Return the action a' that maximizes f(Q[J,a'], N[J,a'])
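A minimal runnable Python sketch of this agent, assuming discrete states and actions and an exploration function f supplied by the caller (class and variable names are illustrative):

```python
from collections import defaultdict

class QLearningAgent:
    """Sketch of the Q-learning-agent pseudocode above (table-based, model-free).
    The exploration function f is supplied separately (see the next slide)."""

    def __init__(self, actions, explore_fn, alpha=0.1):
        self.actions = actions
        self.explore_fn = explore_fn   # f(Q[j,a'], N[j,a']) -> score
        self.alpha = alpha             # learning rate
        self.Q = defaultdict(float)    # Q[(state, action)]
        self.N = defaultdict(int)      # visit counts N[(state, action)]
        self.i = None                  # previous state I
        self.a = None                  # last action a

    def step(self, j, r_i):
        """One call per percept: j is the new state, r_i the reward at the previous state."""
        if self.i is not None:
            self.N[(self.i, self.a)] += 1                                  # step 2
            best_next = max(self.Q[(j, b)] for b in self.actions)
            self.Q[(self.i, self.a)] += self.alpha * (
                r_i + best_next - self.Q[(self.i, self.a)])                # step 3
        self.i = j                                                          # step 4
        self.a = max(self.actions,                                          # step 5
                     key=lambda b: self.explore_fn(self.Q[(j, b)], self.N[(j, b)]))
        return self.a
```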
7
Choosing an Action…
Step 5: choosing the best action to take in state J (a' is the action chosen using f(Q[J,a'], N[J,a'])).
Suppose all Q values are initially zero, and f(Q[J,a'], N[J,a']) simply chooses max_a' Q[J,a'].
Suppose after the first exploration: Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0.
With this purely greedy f, the agent will keep choosing A1 and never try A2, A3, or A4 again.
8
Exploration vs Exploitation
Tradeoff: immediate good (exploit) vs long-term good (explore)
Continuous exploration vs getting stuck on a well-known path
Key question: how to balance the two?
One approach:
– Give some weight to actions not tried often
– Avoid actions that are of low utility
9
Exploration
Giving "weight" to actions not tried very often:
f(Q(a',j), N[a',j]) = argmax_a' G(Q(a',j), N[a',j])
G returns:
– a very high reward "R" if N[a',j] < N-VISITS
– otherwise Q(a',j)
What will be the result of such a function G?
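A sketch of this exploration function in Python; the constant values for N-VISITS and the optimistic reward are illustrative assumptions, not values given on the slides:

```python
# Optimistic exploration function G: a very high reward for rarely tried
# actions, otherwise the learned Q value.
N_VISITS = 5       # how many tries before we trust the Q estimate (assumed)
R_PLUS = 100.0     # optimistic estimate of the best possible reward (assumed)

def G(q_value, visit_count):
    return R_PLUS if visit_count < N_VISITS else q_value

# Plugging G into the agent sketch above makes it try every action at least
# N_VISITS times before settling on the greedy choice:
# agent = QLearningAgent(actions=["A1", "A2", "A3", "A4"], explore_fn=G)
```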
10
Two Frameworks for Multiagent "Learning"
– DCOP: exploration + exploitation (paper to be posted on the web site) [Jain et al., IJCAI'09]
– Stochastic games: multiagent learning to reach Nash equilibrium (in our readings)
11
DCOP Framework
Constraint graph: a1 – a2 – a3, with a reward matrix on each edge (example rewards 20, 0, 0, 5 for the joint assignments on edge (a1,a2); 20, 0, 0, 10 on edge (a2,a3)).
– Assign values to distributed variables
– Optimize total reward
– No central control
12
DCOPs for Mobile Sensor Networks (with Lockheed ATL)
13
New Challenges
– Reward matrices unknown: algorithms must explore the environment
– Maximize total cumulative signal strength: changes how DCOP algorithms are evaluated
– Limited time horizon: cannot explore everything; need horizon-aware DCOPs
14
DCOP Framework: Reward Matrix Unknown
Constraint graph: a1 – a2 – a3, with mostly unknown reward matrices (only a few entries, e.g. 5 on edge (a1,a2) and 10 on edge (a2,a3), have been observed).
– Assigning values to variables = exploration
– Exploration takes time (physical movement)
– Limited time; full exploration impossible
15
Three New Algorithms
Based on MGM (Maximum Gain Message): hill climbing; agents communicate their possible gain to neighbors, and the agent with maximum gain "moves" (e.g., a1 with gain 20 moves before a3 with gain 15).
Proposed new algorithms (see the sketch below):
– SE-Optimistic: unexplored domain values are assumed to yield the maximum reward. Optimistic: maximal potential gain messaging; exploration maximized, always looking for the max value.
– SE-Mean: unexplored domain values are assumed to yield the mean reward. "Realistic": limits exploration, satisfied by the mean.
– BE-Backtrack: lookahead given a distribution over rewards. Intelligent: decision-theoretic limit on exploration.
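A hedged sketch of how an agent might score unexplored domain values under SE-Optimistic vs SE-Mean; the function name, dictionary layout, and the R_MAX constant are illustrative assumptions, not the paper's code:

```python
R_MAX = 20.0   # assumed maximum possible reward on any constraint

def estimate(value, explored_rewards, strategy="SE-optimistic"):
    """explored_rewards maps a domain value to its measured reward.
    Unexplored values are scored optimistically (R_MAX) or by the running mean."""
    if value in explored_rewards:
        return explored_rewards[value]
    if strategy == "SE-optimistic":
        return R_MAX
    # SE-mean: assume unexplored values yield the mean of what we've seen so far
    if explored_rewards:
        return sum(explored_rewards.values()) / len(explored_rewards)
    return 0.0
```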
16
DCOP Framework: Reward Matrix Unknown
Same constraint graph a1 – a2 – a3 with partially explored reward matrices (observed entries: 5 and 10).
What if 20 is the maximum possible reward? SE-Optimistic: how will it work?
17
Lookahead
The agent decides whether to 'explore' or to 'backtrack' to an already explored state.
Let Rb be the best reward among explored states.
The agent will explore for T units only if EU(explore) > EU(backtrack).
Expected utility of backtrack: EU(backtrack) = Rb · T
18
Lookahead
The expected utility of exploring is calculated using P(x, n, t_e), the first-order statistic for the maximum reward found being x after t_e exploration trials.
EU(explore) is the sum of three terms:
– the utility accumulated while exploring
– the utility of finding a better reward than the current Rb
– the utility of failing to find a better reward than the current Rb
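A hedged sketch of the backtrack-vs-explore comparison under an assumed discrete reward distribution; it uses the order statistic F(x)^t_e for the maximum of t_e samples and is not the paper's exact formula (the split between total horizon T and exploration steps t_e is also an assumption):

```python
def eu_backtrack(r_b, T):
    return r_b * T

def eu_explore(r_b, T, t_e, reward_values, reward_probs):
    """Explore for t_e steps, then exploit the best reward found for T - t_e steps."""
    def cdf(x):  # cdf of a single reward sample
        return sum(p for v, p in zip(reward_values, reward_probs) if v <= x)

    expected_exploit = 0.0
    prev = 0.0
    for x in sorted(reward_values):
        p_max_is_x = cdf(x) ** t_e - prev          # P(best of t_e samples == x)
        prev = cdf(x) ** t_e
        # The agent can always fall back to r_b if exploration finds nothing better.
        expected_exploit += p_max_is_x * max(x, r_b) * (T - t_e)
    return expected_exploit   # rewards earned during exploration omitted for brevity

# Explore only if eu_explore(...) > eu_backtrack(...)
```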
19
Sample Results (Jain et al., IJCAI'09): a decision-theoretic approach to exploration, interleaved with DCOPs.
20
Towards Multiagent Learning: Stochastic Games
– Generalize distributed POMDPs: different payoffs for each player, not a common payoff
– Focus on the two-person stochastic game
– Learning algorithms for stochastic games
21
Stochastic 2-Player Game
– States: S
– Action sets for each player: A1, A2
– Transition probabilities: P(s'|s, a1, a2)
– Rewards: two separate rewards, R1(s, a1, a2) and R2(s, a1, a2), each depending on the actions of both agents
– If R1(s, a1, a2) + R2(s, a1, a2) = 0, then it is a zero-sum game
– State is observable (MDP-like)
– Each player maximizes its own (discounted) sum of rewards
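A minimal Python container for this definition (field names and dictionary layout are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class StochasticGame:
    states: list                              # S
    actions1: list                            # A1
    actions2: list                            # A2
    P: dict = field(default_factory=dict)     # P[(s, a1, a2)] -> {s': prob}
    R1: dict = field(default_factory=dict)    # R1[(s, a1, a2)] -> reward for player 1
    R2: dict = field(default_factory=dict)    # R2[(s, a1, a2)] -> reward for player 2

    def is_zero_sum(self):
        # Zero sum iff R1 + R2 = 0 for every (state, joint action)
        return all(abs(self.R1[k] + self.R2[k]) < 1e-9 for k in self.R1)
```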
22
Stochastic Game
Transition diagram: from state s0, the joint action (a1, a2) leads to s1 with probability P(s1|s0, a1, a2) and to s2 with probability P(s2|s0, a1, a2); each state s carries its own rewards R1(s), R2(s).
The reward function depends on the state!
23
Stochastic Game How are repeated games related to stochastic games?
24
Stochastic Game
Strategies = policies.
Since rewards differ for each agent, expected values differ as well.
v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents pursue policies π1 and π2.
A Nash equilibrium in a stochastic game is a pair of strategies (π1*, π2*) such that for all states s:
v1(s, π1*, π2*) >= v1(s, π1, π2*) for all π1, and
v2(s, π1*, π2*) >= v2(s, π1*, π2) for all π2
25
Nash Equilibrium Policies
In stochastic games, we focus on policies that attain a Nash equilibrium.
If we do not find a Nash equilibrium, players may have an incentive to deviate; the search for stability is critical.
Policies may be randomized; they need not be deterministic.
26
Example Stochastic Game
– The goalie can move or stay; the shooter can move or shoot
– Zero-sum game: a goal is worth 10 points to the shooter, a block is worth 5 points to the goalie
– Field: a 2×3 grid of cells (Cell 1, Cell 3, Cell 5 over Cell 2, Cell 4, Cell 6), with the GOAL adjacent to Cells 1 and 2
27
Work out example
28
Q-Learning in Stochastic Games
Nash-Q algorithm:
Q1(s, a1, a2): Q value of agent 1 for state s; Q2(s, a1, a2): Q value of agent 2 for state s.
Optimal Q values:
Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V1(s', π1*, π2*)
Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V2(s', π1*, π2*)
29
Example
30
Algorithm
Consider two agents. Each agent maintains m Q-tables, where m is the number of states.
For each state, the Q-table has |A1| × |A2| entries: |A1| for my actions, |A2| for the other agent's actions.
Each agent keeps Q-tables both for itself and for the other agent.
31
Key Observation
In state s', the bimatrix (Q1[s'], Q2[s']) defines a game.
We can find a mixed-strategy Nash equilibrium for this game.
A mixed-strategy Nash equilibrium provides a probability distribution over which action to execute.
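As an illustration, a sketch of computing a fully mixed Nash equilibrium of a 2×2 bimatrix game such as (Q1[s'], Q2[s']); it assumes an interior mixed equilibrium exists and is not a general solver (real implementations also check pure equilibria, e.g. via support enumeration or Lemke-Howson):

```python
def mixed_ne_2x2(A, B):
    """A = row player's 2x2 payoff matrix, B = column player's 2x2 payoff matrix.
    Each player mixes so as to make the other player indifferent."""
    # Row player's probability p on row 0 makes the column player indifferent.
    p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])
    # Column player's probability q on column 0 makes the row player indifferent.
    q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    return (p, 1 - p), (q, 1 - q)

# Example: matching pennies (zero sum) has the unique mixed NE (0.5, 0.5) for each player.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
print(mixed_ne_2x2(A, B))   # ((0.5, 0.5), (0.5, 0.5))
```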
32
Multiagent Q-Learning
Initialize the Q tables. Loop:
– Choose action a1 based on π1(s), a mixed-strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))
– Observe r1, r2, a2, s'
– Update Q1(s) and Q2(s) using the equation below:
Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ Z1 − Q1(s, a1, a2))
where Z1 = the expected reward under the Nash equilibrium in state s' of the game (Q1(s'), Q2(s'))
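A hedged sketch of one such update for agent 1 (agent 2 is symmetric), assuming a helper nash_value(s') that returns agent 1's expected payoff at the mixed Nash equilibrium of the stage game (Q1(s'), Q2(s')), for example built from mixed_ne_2x2 above; alpha and the discount lam are assumed constants:

```python
def nash_q_update(Q1, s, a1, a2, r1, s_next, nash_value, alpha=0.1, lam=0.9):
    """Q1 is a dict keyed by (state, a1, a2), e.g. a defaultdict(float)."""
    z1 = nash_value(s_next)                     # expected NE payoff Z1 in s'
    old = Q1[(s, a1, a2)]
    Q1[(s, a1, a2)] = old + alpha * (r1 + lam * z1 - old)
```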
33
What do we end up with? Agents converging to the Nash equilibrium.
34
Towards Multiagent Learning
Learning as a "single agent" in a multiagent setting:
– Ignore other agents except for some property, like location
– Ignore that other agents act intentionally and adapt
Advantages: simpler; converges more easily
35
Single Agent in a Multiagent Setting
RoboCup Soccer Simulation League: players use model-free reinforcement learning to intercept the ball, learning "online" during the game.
36
Finding #1: Online Learning Specialized by Opponent Same player position against two different RoboCup teams: Player 1 (forward) against CMUnited and Andhill Against CMUnited, player turns more aggressively
37
Finding #2: Online Learning Specialized by Role Same opponent team, different player roles: Player 1 (forward) and Player 10 (fullback) against CMUnited
38
Lessons Learned Surprise in tests against opponent teams: Significant specialization of intercept with both role & opponent Lesson: Transfer of experience or cross-training may be detrimental