1
Software Multiagent Systems: CS543 Milind Tambe University of Southern California tambe@usc.edu
2
Dimensions of Multiagent Learning
– Ignore others' learning vs model others' learning
– Cooperative vs competitive
– Cooperative: learn to coordinate with others; learning organizational roles
– Competitive (conflicting learning goals): learning to play better against an adversary; opponent modeling
We will focus on reinforcement learning: Q-learning methods
3
Some Terminology Q-learning Model-free vs Model-based
4
Q-learning
Q-values: Q(s,a)
Related to utility values: U(s) = max_a Q(s,a)
The following equation must hold at equilibrium:
Q(i,a) = R(i) + Σ_j P(j|i,a) · max_a' Q(j,a')
Requires learning a model!
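As an illustration, a minimal sketch of one synchronous sweep of this equilibrium update over a known (learned) model; the dictionary layout for R and P and the function name are assumptions, not from the slides:

```python
# One synchronous sweep of Q(i,a) = R(i) + sum_j P(j|i,a) * max_a' Q(j,a').
# Assumes: R[i] is the reward of state i, P[(i, a)] maps successor states j
# to probabilities, and Q[(i, a)] holds the current estimates.

def q_equilibrium_sweep(Q, R, P, states, actions):
    new_Q = {}
    for i in states:
        for a in actions:
            expected_best = sum(
                prob * max(Q[(j, b)] for b in actions)
                for j, prob in P[(i, a)].items()
            )
            new_Q[(i, a)] = R[i] + expected_best
    return new_Q
```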
5
TD Q-learning
The update equation for TD Q-learning is:
Q(i,a) ← Q(i,a) + α (R(i) + max_a' Q(j,a') − Q(i,a))
What if α = 0? What if α = 1?
6
Q-Learning Agent
Q-learning-agent(e) returns an action
  e: the percept
  Q: table of action values
  N: table of state-action frequencies
  a: the last action
  I: the previous state
1. J ← STATE[e]
2. N[I,a] ← N[I,a] + 1
3. Q[I,a] ← Q[I,a] + α (R(I) + max_a' Q[J,a'] − Q[I,a])
4. I ← J
5. Return the action a' that maximizes f(Q[J,a'], N[J,a'])
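A minimal runnable Python sketch of this agent, assuming discrete states and actions and an exploration function f supplied by the caller (class and variable names are illustrative):

```python
from collections import defaultdict

class QLearningAgent:
    """Sketch of the Q-learning-agent pseudocode above (table-based, model-free).
    The exploration function f is supplied separately (see the next slide)."""

    def __init__(self, actions, explore_fn, alpha=0.1):
        self.actions = actions
        self.explore_fn = explore_fn   # f(Q[j,a'], N[j,a']) -> score
        self.alpha = alpha             # learning rate
        self.Q = defaultdict(float)    # Q[(state, action)]
        self.N = defaultdict(int)      # visit counts N[(state, action)]
        self.i = None                  # previous state I
        self.a = None                  # last action a

    def step(self, j, r_i):
        """One call per percept: j is the new state, r_i the reward at the previous state."""
        if self.i is not None:
            self.N[(self.i, self.a)] += 1                                  # step 2
            best_next = max(self.Q[(j, b)] for b in self.actions)
            self.Q[(self.i, self.a)] += self.alpha * (
                r_i + best_next - self.Q[(self.i, self.a)])                # step 3
        self.i = j                                                          # step 4
        self.a = max(self.actions,                                          # step 5
                     key=lambda b: self.explore_fn(self.Q[(j, b)], self.N[(j, b)]))
        return self.a
```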
7
Choosing an Action…
Step 5: choosing the best action to take in state J (a' is the action chosen using f(Q[J,a'], N[J,a'])).
Suppose all Q values are initially zero, and f(Q[J,a'], N[J,a']) simply chooses max_a' Q[J,a'].
Suppose after the first exploration: Q[J,A1] = 10, Q[J,A2] = 0, Q[J,A3] = 0, Q[J,A4] = 0.
With this purely greedy f, the agent will keep choosing A1 and never try A2, A3, or A4 again.
8
Exploration vs Exploitation
Tradeoff: immediate good (exploit) vs long-term good (explore)
Continuous exploration vs getting stuck on a well-known path
Key question: how to balance the two?
One approach:
– Give some weight to actions not tried often
– Avoid actions that are of low utility
9
Exploration
Giving "weight" to actions not tried very often:
f(Q(a',j), N[a',j]) = argmax_a' G(Q(a',j), N[a',j])
G returns:
– a very high reward "R" if N[a',j] < N-VISITS
– otherwise Q(a',j)
What will be the result of such a function G?
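A sketch of this exploration function in Python; the constant values for N-VISITS and the optimistic reward are illustrative assumptions, not values given on the slides:

```python
# Optimistic exploration function G: a very high reward for rarely tried
# actions, otherwise the learned Q value.
N_VISITS = 5       # how many tries before we trust the Q estimate (assumed)
R_PLUS = 100.0     # optimistic estimate of the best possible reward (assumed)

def G(q_value, visit_count):
    return R_PLUS if visit_count < N_VISITS else q_value

# Plugging G into the agent sketch above makes it try every action at least
# N_VISITS times before settling on the greedy choice:
# agent = QLearningAgent(actions=["A1", "A2", "A3", "A4"], explore_fn=G)
```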
10
Two Frameworks for Multiagent "Learning"
– DCOP: exploration + exploitation (paper to be posted on the web site) [Jain et al., IJCAI'09]
– Stochastic games: multiagent learning to reach Nash equilibrium (in our readings)
11
DCOP Framework
Constraint graph: a1 – a2 – a3, with a reward matrix on each edge (example rewards 20, 0, 0, 5 for the joint assignments on edge (a1,a2); 20, 0, 0, 10 on edge (a2,a3)).
– Assign values to distributed variables
– Optimize total reward
– No central control
12
DCOPs for Mobile Sensor Networks (with Lockheed ATL)
13
New Challenges
– Reward matrices unknown: algorithms must explore the environment
– Maximize total cumulative signal strength: changes how DCOP algorithms are evaluated
– Limited time horizon: cannot explore everything; need horizon-aware DCOPs
14
DCOP Framework: Reward Matrix Unknown
Constraint graph: a1 – a2 – a3, with mostly unknown reward matrices (only a few entries, e.g. 5 on edge (a1,a2) and 10 on edge (a2,a3), have been observed).
– Assigning values to variables = exploration
– Exploration takes time (physical movement)
– Limited time; full exploration impossible
15
Three New Algorithms
Based on MGM (Maximum Gain Message): hill climbing; agents communicate their possible gain to neighbors, and the agent with maximum gain "moves" (e.g., a1 with gain 20 moves before a3 with gain 15).
Proposed new algorithms (see the sketch below):
– SE-Optimistic: unexplored domain values are assumed to yield the maximum reward. Optimistic: maximal potential gain messaging; exploration maximized, always looking for the max value.
– SE-Mean: unexplored domain values are assumed to yield the mean reward. "Realistic": limits exploration, satisfied by the mean.
– BE-Backtrack: lookahead given a distribution over rewards. Intelligent: decision-theoretic limit on exploration.
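A hedged sketch of how an agent might score unexplored domain values under SE-Optimistic vs SE-Mean; the function name, dictionary layout, and the R_MAX constant are illustrative assumptions, not the paper's code:

```python
R_MAX = 20.0   # assumed maximum possible reward on any constraint

def estimate(value, explored_rewards, strategy="SE-optimistic"):
    """explored_rewards maps a domain value to its measured reward.
    Unexplored values are scored optimistically (R_MAX) or by the running mean."""
    if value in explored_rewards:
        return explored_rewards[value]
    if strategy == "SE-optimistic":
        return R_MAX
    # SE-mean: assume unexplored values yield the mean of what we've seen so far
    if explored_rewards:
        return sum(explored_rewards.values()) / len(explored_rewards)
    return 0.0
```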
16
DCOP Framework: Reward Matrix Unknown
Same constraint graph a1 – a2 – a3 with partially explored reward matrices (observed entries: 5 and 10).
What if 20 is the maximum possible reward? SE-Optimistic: how will it work?
17
Lookahead
The agent decides whether to 'explore' or to 'backtrack' to an already explored state.
Let Rb be the best reward among explored states.
The agent will explore for T units only if EU(explore) > EU(backtrack).
Expected utility of backtrack: EU(backtrack) = Rb · T
18
Lookahead
The expected utility of exploring is calculated using P(x, n, t_e), the first-order statistic for the maximum reward found being x after t_e exploration trials.
EU(explore) is the sum of three terms:
– the utility accumulated while exploring
– the utility of finding a better reward than the current Rb
– the utility of failing to find a better reward than the current Rb
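A hedged sketch of the backtrack-vs-explore comparison under an assumed discrete reward distribution; it uses the order statistic F(x)^t_e for the maximum of t_e samples and is not the paper's exact formula (the split between total horizon T and exploration steps t_e is also an assumption):

```python
def eu_backtrack(r_b, T):
    return r_b * T

def eu_explore(r_b, T, t_e, reward_values, reward_probs):
    """Explore for t_e steps, then exploit the best reward found for T - t_e steps."""
    def cdf(x):  # cdf of a single reward sample
        return sum(p for v, p in zip(reward_values, reward_probs) if v <= x)

    expected_exploit = 0.0
    prev = 0.0
    for x in sorted(reward_values):
        p_max_is_x = cdf(x) ** t_e - prev          # P(best of t_e samples == x)
        prev = cdf(x) ** t_e
        # The agent can always fall back to r_b if exploration finds nothing better.
        expected_exploit += p_max_is_x * max(x, r_b) * (T - t_e)
    return expected_exploit   # rewards earned during exploration omitted for brevity

# Explore only if eu_explore(...) > eu_backtrack(...)
```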
19
Sample Results (Jain et al., IJCAI'09): a decision-theoretic approach to exploration, interleaved with DCOPs.
20
Towards Multiagent Learning: Stochastic Games
– Generalize distributed POMDPs: different payoffs for each player, not a common payoff
– Focus on the two-person stochastic game
– Learning algorithms for stochastic games
21
Stochastic 2-Player Game
– States: S
– Action sets for each player: A1, A2
– Transition probabilities: P(s'|s, a1, a2)
– Rewards: two separate rewards, R1(s, a1, a2) and R2(s, a1, a2), each depending on the actions of both agents
– If R1(s, a1, a2) + R2(s, a1, a2) = 0, then it is a zero-sum game
– State is observable (MDP-like)
– Each player maximizes its own (discounted) sum of rewards
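A minimal Python container for this definition (field names and dictionary layout are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class StochasticGame:
    states: list                              # S
    actions1: list                            # A1
    actions2: list                            # A2
    P: dict = field(default_factory=dict)     # P[(s, a1, a2)] -> {s': prob}
    R1: dict = field(default_factory=dict)    # R1[(s, a1, a2)] -> reward for player 1
    R2: dict = field(default_factory=dict)    # R2[(s, a1, a2)] -> reward for player 2

    def is_zero_sum(self):
        # Zero sum iff R1 + R2 = 0 for every (state, joint action)
        return all(abs(self.R1[k] + self.R2[k]) < 1e-9 for k in self.R1)
```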
22
Stochastic Game
Transition diagram: from state s0, the joint action (a1, a2) leads to s1 with probability P(s1|s0, a1, a2) and to s2 with probability P(s2|s0, a1, a2); each state s carries its own rewards R1(s), R2(s).
The reward function depends on the state!
23
Stochastic Game How are repeated games related to stochastic games?
24
Stochastic Game
Strategies = policies.
Since rewards differ for each agent, expected values differ as well.
v1(s, π1, π2) gives the expected value for agent 1 in state s, given that the agents pursue policies π1 and π2.
A Nash equilibrium in a stochastic game is a pair of strategies (π1*, π2*) such that for all states s:
v1(s, π1*, π2*) >= v1(s, π1, π2*) for all π1, and
v2(s, π1*, π2*) >= v2(s, π1*, π2) for all π2
25
Nash Equilibrium Policies
In stochastic games, we focus on policies that attain a Nash equilibrium.
If we do not find a Nash equilibrium, players may have an incentive to deviate; the search for stability is critical.
Policies may be randomized; they need not be deterministic.
26
Example Stochastic Game
– The goalie can move or stay; the shooter can move or shoot
– Zero-sum game: a goal is worth 10 points to the shooter, a block is worth 5 points to the goalie
– Field: a 2×3 grid of cells (Cell 1, Cell 3, Cell 5 over Cell 2, Cell 4, Cell 6), with the GOAL adjacent to Cells 1 and 2
27
Work out example
28
Q-Learning in Stochastic Games
Nash-Q algorithm:
Q1(s, a1, a2): Q value of agent 1 for state s; Q2(s, a1, a2): Q value of agent 2 for state s.
Optimal Q values:
Q1*(s, a1, a2) = R1(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V1(s', π1*, π2*)
Q2*(s, a1, a2) = R2(s, a1, a2) + λ Σ_s' P(s'|s, a1, a2) · V2(s', π1*, π2*)
29
Example
30
Algorithm
Consider two agents. Each agent maintains m Q-tables, where m is the number of states.
For each state, the Q-table has |A1| × |A2| entries: |A1| for my actions, |A2| for the other agent's actions.
Each agent keeps Q-tables both for itself and for the other agent.
31
Key Observation
In state s', the bimatrix (Q1[s'], Q2[s']) defines a game.
We can find a mixed-strategy Nash equilibrium for this game.
A mixed-strategy Nash equilibrium provides a probability distribution over which action to execute.
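As an illustration, a sketch of computing a fully mixed Nash equilibrium of a 2×2 bimatrix game such as (Q1[s'], Q2[s']); it assumes an interior mixed equilibrium exists and is not a general solver (real implementations also check pure equilibria, e.g. via support enumeration or Lemke-Howson):

```python
def mixed_ne_2x2(A, B):
    """A = row player's 2x2 payoff matrix, B = column player's 2x2 payoff matrix.
    Each player mixes so as to make the other player indifferent."""
    # Row player's probability p on row 0 makes the column player indifferent.
    p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])
    # Column player's probability q on column 0 makes the row player indifferent.
    q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    return (p, 1 - p), (q, 1 - q)

# Example: matching pennies (zero sum) has the unique mixed NE (0.5, 0.5) for each player.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
print(mixed_ne_2x2(A, B))   # ((0.5, 0.5), (0.5, 0.5))
```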
32
Multiagent Q-Learning
Initialize the Q tables. Loop:
– Choose action a1 based on π1(s), a mixed-strategy Nash equilibrium of the game defined by (Q1(s), Q2(s))
– Observe r1, r2, a2, s'
– Update Q1(s) and Q2(s) using the equation below:
Q1(s, a1, a2) ← Q1(s, a1, a2) + α (r1 + λ Z1 − Q1(s, a1, a2))
where Z1 = the expected reward under the Nash equilibrium in state s' of the game (Q1(s'), Q2(s'))
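A hedged sketch of one such update for agent 1 (agent 2 is symmetric), assuming a helper nash_value(s') that returns agent 1's expected payoff at the mixed Nash equilibrium of the stage game (Q1(s'), Q2(s')), for example built from mixed_ne_2x2 above; alpha and the discount lam are assumed constants:

```python
def nash_q_update(Q1, s, a1, a2, r1, s_next, nash_value, alpha=0.1, lam=0.9):
    """Q1 is a dict keyed by (state, a1, a2), e.g. a defaultdict(float)."""
    z1 = nash_value(s_next)                     # expected NE payoff Z1 in s'
    old = Q1[(s, a1, a2)]
    Q1[(s, a1, a2)] = old + alpha * (r1 + lam * z1 - old)
```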
33
What do we end up with? Agents converging to the Nash equilibrium.
34
Towards Multiagent Learning
Learning as a "single agent" in a multiagent setting:
– Ignore other agents except for some property, like location
– Ignore that other agents act intentionally and adapt
Advantages: simpler; converges more easily
35
Single Agent in a Multiagent Setting
RoboCup Soccer Simulation League: players use model-free reinforcement learning to intercept the ball, learning "online" during the game.
36
Finding #1: Online Learning Specialized by Opponent Same player position against two different RoboCup teams: Player 1 (forward) against CMUnited and Andhill Against CMUnited, player turns more aggressively
37
Finding #2: Online Learning Specialized by Role Same opponent team, different player roles: Player 1 (forward) and Player 10 (fullback) against CMUnited
38
Lessons Learned Surprise in tests against opponent teams: Significant specialization of intercept with both role & opponent Lesson: Transfer of experience or cross-training may be detrimental