1
Extending Implicit Negotiation to Repeated Grid Games
Robin Carnow
Computer Science Department, Rutgers University
2
Important Agent Coordination Research Questions
- Which is characterized by greater stability, fixed or dynamic leadership?
- Can implicit negotiation be used to coordinate effectively in a large state space?
- Will a dominating or a trusting leader establish a higher-yielding coordination point?
- How important is it for a follower agent to know when its leader is punishing it?
3
General Sum Markov Games
Definition of a Markov game: a tuple (I, S, A_i(s), P, R_i)
- I is a set of n players
- S is a finite set of states
- A_i(s) is the i-th player's set of actions at state s ∈ S
- P is a probability distribution over transitions, conditioned on the current state and the joint actions taken
- R_i(s, a) is the i-th player's reward for state s and joint actions a
- A(s) = A_1(s) × … × A_n(s)
General sum: unlike constant-sum or zero-sum games, the players' rewards need not sum to a constant.
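The tuple above maps naturally onto a small data structure. Below is a minimal sketch, in Python, of how the components (I, S, A_i(s), P, R_i) could be represented; the field and type names are illustrative assumptions, not part of the original formulation.

```python
# A sketch of the general-sum Markov game tuple (I, S, A_i(s), P, R_i).
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
JointAction = Tuple[str, ...]  # one action per player

@dataclass
class MarkovGame:
    players: List[int]                                   # I: the n players
    states: List[State]                                  # S: finite set of states
    actions: Callable[[int, State], List[str]]           # A_i(s): player i's actions at s
    transition: Callable[[State, JointAction], Dict[State, float]]  # P(s' | s, a)
    reward: Callable[[int, State, JointAction], float]   # R_i(s, a); general sum, so
                                                         # rewards need not sum to a constant
```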
4
Testbed: Grid Game 2 (Hu & Wellman, 2000)
Description
- Two agents, one goal
- Action set: N, S, E, W, SayPut
- Surpassing barriers succeeds with Pr(.5)
Rewards
- Move cost: 0
- Collision: -1
- Uncoordinated goal: 100
- Coordinated goal: 200
Equilibria
- Original: asymmetric (i.e. "Chicken")
  - Bold agent's expected reward = (100 + 200) / 2 = 150
  - Chicken's expected reward = (0 + 200) / 2 = 100
- New: symmetric
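The reward structure above is simple enough to state directly in code. The following is a hedged sketch of one agent's per-step reward in Grid Game 2; the function name and boolean arguments are assumptions made for illustration, and the grid layout itself is not modeled.

```python
def step_reward(reached_goal: bool, collided: bool, coordinated: bool) -> int:
    """Per-step reward for one agent in Grid Game 2: moves are free,
    collisions cost 1, and reaching the goal pays 100 (200 if the
    agents reach it in a coordinated way)."""
    if collided:
        return -1
    if reached_goal:
        return 200 if coordinated else 100
    return 0  # ordinary move: no cost
```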
5
Leaders
Bully
- Greedy
- Expects follower to stay out of the way
Godfather
- Generalization of Tit-for-Tat
- Two moods: trusting or angry
- Expects follower to fulfill its half of a targetable pair
Both assume the follower is implementing best-response learning
- This is Nash-like (Littman & Stone, 2001)
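A minimal sketch of the Godfather's two-mood, Tit-for-Tat-style switching follows; the class and method names are assumptions, and whether the follower fulfilled its half of the targetable pair is passed in as a boolean rather than computed here.

```python
class Godfather:
    """Generalized Tit-for-Tat leader with two moods: trusting or angry."""

    def __init__(self):
        self.mood = "trusting"

    def update_mood(self, follower_fulfilled_target: bool) -> None:
        # Stay trusting while the follower plays its half of the targetable
        # pair; switch to angry (punishing) as soon as it deviates.
        self.mood = "trusting" if follower_fulfilled_target else "angry"

    def act(self, target_action, punish_action):
        # Play toward the targeted coordination point when trusting,
        # otherwise play the punishment action.
        return target_action if self.mood == "trusting" else punish_action
```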
6
Followers: Q0 and Q1
Both implement extended versions of the Bellman equation:
  Q_i(s, a) = (1 − γ) R_i(s, a) + γ Σ_{s'} Pr[s' | s, a] V_i(s')
where
- γ is the discount rate (eliminates the infinite-sum problem)
- V_i(s') = max_{a ∈ A(s')} Q_i(s', a)
- This is analogous to Friend-Q (Littman, 2001)
Update of Q-values:
  Q_i(s, a) ← (1 − α) Q_i(s, a) + α [(1 − γ) R_i(s, a) + γ V_i(s')]
where α is the learning rate (trades off new vs. old experience).
Action choice: greedy policy.
Q0 cannot see Godfather's mood; Q1 can.
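The update rule above translates directly into a tabular learner. Here is a minimal sketch of that update; the helper names are assumptions, and the joint-action sets must be supplied by the environment.

```python
from collections import defaultdict

def make_q_learner(alpha: float = 0.1, gamma: float = 0.9):
    Q = defaultdict(float)  # Q[(state, joint_action)], initialized to 0

    def value(state, joint_actions) -> float:
        # V_i(s) = max over a in A(s) of Q_i(s, a), as in Friend-Q
        return max((Q[(state, a)] for a in joint_actions), default=0.0)

    def update(s, a, reward, s_next, next_joint_actions) -> None:
        # Q_i(s,a) <- (1 - alpha) Q_i(s,a) + alpha [(1 - gamma) R_i(s,a) + gamma V_i(s')]
        target = (1 - gamma) * reward + gamma * value(s_next, next_joint_actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

    return Q, value, update
```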
7
Experiments
- Matches: Bully vs. Q0, Godfather vs. Q0, Godfather vs. Q1, and Q0 vs. Q0
- 100 trials each
- 1 million iterations each trial
Measurements taken
- Convergence of Q-values
- Move efficiency
- Distribution of rewards to each agent
- Average probability of a follower angering Godfather
Q parameter settings: γ = 0.9 and α = 0.1
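The protocol above amounts to a simple nested loop over matches and trials. The sketch below assumes a hypothetical run_match driver that plays one leader against one follower for the given number of iterations; it is illustrative only.

```python
MATCHES = [("Bully", "Q0"), ("Godfather", "Q0"), ("Godfather", "Q1"), ("Q0", "Q0")]
TRIALS = 100
ITERATIONS = 1_000_000
GAMMA, ALPHA = 0.9, 0.1  # Q parameter settings from this slide

def run_experiments(run_match):
    # run_match(leader, follower, iterations, gamma, alpha) is assumed to
    # return whatever per-trial statistics are being measured.
    results = {}
    for leader, follower in MATCHES:
        results[(leader, follower)] = [
            run_match(leader, follower, ITERATIONS, gamma=GAMMA, alpha=ALPHA)
            for _ in range(TRIALS)
        ]
    return results
```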
8
Convergence Results
- All Q-learners converge.
- What equilibria do they converge to?
- What is the utility of these equilibria?
9
Utility of Equilibria

Experiment          Leader Reward      Follower Reward    Game Count
Bully vs. Q0        129.458 (11.349)    59.051 (22.623)   9577.33 (231.807)
Godfather vs. Q0    158.125 (20.258)   131.24  (30.243)   7085.24 (681.866)
Godfather vs. Q1    178.244 (5.395)    156.742 (10.753)   6565.68 (319.427)
Q0 vs. Q0           117.549 (46.065)   136.188 (55.586)   9371.08 (348.526)
(Standard deviations in parentheses.)

- Q0 vs. Q0 rewards have large standard deviations compared with matches against a fixed leader.
- Godfather matches earn more reward than Bully and yield greater symmetry of reward.
- Godfather vs. Q1 reached the highest-yielding equilibrium.
10
Importance of Signals
- Q0's lack of a punishment signal causes it to converge to behavior that angers Godfather.
- Q1's knowledge of when it is being punished allows it to take actions that never anger Godfather.
11
Conclusion
- Coordination points reached with a fixed leader show much less variance than Follower vs. Follower.
- Implicit threats can be used to establish a symmetric, high-yielding Nash equilibrium in a large state space.
- A Tit-for-Tat leader achieves better coordination than a dominating leader through conditioning.
- It is imperative to use signals when conditioning a learner with a multi-state leader.