Towards a Theoretic Understanding of DCEE Scott Alfeld, Matthew E. Taylor, Prateek Tandon, and Milind Tambe Lafayette College http://teamcore.usc.edu
Forward Pointer: When Should There be a “Me” in “Team”? Distributed Multi-Agent Optimization Under Uncertainty. Matthew E. Taylor, Manish Jain, Yanquin Jin, Makoto Yokoo, & Milind Tambe. Wednesday, 8:30 – 10:30, Coordination and Cooperation 1
Teamwork: Foundational MAS Concept. Joint actions improve the outcome, but increase communication & computation. Over two decades of work. This paper: increased teamwork can harm the team, even without considering communication & computation, and considering only team reward. Multiple algorithms, multiple settings. But why?
DCOPs: Distributed Constraint Optimization Problems. Multiple domains: meeting scheduling, traffic light coordination, RoboCup soccer, multi-agent plan coordination, sensor networks. Properties: distributed, robust to failure, scalable, (in)complete algorithms, quality bounds.
DCOP Framework: three agents a1, a2, a3 connected in a chain, with a pairwise reward table for each constraint, (a1, a2) and (a2, a3); each table contains rewards 10 and 6 depending on the joint values chosen. (Note: this is not graph coloring.) Different “levels” of teamwork are possible, from 1-opt up to fully centralized. Finding the complete (optimal) solution is NP-hard.
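A minimal sketch of the chain DCOP above, assuming binary-valued agents and illustrative reward tables built from the two rewards (10 and 6) shown on the slide; the agent and value names are hypothetical:

```python
from itertools import product

# Hypothetical encoding of the chain DCOP sketched above: three agents,
# each choosing a value in {0, 1}, with one reward table per constraint edge.
agents = ["a1", "a2", "a3"]
values = [0, 1]

# Illustrative reward tables (the slide only shows rewards 10 and 6).
rewards = {
    ("a1", "a2"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 6},
    ("a2", "a3"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 6},
}

def team_reward(assignment):
    """Total team reward: sum of rewards over all constraint edges."""
    return sum(table[(assignment[i], assignment[j])]
               for (i, j), table in rewards.items())

# Brute-force the optimal joint assignment (feasible only for tiny DCOPs;
# the general problem is NP-hard, as noted on the slide).
best = max((dict(zip(agents, combo)) for combo in product(values, repeat=len(agents))),
           key=team_reward)
print(best, team_reward(best))
```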
DCEE: Distributed Coordination of Exploration and Exploitation. The environment may be unknown. Maximize on-line reward over some number of rounds: exploration vs. exploitation. Demonstrated on a mobile ad-hoc network, in simulation [released] & on robots [released soon].
DCOP: Distributed Constraint Optimization Problem
DCOP → DCEE Distributed Coordination of Exploration and Exploitation
DCEE Algorithm: SE-Optimistic (will build upon later). Rewards on [1, 200]. Each agent optimistically assumes: “If I move, I’d get R = 200.” Example: chain of agents a1-a2-a3-a4 with current link rewards 99, 50, 75.
DCEE Algorithm: SE-Optimistic. Rewards on [1, 200]. On the chain a1-a2-a3-a4 with current link rewards 99, 50, 75, each agent estimates its optimistic gain from moving: a1 would gain 101, a2 would gain 251, a3 would gain 275, and a4 would gain 125. Explore or Exploit?
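A small sketch of the SE-Optimistic gain computation, assuming the chain and link rewards shown above and a maximum possible reward of 200; the names and structure are illustrative:

```python
# SE-Optimistic (sketch): each agent optimistically assumes that moving would
# raise every one of its constraint rewards to the maximum possible value.
MAX_REWARD = 200

# Chain a1-a2-a3-a4 with the current link rewards from the slide.
links = {("a1", "a2"): 99, ("a2", "a3"): 50, ("a3", "a4"): 75}
agents = ["a1", "a2", "a3", "a4"]

def optimistic_gain(agent):
    """Estimated gain if `agent` moves: (MAX_REWARD - r) summed over its links."""
    return sum(MAX_REWARD - r for pair, r in links.items() if agent in pair)

for a in agents:
    print(a, optimistic_gain(a))
# Expected output: a1 101, a2 251, a3 275, a4 125 -- matching the slide.
```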
Balanced Exploration Techniques: BE-Rebid (Balanced Exploration with Backtracking). A decision-theoretic calculation of exploration: BE techniques assume knowledge of the reward distribution, and use the current reward, the time left, and that distribution to estimate the utility of exploring. Agents track their previous best location Rb and can backtrack to it. Unlike SE methods, where agents re-evaluate every time step, an agent can commit to explore for some number of steps te. Each round an agent compares two actions, backtrack or explore. E.U.(explore) is the sum of three terms: the reward accrued while exploring, the reward while exploiting the new best value × P(improve on Rb), and the reward while exploiting Rb × P(NOT improve on Rb). After agents explore and then backtrack, they cannot have reduced the overall reward. BE techniques are more complicated: they require more computation and are harder to implement.
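A hedged rendering of that three-term expected utility, assuming a horizon of T rounds, current round t, exploration length t_e, and previous best reward R_b; the slide gives only the decomposition, not the exact algebra:

```latex
% Sketch only (assumed notation): horizon T, current round t,
% exploration length t_e, previous best reward R_b.
EU(\mathrm{explore}, t_e) \approx
  \underbrace{\mathbb{E}\!\left[\text{reward accrued during the } t_e \text{ exploration steps}\right]}_{\text{reward while exploring}}
  + P(\text{improve on } R_b)\cdot
    \mathbb{E}\!\left[\text{reward exploiting the new best for } T - t - t_e \text{ steps}\right]
  + P(\text{no improvement})\cdot R_b\,(T - t - t_e)
```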
Success! [ATSN-09][IJCAI-09]. Both classes of (incomplete) algorithms, in simulation and on robots, on an ad-hoc wireless network (an improvement whenever performance > 0). ATSN: Third International Workshop on Agent Technology for Sensor Networks (at AAMAS-09).
k-Optimality: increased coordination (originally a DCOP formulation). In DCOP, increased k = increased team reward. Find groups of agents to change variables (joint actions); neighbors of a moving group cannot move. Defines the amount of teamwork (higher communication & computation overheads).
“k-Optimality” in DCEE: groups of size k form, and those with the most to gain move (change the values of their variables). A group can only move if no other agents in its neighborhood move.
Example: SE-Optimistic-2. Rewards on [1, 200]; chain a1-a2-a3-a4 with link rewards 99, 50, 75 and individual optimistic gains 101, 251, 275, 125 as before. Pairs of neighbors now bid their joint gain, subtracting the shared link’s gain so it is not counted twice: (a1, a2): 101 + 251 - 101 = 251; (a2, a3): 251 + 275 - 150 = 376; (a3, a4): 275 + 125 - 125 = 275.
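A sketch of the pairwise (k = 2) gain computation, reusing the hypothetical chain from the SE-Optimistic sketch; the shared link’s gain is subtracted once so it is not double-counted:

```python
# SE-Optimistic-2 (sketch): neighboring pairs bid their joint optimistic gain.
MAX_REWARD = 200
links = {("a1", "a2"): 99, ("a2", "a3"): 50, ("a3", "a4"): 75}

def optimistic_gain(agent):
    return sum(MAX_REWARD - r for pair, r in links.items() if agent in pair)

def pair_gain(i, j):
    """Joint gain for a neighboring pair: sum of individual gains minus the
    gain on the shared link, which both individual estimates counted."""
    shared = MAX_REWARD - links[(i, j)]
    return optimistic_gain(i) + optimistic_gain(j) - shared

for pair in links:
    print(pair, pair_gain(*pair))
# Expected: (a1, a2) -> 251, (a2, a3) -> 376, (a3, a4) -> 275, as on the slide.
```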
Sample coordination results [figure: artificially supplied rewards (DCOP), complete graph and chain graph]. Omniscient: confirms the DCOP result, as expected.
Physical Implementation: Create robots, mobile ad-hoc wireless network.
Confirms Team Uncertainty Penalty [figure: total gain on chain and complete graphs]. Averaged over 10 trials each. Trend confirmed! (Huge standard error.)
Problem with “k-Optimal”: with unknown rewards, an agent cannot know whether it can increase reward by moving! Define a new term, L-Movement: the number of agents that can change variables per round. Independent of the exploration algorithm; graph dependent. An alternate measure of teamwork.
General DCOP Analysis Tool? L-Movement example: k = 1 algorithms. For k = 1, L is the size of the maximum independent set of the graph, which is NP-hard to compute for a general graph (and harder still for higher k). Consider ring & complete graphs, both with 5 vertices: the maximum independent set of the ring graph has size 2, and of the complete graph size 1. So for k = 1, L = 1 for a complete graph, while the maximum independent set of a ring graph with n vertices has size ⌊n/2⌋.
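A brute-force check of the maximum-independent-set sizes quoted above, for the 5-vertex ring and complete graphs (illustrative only; the general problem is NP-hard):

```python
from itertools import combinations

def max_independent_set_size(n, edges):
    """Largest set of vertices with no edge between any two (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(p) not in edge_set for p in combinations(subset, 2)):
                return size
    return 0

n = 5
ring = [(i, (i + 1) % n) for i in range(n)]
complete = [(i, j) for i in range(n) for j in range(i + 1, n)]

print(max_independent_set_size(n, ring))      # 2 -> L for k = 1 on the ring
print(max_independent_set_size(n, complete))  # 1 -> L for k = 1 on the complete graph
```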
Configuration Hypercube. No (partial) assignment is believed to be better than another, so, wlog, agents can select their next value in a fixed order when exploring. Define the configuration hypercube C: each agent is a dimension, and C[v1, ..., vn] is the total team reward when agent i takes its value vi. C cannot be calculated without exploration; its values are drawn from the known reward distribution. Moving along an axis in the hypercube corresponds to an agent changing its value. Example: 3 agents (C is 3-dimensional); changing from C[a, b, c] to C[a, b, c’] means agent A3 changes from c to c’.
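A minimal sketch of a configuration hypercube for 3 agents, assuming rewards drawn i.i.d. from a known distribution (here uniform on [1, 200], matching the earlier slides); the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Configuration hypercube C (sketch): one dimension per agent, s candidate
# values per agent; C[v1, v2, v3] is the total team reward for that joint
# configuration, drawn here from a known reward distribution.
n_agents, s = 3, 5
C = rng.uniform(1, 200, size=(s,) * n_agents)

# Moving along one axis = one agent changing its value.
a, b, c = 0, 0, 0
print(C[a, b, c])      # reward at configuration [a, b, c]
print(C[a, b, c + 1])  # agent A3 changed its value from c to c'
```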
How many agents can move? (1/2) In a ring graph with 5 nodes: k = 1 → L = 2; k = 2 → L = 3. In a complete graph with 5 nodes: k = 1 → L = 1; k = 2 → L = 2.
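A brute-force sketch of L for these small graphs, under the assumption (taken from the k-optimality slides) that moving groups are connected sets of at most k agents and that no neighbor of a moving group may itself move; the group semantics beyond that are my reading, not stated on the slide:

```python
from itertools import combinations

def neighbors(v, edges):
    return {u for e in edges for u in e if v in e} - {v}

def is_connected(group, edges):
    group = set(group)
    frontier, seen = [next(iter(group))], set()
    while frontier:
        v = frontier.pop()
        if v in seen:
            continue
        seen.add(v)
        frontier.extend(neighbors(v, edges) & group)
    return seen == group

def L_movement(n, edges, k):
    """Max number of agents that can move in one round (brute force): pick
    disjoint connected groups of size <= k such that no chosen group
    touches the neighborhood of another chosen group."""
    verts = list(range(n))
    groups = [set(g) for size in range(1, k + 1)
              for g in combinations(verts, size) if is_connected(g, edges)]
    best = 0

    def closed(g):  # group plus its neighborhood
        return g | {u for v in g for u in neighbors(v, edges)}

    def search(moving, blocked, start):
        nonlocal best
        best = max(best, len(moving))
        for idx in range(start, len(groups)):
            g = groups[idx]
            if g & blocked or closed(g) & moving:
                continue
            search(moving | g, blocked | closed(g), idx + 1)

    search(set(), set(), 0)
    return best

n = 5
ring = [(i, (i + 1) % n) for i in range(n)]
complete = [(i, j) for i in range(n) for j in range(i + 1, n)]
for k in (1, 2):
    print(k, L_movement(n, ring, k), L_movement(n, complete, k))
# Expected (per the slide): ring k=1 -> 2, k=2 -> 3; complete k=1 -> 1, k=2 -> 2.
```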
How many agents can move? (2/2) A configuration C[c1, ..., cn] is reachable by an algorithm with movement L in s steps if and only if every coordinate satisfies ci ≤ s and the coordinates sum to at most L·s. Example: C[2, 2] is reachable for L = 1 only if s ≥ 4.
L-Movement Experiments. For various DCEE problems, distributions, and L: for steps s = 1...30, construct a hypercube with s values per dimension, find M, the maximum reward achievable in s steps given L, and return the average over 50 runs. Example: a 2-D hypercube; only half of it is reachable if L = 1, but all locations are reachable if L = 2.
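A sketch of this experiment for two agents, assuming the reachability condition above (every coordinate at most s, coordinate sum at most L·s) and a uniform reward distribution; the parameter values are illustrative:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def max_reachable_reward(s, L, n_agents=2, low=1, high=200):
    """Best reward in a fresh s-per-dimension hypercube that an algorithm
    with movement L can reach within s steps (sketch of the experiment)."""
    C = rng.uniform(low, high, size=(s,) * n_agents)
    best = -np.inf
    for coords in product(range(s), repeat=n_agents):
        if max(coords) <= s and sum(coords) <= L * s:  # reachability condition
            best = max(best, C[coords])
    return best

for L in (1, 2):
    avg = np.mean([max_reachable_reward(s=10, L=L) for _ in range(50)])
    print(L, round(avg, 1))
# With L = 2 all of C is reachable, so the average maximum reward is higher.
```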
Restricting to L-Movement: Complete. [figure: average maximum reward discovered on the complete graph] k = 1: L = 1; k = 2: L = 2 (L = 1 → 2).
Restricting to L-Movement: Ring. [figure: average maximum reward discovered on the ring graph] k = 1: L = 2; k = 2: L = 3.
Uniform distribution of rewards [figures: ring and complete graphs, 4 agents]; also results with a different normal distribution.
k and L: 5-agent graphs. [table: for k = 1 through 5, the corresponding L value on the ring graph and on the complete graph] Increasing k changes L less in the ring than in the complete graph. The configuration hypercube is an upper bound; posit a consistent negative effect of teamwork. This suggests why increasing k has different effects: a larger improvement in the complete graph than in the ring as k increases.
L-Movement May Help Explain the Team Uncertainty Penalty. An algorithm with L = 2 will be able to explore more of C than an algorithm with L = 1. This is independent of the exploration algorithm: L is determined by k and the graph structure. C is an upper bound; positing a constant negative effect of teamwork, any algorithm experiences diminishing returns as k increases, consistent with the DCOP results. The L-Movement difference between k = 1 and k = 2 algorithms is larger in graphs with more agents: for k = 1, L = 1 for a complete graph, while L increases with the number of vertices in a ring graph.
Thank you Towards a Theoretic Understanding of DCEE Scott Alfeld, Matthew E. Taylor, Prateek Tandon, and Milind Tambe http://teamcore.usc.edu