Multi-Agent Exploration Matthew E. Taylor http://teamcore.usc.edu/taylorm/
DCOPs: Distributed Constraint Optimization Problems
Domains: multi-agent plan coordination, sensor networks, meeting scheduling, traffic light coordination, RoboCup soccer
Properties: distributed, robust to failure, scalable, (in)complete algorithms with quality bounds
DCOP Framework
[Figure: three agents a1, a2, a3 connected by constraints, each constraint with a reward table, e.g., rewards of 10 and 6 for different joint assignments]
Different "levels" of coordination are possible, from 1-opt up to fully centralized (note: this is not graph coloring). A minimal encoding is sketched below.
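A minimal sketch of how a DCOP like the one in the figure can be encoded, assuming a simple dict-based representation; the agent names, domains, and reward values are illustrative stand-ins for the 10/6 tables above.

```python
from itertools import product

# Each agent controls one variable; domains are kept tiny for illustration.
variables = {"a1": [0, 1], "a2": [0, 1], "a3": [0, 1]}

# Pairwise reward tables: reward for each joint assignment of the two agents.
constraints = {
    ("a1", "a2"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 10},
    ("a2", "a3"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 10},
}

def total_reward(assignment):
    """Sum the reward of every constraint under a full assignment."""
    return sum(table[(assignment[i], assignment[j])]
               for (i, j), table in constraints.items())

# Centralized exhaustive search: the fully coordinated end of the spectrum.
best = max((dict(zip(variables, vals)) for vals in product(*variables.values())),
           key=total_reward)
print(best, total_reward(best))
```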
Motivation: DCOP Extension
Standard DCOPs are often unrealistic: the environment is not fully known! Agents need to learn while maximizing total reward.
Real-world applications: mobile ad-hoc networks, sensor networks.
Mobile wireless sensor networks are increasingly deployed in the real world, e.g., autonomous underwater vehicles collecting oceanic data or robots deployed in urban environments, such as robots establishing communication in a disaster scenario. These settings pose different challenges: rewards are unknown (a robot does not know in advance whether moving from one grid cell to another will help), the sensor network has limited time, which rules out extensive exploration, and anytime performance matters: if the experiment runs for 2 hours, the cumulative reward over those 2 hours must be good.
Problem Statement
DCEE: Distributed Coordination of Exploration & Exploitation
Challenges addressed: local communication, a network of (known) interactions, cooperative agents, unknown rewards, maximizing on-line reward, a limited time horizon, and an (effectively) infinite reward matrix.
Mobile Ad-Hoc Network
Rewards: signal strength between agents, on [1, 200]
Goal: maximize total signal strength over time
Assumes: small-scale fading dominates; the network topology is fixed
[Figure: four agents a1-a4 with example link strengths such as a1-a2 = 100, a2-a3 = 50, a3-a4 = 75]
A sketch of this reward model follows.
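A hedged sketch of the DCEE reward model described above: each link's signal strength is an unknown value on [1, 200], revealed only when the agents visit a pair of locations. The class name and the uniform sampling are assumptions for illustration; the real distribution is whatever the environment provides.

```python
import random

class Link:
    """One constraint (link) between two neighboring mobile agents."""

    def __init__(self):
        self._hidden = {}  # true signal strength per pair of locations, initially unknown

    def observe(self, loc_i, loc_j):
        """Reveal (and cache) the signal strength for this pair of locations."""
        key = (loc_i, loc_j)
        if key not in self._hidden:
            self._hidden[key] = random.randint(1, 200)  # assumed uniform on [1, 200]
        return self._hidden[key]

link = Link()
print(link.observe("cell_3", "cell_7"))  # first visit: the reward becomes known
print(link.observe("cell_3", "cell_7"))  # revisiting gives the same value (fixed topology)
```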
MGM Review Ideas?
Static Estimation: SE-Optimistic
Rewards are on [1, 200], so each agent optimistically assumes "If I move, I'd get R = 200" on every link.
[Figure: chain a1-a2-a3-a4 with current link rewards 100, 50, 75]
Static Estimation: SE-Optimistic
Each agent bids the optimistic gain of moving, summed over its links: "If I move, I'd gain..." 100 for a1, 250 for a2, 275 for a3, and 125 for a4 (see the sketch below).
[Figure: chain a1-a2-a3-a4 with current link rewards 100, 50, 75]
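A minimal sketch of how the SE-Optimistic bids above can be computed, assuming the chain a1-a2-a3-a4 with link rewards 100, 50, 75 from the figure; the dict encoding is an assumption for illustration.

```python
R_MAX = 200  # top of the reward range [1, 200]
links = {("a1", "a2"): 100, ("a2", "a3"): 50, ("a3", "a4"): 75}

def optimistic_gain(agent):
    """Optimistically assume every incident link would pay R_MAX after a move."""
    return sum(R_MAX - r for (i, j), r in links.items() if agent in (i, j))

for a in ("a1", "a2", "a3", "a4"):
    print(a, optimistic_gain(a))  # a1: 100, a2: 250, a3: 275, a4: 125
```

As in MGM, each agent then shares its bid with its neighbors, and only the highest bidder in a neighborhood actually moves that round.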
Results: Simulation
Maximize total reward: the area under the curve.
[Plot: SE-Optimistic vs. No Movement]
Balanced Exploration Techniques: BE-Backtrack
Balanced Exploration with Backtracking makes a decision-theoretic calculation of exploration: it assumes the reward distribution is known and uses the current reward, the time left, and the distribution information to estimate the utility of exploring.
Each agent tracks its previous best location (with reward Rb) and can backtrack to it; it bids to explore for some number of steps te. BE techniques are more complicated than SE techniques: they require more computation and are harder to implement. Whereas SE agents re-evaluate every time step, a BE-Backtrack agent can commit to an action for more than one round.
The agent compares two actions, Backtrack and Explore. E.U.(explore) is the sum of three terms (see the sketch below):
1. the reward while exploring
2. the reward while exploiting × P(improve on Rb), i.e., the utility of finding a better reward than the current Rb
3. the reward while exploiting × P(NOT improve on Rb), i.e., the utility of failing to find a better reward than Rb
Because agents that explore can always backtrack afterwards, exploration can never reduce the overall reward.
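A hedged sketch of the three-term comparison, assuming the reward distribution's CDF and mean are known. `expected_max_given_improve` is a hypothetical helper standing in for the expected value of the best reward found when it does beat Rb; the exact expectations in the published algorithm may differ.

```python
def eu_backtrack(Rb, T):
    """Backtrack now and exploit the best known reward Rb for all T remaining rounds."""
    return Rb * T

def eu_explore(Rb, T, te, mean_reward, cdf, expected_max_given_improve):
    """Explore for te rounds, then exploit the best reward found (or backtrack to Rb)."""
    # 1) reward while exploring
    explore_term = te * mean_reward
    # probability that the best of te samples improves on Rb
    p_improve = 1.0 - cdf(Rb) ** te
    # 2) reward while exploiting x P(improve reward)
    win_term = p_improve * expected_max_given_improve(Rb, te) * (T - te)
    # 3) reward while exploiting x P(NOT improve reward): backtrack to Rb
    lose_term = (1.0 - p_improve) * Rb * (T - te)
    return explore_term + win_term + lose_term

# Example with a uniform distribution on [1, 200]; the crude (Rb + 200) / 2
# stand-in for the conditional expected maximum is purely illustrative.
print(eu_backtrack(120, 10))
print(eu_explore(120, 10, 3, 100.5, lambda r: (r - 1) / 199,
                 lambda Rb, te: (Rb + 200) / 2))
```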
Results: Simulation
Maximize total reward: the area under the curve.
[Plot: BE-Backtrack vs. SE-Optimistic vs. No Movement]
Omniscient Algorithm
(Artificially) convert the DCEE to a DCOP with known rewards, then run the MGM algorithm [Pearce & Tambe, 2007] to quickly find a local optimum. This establishes an upper bound, but only works in simulation.
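For reference, a minimal sketch of one synchronous MGM round over known reward tables, which is what the omniscient bound runs; the function-parameter encoding and the strict tie-breaking are simplifying assumptions rather than the paper's exact formulation.

```python
def mgm_round(assignment, neighbors, gain_of_best_move, best_move):
    """One MGM step: every agent computes its best local gain, and only agents
    whose gain strictly beats all of their neighbors' gains actually move."""
    gains = {a: gain_of_best_move(a, assignment) for a in assignment}
    new_assignment = dict(assignment)
    for a in assignment:
        if gains[a] > 0 and all(gains[a] > gains[n] for n in neighbors[a]):
            new_assignment[a] = best_move(a, assignment)
    return new_assignment
```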
Results: Simulation
Maximize total reward: the area under the curve.
[Plot: Omniscient vs. BE-Backtrack vs. SE-Optimistic vs. No Movement]
Balanced Exploration Techniques: BE-Rebid
Balanced Exploration with Backtracking and Rebidding: agents can backtrack, but re-evaluate the explore/backtrack decision every time step [Montemerlo04], allowing for on-the-fly reasoning (see the sketch below).
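A sketch of the difference from BE-Backtrack, assuming the expected-utility helpers from the earlier sketch (here passed in as callables that already close over the distribution information): the agent still picks the best exploration horizon, but commits to only one round of it and rebids at the next time step. All names are illustrative.

```python
def be_rebid_step(Rb, rounds_left, eu_backtrack, eu_explore, explore, backtrack):
    """Re-run the explore-vs-backtrack comparison from scratch this round."""
    # Best exploration horizon available right now (eu_explore(Rb, T, te) assumed).
    best_te = max(range(1, rounds_left + 1),
                  key=lambda te: eu_explore(Rb, rounds_left, te))
    if eu_explore(Rb, rounds_left, best_te) > eu_backtrack(Rb, rounds_left):
        return explore()   # explore for just this round, then rebid next round
    return backtrack()     # return to (and exploit) the best known location
```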
Balanced Exploration Techniques: BE-Stay
Agents are unable to backtrack, which is true for some types of robots. BE-Stay takes a dynamic-programming approach; as before, we assume that no neighbors move while these values are calculated (a small sketch follows).
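A small dynamic-programming sketch of the BE-Stay value calculation for a single agent, assuming a discrete uniform reward distribution on [1, 200] and, as noted above, that neighbors hold still while the values are computed; the recursion structure is an illustration, not the paper's exact formulation.

```python
from functools import lru_cache
import statistics

REWARDS = range(1, 201)            # possible per-step rewards, assumed uniform
MEAN = statistics.mean(REWARDS)    # expected reward of one exploration step

@lru_cache(maxsize=None)
def value(current, rounds_left):
    """Best expected total reward when the agent cannot backtrack."""
    if rounds_left == 0:
        return 0.0
    stay = current * rounds_left   # keep the current reward for every remaining round
    # Explore: earn an expected MEAN this round, then face the same decision
    # from whatever reward the new location turns out to give.
    explore = MEAN + sum(value(r, rounds_left - 1) for r in REWARDS) / len(REWARDS)
    return max(stay, explore)

print(value(120, 5))   # e.g., current reward 120 with 5 rounds remaining
```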
Results (simulation): 10 agents, random graphs with 15-20 links
Results (simulation): chain topology, 100 rounds
Results (simulation): 20 agents, 100 rounds
Also Tested on Physical Robots
Used iRobot Creates (unfortunately, they don't vacuum), Cengen hardware, etc.
Sample Robot Results
k-Optimality
Increased coordination: find pairs of agents to jointly change their variables (locations), at the cost of higher communication overhead (see the sketch below).
Algorithm variants: SE-Optimistic, SE-Optimistic-2, SE-Optimistic-3, SE-Mean, SE-Mean-2, BE-Rebid, BE-Rebid-2, BE-Stay, BE-Stay-2
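A sketch of the "-2" idea for SE-Optimistic, assuming the same chain and dict encoding as before: a pair of neighboring agents computes one joint optimistic bid over every link incident to either of them, with their shared link counted once. The extra pairwise messaging is where the higher communication overhead comes from.

```python
R_MAX = 200
links = {("a1", "a2"): 100, ("a2", "a3"): 50, ("a3", "a4"): 75}

def incident(agent):
    """All links touching this agent."""
    return {key: r for key, r in links.items() if agent in key}

def joint_optimistic_gain(a, b):
    """Optimistic gain if agents a and b move together (shared link counted once)."""
    touched = {**incident(a), **incident(b)}
    return sum(R_MAX - r for r in touched.values())

print(joint_optimistic_gain("a2", "a3"))   # (200-100) + (200-50) + (200-75) = 375
```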
Confirm Previous DCOP Results: when rewards are (artificially) provided, k=2 outperforms k=1.
Sample coordination results
[Plots: full graph, chain graph]
Surprising Result: Increased Coordination can Hurt
Regular Graphs