Multi-Agent Exploration


1 Multi-Agent Exploration
Matthew E. Taylor

2 DCOPs: Distributed Constraint Optimization Problems
Multiple domains: multi-agent plan coordination, sensor networks, meeting scheduling, traffic light coordination, RoboCup soccer. Properties: distributed, robust to failure, scalable, (in)complete, quality bounds.

3 DCOP Framework
[Figure: agents a1, a2, a3 linked by constraints, each pair with a reward table whose entries (10 or 6) depend on the joint assignment]
Note: this is not graph coloring. Different "levels" of coordination (k-opt) are possible, from 1-opt up to fully centralized.
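To make the framework concrete, here is a minimal Python sketch (not the talk's code; the function name team_reward and the toy tables are illustrative) showing that the team reward in a binary DCOP is just the sum of the reward-table entries selected by the agents' current values:

# Minimal sketch: team reward in a binary DCOP.
# tables maps an edge (ai, aj) to {(vi, vj): reward}.
def team_reward(assignment, tables):
    total = 0
    for (ai, aj), table in tables.items():
        total += table[(assignment[ai], assignment[aj])]
    return total

# Toy example in the spirit of the a1-a2-a3 figure: pairwise reward is
# 10 when the two values agree and 6 otherwise.
tables = {
    ("a1", "a2"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 10},
    ("a2", "a3"): {(0, 0): 10, (0, 1): 6, (1, 0): 6, (1, 1): 10},
}
print(team_reward({"a1": 0, "a2": 0, "a3": 1}, tables))  # 10 + 6 = 16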

4 Motivation: DCOP Extension
Unrealistic assumption: the environment is often not fully known! Agents need to learn while maximizing total reward. Real-world applications: mobile ad-hoc networks, sensor networks. The use of mobile wireless sensor networks in the real world has been on the rise; examples include autonomous underwater vehicles collecting oceanic data and robots deployed in urban environments, e.g., to establish communication in a disaster scenario. However, mobile sensor networks pose different challenges: rewards are unknown (a robot doesn't know whether moving from one grid cell to another will help), the network has limited time, which disallows extensive exploration, and anytime performance matters (if the experiment runs for 2 hours, cumulative performance over those 2 hours needs to be good).

5 Problem Statement DCEE:
Distributed Coordination of Exploration & Exploitation. Challenges addressed: local communication; a network of (known) interactions; cooperative agents; unknown rewards; maximize on-line reward; limited time-horizon; (effectively) infinite reward matrix.

6 Mobile Ad-Hoc Network Rewards: signal strength between agents [1,200]
Goal: maximize signal strength over time. Assumes: small-scale fading dominates; topology is fixed.
[Figure: network of agents a1-a4 with example pairwise signal strengths such as 100, 50, 75, 95]

7 MGM (Maximum Gain Message) Review. Ideas?

8 Static Estimation: SE-Optimistic
Rewards on [1,200]. Each agent optimistically reasons: "If I move, I'd get R=200 on every link."
[Figure: chain a1-a2-a3-a4 with current link rewards 100, 50, 75]

9 Static Estimation: SE-Optimistic
Rewards on [1,200]. Each agent optimistically estimates its gain from moving as (200 × its number of links) minus its current total link reward: a1 would gain 100, a2 would gain 250, a3 would gain 275, and a4 would gain 125.
[Figure: chain a1-a2-a3-a4 with current link rewards 100, 50, 75]
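A minimal Python sketch of the SE-Optimistic estimate just described, assuming rewards on [1,200] and the chain of link rewards from the figure; the function name se_optimistic_gain and the data layout are illustrative, not the talk's code:

# Each agent assumes an unexplored position yields the maximum reward (200)
# on every incident link, so its estimated gain from moving is
# 200 * degree - (current sum of incident link rewards).
R_MAX = 200

def se_optimistic_gain(agent, links):
    incident = [r for (a, b), r in links.items() if agent in (a, b)]
    return R_MAX * len(incident) - sum(incident)

links = {("a1", "a2"): 100, ("a2", "a3"): 50, ("a3", "a4"): 75}
for agent in ("a1", "a2", "a3", "a4"):
    print(agent, se_optimistic_gain(agent, links))
# prints gains 100, 250, 275, 125 -- matching the slide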

10 Results: Simulation Maximize total reward: area under curve
[Plot: cumulative reward over time for SE-Optimistic vs. No Movement]

11 Balanced Exploration Techniques
BE-Backtrack (Balanced Exploration with Backtracking): a decision-theoretic calculation of when to explore. It assumes knowledge of the reward distribution and uses the current reward, the time left, and that distribution to estimate the utility of exploration. BE techniques are more complicated than the SE methods: they require more computation and are harder to implement. Each agent tracks the reward Rb of its previous best location and can backtrack to it, so exploring and then backtracking cannot have reduced the overall reward. Whereas SE agents re-evaluate every time step, a BE-Backtrack agent bids to commit to exploring for some number of steps te, then compares two actions, backtrack and explore. The expected utility of exploring is the sum of three terms (sketched below):
- the reward gathered while exploring,
- the reward while exploiting × P(improve on Rb): the utility of finding a better reward than Rb,
- the reward while exploiting × P(not improve on Rb): the utility of failing to find a better reward and returning to Rb.
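A hedged reconstruction of that backtrack-vs-explore comparison, written out in LaTeX; the notation (T rounds remaining, t_e length of the exploration bid, R_b best visited reward, D the known reward distribution, p the probability that the best of the t_e explored rewards beats R_b) is introduced here for illustration, and the paper's exact expressions may differ:

EU(\text{backtrack}) = R_b \cdot T

EU(\text{explore}, t_e) \approx
      t_e \cdot \mathbb{E}_{r \sim D}[r]                              % reward while exploring
    + (T - t_e) \cdot \mathbb{E}[r^{*} \mid r^{*} > R_b] \cdot p      % reward while exploiting x P(improve)
    + (T - t_e) \cdot R_b \cdot (1 - p)                               % reward while exploiting x P(not improve)

where r^{*} is the best reward found during the t_e exploration steps; the agent bids to explore whenever some t_e makes EU(explore, t_e) exceed EU(backtrack).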

12 Results: Simulation Maximize total reward: area under curve
[Plot: cumulative reward over time for BE-Backtrack, SE-Optimistic, and No Movement]

13 Omniscient Algorithm (Artificially) convert DCEE to DCOP
Run the MGM algorithm [Pearce & Tambe, 2007] to quickly find a local optimum and establish an upper bound. This only works in simulation (a sketch of one MGM round follows).
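For reference, a hedged Python sketch of one MGM round (each agent computes its best unilateral change and only agents whose gain beats every neighbor's move); the function names, data layout, and tie-breaking rule are illustrative, not the authors' implementation:

def mgm_round(values, domains, neighbors, pair_reward):
    # values: {agent: value}; pair_reward(a, b, va, vb) -> reward of that link.
    def local_reward(agent, v):
        return sum(pair_reward(agent, n, v, values[n]) for n in neighbors[agent])

    # 1. Each agent finds its best unilateral change and the resulting gain.
    gains, best = {}, {}
    for a in values:
        v_star = max(domains[a], key=lambda v: local_reward(a, v))
        gains[a] = local_reward(a, v_star) - local_reward(a, values[a])
        best[a] = v_star

    # 2. An agent moves only if its gain beats all neighbors' (ties broken by name).
    new_values = dict(values)
    for a in values:
        if gains[a] > 0 and all((gains[a], a) > (gains[n], n) for n in neighbors[a]):
            new_values[a] = best[a]
    return new_values

# Toy usage: two neighbors earn 10 for matching values, 6 otherwise.
values = {"a1": 0, "a2": 1}
domains = {"a1": [0, 1], "a2": [0, 1]}
neighbors = {"a1": ["a2"], "a2": ["a1"]}
print(mgm_round(values, domains, neighbors, lambda a, b, va, vb: 10 if va == vb else 6))
# -> {'a1': 0, 'a2': 0}: only one of the tied agents moves, avoiding thrashing.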

14 Results: Simulation Maximize total reward: area under curve
[Plot: cumulative reward over time for Omniscient, BE-Backtrack, SE-Optimistic, and No Movement]

15 Balanced Exploration Techniques
BE-Rebid (Balanced Exploration with Backtracking and Rebidding): allows agents to backtrack, but re-evaluates the bid every time step [Montemerlo04], allowing for on-the-fly reasoning.

16 Balanced Exploration Techniques
BE-Stay: agents are unable to backtrack (true for some types of robots). Dynamic programming approach; we again assume that no neighbors move while calculating these values (a sketch follows).
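A hedged dynamic-programming sketch of the BE-Stay idea, under the assumptions stated here for illustration (no backtracking, neighbors fixed, and per-step rewards drawn uniformly from [1,200] as a stand-in for the known distribution); the paper's actual recurrence may differ:

from functools import lru_cache

REWARDS = range(1, 201)  # assumed uniform reward support, for illustration only

@lru_cache(maxsize=None)
def value(r, t):
    # Expected total reward over the remaining t steps while currently earning r per step.
    if t == 0:
        return 0.0
    stay = r + value(r, t - 1)                      # keep the current position
    explore = sum(r2 + value(r2, t - 1) for r2 in REWARDS) / len(REWARDS)  # move to a fresh draw
    return max(stay, explore)

# An agent keeps exploring while the 'explore' branch dominates 'stay', e.g.:
print(value(100, 10))  # value of being at reward 100 with 10 steps left, under optimal decisions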

17 Results (simulation) (10 agents, random graphs with 15-20 links)

18 Results (simulation) (chain topology, 100 rounds)

19 Results (simulation) (20 agents, 100 rounds)

20 Also Tested on Physical Robots
Used iRobot Creates (unfortunately, they don't vacuum), Cengen hardware, etc.

21 Sample Robot Results

22 k-Optimality Increased coordination
Find pairs of agents to jointly change their variables (locations), at the cost of higher communication overhead. Algorithm variants: SE-Optimistic, SE-Optimistic-2, SE-Optimistic-3, SE-Mean, SE-Mean-2, BE-Rebid, BE-Rebid-2, BE-Stay, BE-Stay-2 (a pairwise-gain sketch follows).
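As referenced above, a hedged Python sketch of a pairwise (k=2) optimistic estimate, in which a pair of neighbors assumes that if both move, every link touching either of them reaches the maximum reward; the function name and the exact form of the estimate are illustrative, and the paper's pairwise estimators may differ:

R_MAX = 200

def se_optimistic_pair_gain(a, b, links):
    # Links touched if the pair (a, b) jointly moves, counting the shared link once.
    touched = {e: r for e, r in links.items() if a in e or b in e}
    return R_MAX * len(touched) - sum(touched.values())

links = {("a1", "a2"): 100, ("a2", "a3"): 50, ("a3", "a4"): 75}
print(se_optimistic_pair_gain("a2", "a3", links))  # 3 * 200 - 225 = 375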

23 Confirm Previous DCOP Results
If rewards are (artificially) provided, k=2 outperforms k=1

24 Sample coordination results
[Plots: full graph and chain graph]

25 Surprising Result: Increased Coordination can Hurt

26 Surprising Result: Increased Coordination can Hurt

27 Regular Graphs

