Networked Distributed POMDPs: DCOP-Inspired Distributed POMDPs
  Ranjit Nair, Honeywell Labs
  Pradeep Varakantham, USC
  Milind Tambe, USC
  Makoto Yokoo, Kyushu University
Background: DPOMDP
  Distributed Partially Observable Markov Decision Problems (DPOMDPs): a decision-theoretic approach
    Performance is linked to the optimality of decision making
    Explicitly reasons about positive and negative rewards and about uncertainty
  Current methods use centralized planning and distributed execution
    Finding an optimal policy is NEXP-complete
  In many domains, not all agents can interact with or affect each other
    Distributed sensors
    Disaster rescue simulations
    Battlefield simulations
  Most current DPOMDP algorithms do not exploit this locality of interaction
Background: DCOP
  Distributed Constraint Optimization Problem (DCOP):
    Constraint graph (V, E)
    Vertices are the agents' variables (x1, ..., x4), each with a domain (d1, ..., d4)
    Edges represent rewards
  DCOP algorithms exploit locality of interaction
  DCOP algorithms do not reason about uncertainty
  [Figure: constraint graph over x1, x2, x3, x4 with a cost table f(di, dj); two example assignments are shown, one with Cost = 0 and one with Cost = 7]
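A minimal sketch of how a DCOP assignment is scored, assuming a chain-shaped constraint graph and a made-up cost table. The values of f(di, dj) on the slide were not preserved, so `COST`, `EDGES`, and `DOMAIN` below are hypothetical, and the brute-force search only stands in for a real DCOP algorithm.

```python
from itertools import product

# Hypothetical binary cost table f(di, dj); not the values from the slide.
COST = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 0}

# Hypothetical chain-shaped constraint graph over the variables x1..x4.
VARIABLES = ["x1", "x2", "x3", "x4"]
EDGES = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]
DOMAIN = [0, 1]


def total_cost(assignment):
    """Sum the edge costs of a complete assignment {variable: value}."""
    return sum(COST[(assignment[u], assignment[v])] for u, v in EDGES)


def brute_force_optimum():
    """Enumerate every assignment and return (assignment, cost) with minimal cost."""
    best = None
    for values in product(DOMAIN, repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        cost = total_cost(assignment)
        if best is None or cost < best[1]:
            best = (assignment, cost)
    return best
```

Because the total is a sum over edges, changing one variable only affects the edges that touch it. That is the locality DCOP algorithms exploit.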
Key ideas and contributions
  Key ideas:
    Exploit locality of interaction to enable scale-up
    Hybrid DCOP-DPOMDP approach to collaboratively find a joint policy
    Distributed offline planning and distributed execution
  Key contributions:
    ND-POMDP: a distributed POMDP model that captures locality of interaction
    Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
      Hill climbing, like the Distributed Breakout Algorithm (DBA)
      Distributed parallel algorithm for finding a locally optimal joint policy
    Globally Optimal Algorithm (GOA)
      Based on variable elimination
Outline
  Sensor net domain
  Networked Distributed POMDPs (ND-POMDPs)
  Locally Interacting Distributed Joint Equilibrium-based Search for Policies (LID-JESP)
  Globally Optimal Algorithm (GOA)
  Experiments
  Conclusions and future work
Example Domain
  Two independent targets
    Each changes position according to its own stochastic transition function
  Sensing agents cannot affect each other or a target's position
  False positives and false negatives are possible when observing targets
  A reward is obtained if two agents correctly track a target together
  There is a cost for leaving a sensor on
  [Figure: sensor network with agents Ag1 to Ag5, sectors Sec1 to Sec5, targets target1 and target2, and compass directions N, E, S, W]
Networked Distributed POMDP
  ND-POMDP for a set of n agents Ag: <S, A, P, O, Ω, R, b>
  World state s ∈ S, where S = S1 × … × Sn × Su
    Each agent i ∈ Ag has a local state si ∈ Si
      E.g., is the sensor on or off?
    Su is the part of the state that no agent can affect
      E.g., the locations of the two targets
  b is the initial belief state, a probability distribution over S
    b = b1 · … · bn · bu
  A = A1 × … × An, where Ai is the set of actions for agent i
    E.g., “Scan East”, “Scan West”, “Turn Off”
  No communication during execution
    Agents communicate during planning
ND-POMDP
  Transition independence: agent i's local state cannot be affected by other agents
    Pi : Si × Su × Ai × Si → [0,1]
    Pu : Su × Su → [0,1]
  Ω = Ω1 × … × Ωn, where Ωi is the set of observations for agent i
    E.g., target present in sector
  Observation independence: agent i's observations do not depend on other agents
    Oi : Si × Su × Ai × Ωi → [0,1]
  Reward function R is decomposable
    R(s, a) = ∑l Rl(sl1, …, slk, su, al1, …, alk), where l ⊆ Ag and k = |l|
  Goal: find a joint policy π = <π1, …, πn>, where πi is the local policy of agent i, such that π maximizes the expected joint reward over a finite horizon T
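The decomposable reward can be illustrated with a small sketch. It assumes two sensors that share one target. The reward magnitudes, the action names "scan" and "off", and the unaffectable state values "present" and "absent" are all hypothetical, chosen only to mirror the sensor example.

```python
SCAN_COST = -1.0      # hypothetical cost for leaving a sensor on
TRACK_REWARD = 10.0   # hypothetical reward when both agents track the target


def scan_cost(states, s_u, acts):
    """Single-agent link R_i: pay a cost whenever the sensor is scanning."""
    return SCAN_COST if acts[0] != "off" else 0.0


def track_reward(states, s_u, acts):
    """Two-agent link R_ij: reward only if both scan while the target is present."""
    both_scan = all(a == "scan" for a in acts)
    return TRACK_REWARD if both_scan and s_u == "present" else 0.0


def joint_reward(link_rewards, local_states, s_u, actions):
    """R(s, a) as a sum over links l of R_l(states of l, s_u, actions of l)."""
    total = 0.0
    for link, reward_fn in link_rewards.items():
        states = tuple(local_states[i] for i in link)
        acts = tuple(actions[i] for i in link)
        total += reward_fn(states, s_u, acts)
    return total


# Hypothetical links: one cost link per agent plus one shared tracking link.
LINKS = {(1,): scan_cost, (2,): scan_cost, (1, 2): track_reward}
```

Each link function sees only the local states and actions of the agents in that link, plus su, so no term depends on the full joint state.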
ND-POMDP as a DCOP
  Inter-agent interactions are captured by an interaction hypergraph (Ag, E)
    Each agent is a node
    Set of hyperedges E = {l | l ⊆ Ag and Rl is a component of R}
  Neighborhood of agent i: the set of i's neighbors
    Ni = {j ∈ Ag | j ≠ i, ∃ l ∈ E, i ∈ l and j ∈ l}
  Agents are solving a DCOP where:
    The constraint graph is the interaction hypergraph
    The variable at each node is the local policy of that agent
    The objective is to optimize the expected joint reward
  [Figure: interaction hypergraph over Ag1 to Ag5 with example links R1 and R12]
    R1: Ag1's cost for scanning
    R12: reward for Ag1 and Ag2 tracking a target
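The neighborhood definition translates directly into code. The sketch below assumes a five-agent chain with one single-agent link. The exact edge set of the figure is not recoverable, so `AGENTS` and `HYPEREDGES` are hypothetical.

```python
# Hypothetical interaction hypergraph: one unary link and a chain of binary links.
AGENTS = [1, 2, 3, 4, 5]
HYPEREDGES = [(1,), (1, 2), (2, 3), (3, 4), (4, 5)]


def neighborhoods(agents, hyperedges):
    """Return {i: N_i}, where N_i holds every other agent sharing a hyperedge with i."""
    neighbors = {i: set() for i in agents}
    for link in hyperedges:
        for i in link:
            neighbors[i].update(j for j in link if j != i)
    return neighbors
```

A single-agent link such as `(1,)` contributes a reward component but adds no neighbors.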
ND-POMDP theorems
  Theorem 1: For an ND-POMDP, the expected reward for a policy π is the sum of the expected rewards for each of the links under π
    The global value function is decomposable into value functions for each link
  Local neighborhood utility Vπ[Ni]: the expected reward obtained from all links involving agent i when executing policy π
  Theorem 2 (locality of interaction): For policies π and π', if πi = π'i and πNi = π'Ni, then Vπ[Ni] = Vπ'[Ni]
    Given its neighbors' policies, the local neighborhood utility of agent i does not depend on any non-neighbor's policy
LID-JESP
  LID-JESP algorithm (based on the Distributed Breakout Algorithm):
    1. Choose a local policy randomly
    2. Communicate the local policy to neighbors
    3. Compute the local neighborhood utility of the current policy with respect to the neighbors' policies
    4. Compute the local neighborhood utility of the best response policy with respect to the neighbors' policies (GetValue)
    5. Communicate the gain (value from step 4 minus value from step 3) to neighbors
    6. If the gain is greater than the gains of all neighbors:
         Change the local policy to the best response policy
         Communicate the changed policy to neighbors
    7. If termination has not been reached, go to step 3
  Theorem 3: Global utility is strictly increasing with each iteration until a local optimum is reached
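The loop above can be sketched in a single process under strong simplifications. Each agent picks from a tiny enumerated policy set instead of solving a POMDP in GetValue. Link values come from a made-up table. Ties in gain are broken in favor of the lower agent id. Termination is checked globally rather than with distributed counters. All names and numbers here are hypothetical.

```python
# Hypothetical link values for a 3-agent chain; policies are just labels.
TABLE = {
    (1, 2): {("east", "east"): 0, ("east", "west"): 3,
             ("west", "east"): 5, ("west", "west"): 1},
    (2, 3): {("east", "east"): 2, ("east", "west"): 0,
             ("west", "east"): 4, ("west", "west"): 1},
}
LINKS = list(TABLE)
POLICY_SPACE = ["east", "west"]


def local_utility(i, policies):
    """Local neighborhood utility: sum over the links that involve agent i."""
    return sum(TABLE[l][tuple(policies[j] for j in l)] for l in LINKS if i in l)


def best_response(i, policies):
    """Stand-in for GetValue: enumerate agent i's policies, neighbors held fixed."""
    best_p, best_v = policies[i], local_utility(i, policies)
    for p in POLICY_SPACE:
        v = local_utility(i, {**policies, i: p})
        if v > best_v:
            best_p, best_v = p, v
    return best_p, best_v


def lid_jesp_sketch(agents, initial, max_cycles=100):
    policies = dict(initial)
    neighbors = {i: {j for l in LINKS if i in l for j in l if j != i} for i in agents}
    for cycle in range(max_cycles):
        gains, responses = {}, {}
        for i in agents:  # steps 3 to 5: every agent computes its gain
            responses[i], best_v = best_response(i, policies)
            gains[i] = best_v - local_utility(i, policies)
        if all(g <= 0 for g in gains.values()):
            return policies, cycle  # no agent can improve: local optimum
        for i in agents:  # step 6: only the local winner changes its policy
            wins = all((gains[i], -i) > (gains[j], -j) for j in neighbors[i])
            if gains[i] > 0 and wins:
                policies[i] = responses[i]
    return policies, max_cycles
```

Two neighbors can never both win a cycle, so the winners' changes do not interfere. Non-neighboring agents may still change in the same cycle.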
Termination Detection
  Each agent maintains a termination counter
    Reset to zero if gain > 0, else increment by 1
    Exchange counter with neighbors
    Set counter to the minimum of own counter and neighbors' counters
  Termination is detected when counter = d (the diameter of the graph)
  Theorem 4: LID-JESP will terminate within d cycles of reaching a local optimum
  Theorem 5: If LID-JESP terminates, the agents are in a local optimum
  From Theorems 3-5, LID-JESP will terminate in a local optimum, within d cycles of reaching it
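One cycle of the counter update can be sketched as follows, assuming synchronous cycles simulated in one process. The function names and the dictionary-based message exchange are illustrative choices, not part of the original algorithm description.

```python
def update_counters(counters, gains, neighbors):
    """Apply one cycle of the termination-counter rule to every agent."""
    # Local step: reset on a positive gain, otherwise count one more quiet cycle.
    local = {i: 0 if gains[i] > 0 else counters[i] + 1 for i in counters}
    # Exchange step: keep the minimum of own and neighbors' counters.
    return {i: min([local[i]] + [local[j] for j in neighbors[i]]) for i in counters}


def terminated(counters, diameter):
    """An agent detects termination once its counter reaches the graph diameter."""
    return {i: c >= diameter for i, c in counters.items()}
```

Taking the minimum lets a reset at one agent travel one hop per cycle. A counter can therefore only reach d if no agent anywhere in the graph had a positive gain during the last d cycles.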
Computing best response policy
  Given its neighbors' fixed policies, each agent faces a single-agent POMDP
  State: [expression not preserved]
    Note: the state is not fully observable
  Transition function: [expression not preserved]
  Observation function: [expression not preserved]
  Reward function: [expression not preserved]
  The best response is computed using a Bellman backup approach
Global Optimal Algorithm (GOA)
  Similar to variable elimination
  Relies on a tree-structured interaction graph
    A cycle cutset algorithm is used to eliminate cycles
    Assumes only binary interactions
  Phase 1: values are propagated upwards from the leaves to the root
    For each of its policies, an agent sums the values of its children's optimal responses
    It computes the value of its optimal response to each of its parent's policies
    It communicates these values to its parent
  Phase 2: policies are propagated downwards from the root to the leaves
    Each agent chooses the policy that is its optimal response to its parent's policy
    It communicates its policy to its children
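The two phases can be sketched on a small tree, assuming each agent has an enumerated set of policies and each parent-child link has a known expected reward per policy pair. The chain, the policy labels, and the values in `TABLE` are hypothetical. In the real algorithm the link values come from evaluating policy pairs on the ND-POMDP.

```python
# Hypothetical expected rewards per (parent policy, child policy) on each tree edge.
TABLE = {
    (1, 2): {("east", "east"): 0, ("east", "west"): 3,
             ("west", "east"): 5, ("west", "west"): 1},
    (2, 3): {("east", "east"): 2, ("east", "west"): 0,
             ("west", "east"): 4, ("west", "west"): 1},
}
CHILDREN = {1: [2], 2: [3]}   # chain rooted at agent 1
POLICY_SPACE = ["east", "west"]


def goa_tree(root, children, policy_space, table):
    response = {}  # response[i][parent_policy] = (value, agent i's best policy)

    def subtree_eval(i, p_i):
        """Sum of the children's optimal-response values to agent i's policy p_i."""
        return sum(response[c][p_i][0] for c in children.get(i, []))

    def propagate_up(i, parent):
        """Phase 1: children first, then agent i's best response to its parent."""
        for c in children.get(i, []):
            propagate_up(c, i)
        if parent is None:
            return
        response[i] = {
            p_j: max(((table[(parent, i)][(p_j, p_i)] + subtree_eval(i, p_i), p_i)
                      for p_i in policy_space), key=lambda t: t[0])
            for p_j in policy_space
        }

    propagate_up(root, None)
    value, root_policy = max(((subtree_eval(root, p), p) for p in policy_space),
                             key=lambda t: t[0])
    # Phase 2: push the chosen policies down from the root to the leaves.
    policies, stack = {root: root_policy}, [root]
    while stack:
        i = stack.pop()
        for c in children.get(i, []):
            policies[c] = response[c][policies[i]][1]
            stack.append(c)
    return policies, value
```

Each agent stores one value per parent policy, so the work per agent does not grow with the number of agents in the tree.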
Experiments
  Compared to:
    LID-JESP-no-nw: ignores the interaction graph
    JESP: centralized solver (Nair2003)
  3-agent chain
    LID-JESP is exponentially faster than GOA
  4-agent chain
    LID-JESP is faster than JESP and LID-JESP-no-nw
Experiments
  5-agent chain
    LID-JESP is much faster than JESP and LID-JESP-no-nw
  Values:
    LID-JESP values are comparable to GOA
    Random restarts can be used to find the global optimum
Experiments
  Reasons for speedup (table columns):
    C: number of cycles
    G: number of GetValue calls
    W: number of agents that change their policies in a cycle
  LID-JESP converges in fewer cycles (column C)
  LID-JESP allows multiple agents to change their policies in a single cycle (column W)
  JESP has fewer GetValue calls than LID-JESP (column G)
    But each such call was slower
Complexity
  Complexity of best response:
    JESP: O(|S|^2 · |Ai| · ∏j |Ωj|^T)
      Depends on the entire world state
      Depends on the observation histories of all agents
    LID-JESP: O(|Su × Si × SNi|^2 · |Ai| · ∏j∈Ni |Ωj|^T)
      Depends only on Su, Si, and SNi
      Depends on the observation histories of only the neighbors
    Increasing the number of agents does not affect this complexity, given a fixed number of neighbors
  Complexity of GOA:
    Brute-force global optimum: O(∏j |πj| · |S|^2 · ∏j |Ωj|^T)
    GOA: O(n · |πj| · |Su × Si × Sj|^2 · |Ai| · |Ωi|^T · |Ωj|^T)
    Increasing the number of agents causes a linear increase in run time
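To get a feel for the two best-response bounds, the sketch below plugs hypothetical sizes into them. It assumes every agent has the same number of local states, actions, and observations, and it reads the JESP product over j literally as running over all n agents. The functions return the expression inside the O(·), not a measured run time.

```python
def jesp_best_response_cost(n_agents, s_local, s_u, n_actions, n_obs, horizon):
    """|S|^2 * |Ai| * prod_j |Omega_j|^T, with |S| = |Su| * |Si|^n."""
    world_states = s_u * s_local ** n_agents
    obs_histories = (n_obs ** horizon) ** n_agents
    return world_states ** 2 * n_actions * obs_histories


def lid_jesp_best_response_cost(n_neighbors, s_local, s_u, n_actions, n_obs, horizon):
    """|Su x Si x SNi|^2 * |Ai| * prod_{j in Ni} |Omega_j|^T."""
    local_states = s_u * s_local * s_local ** n_neighbors
    obs_histories = (n_obs ** horizon) ** n_neighbors
    return local_states ** 2 * n_actions * obs_histories
```

The LID-JESP expression takes the number of neighbors, not the number of agents, as its argument. With a fixed neighborhood size it stays constant as the network grows, while the JESP expression grows exponentially in n.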
Conclusions
  DCOP algorithms are applied to finding solutions to distributed POMDPs
    Exploiting “locality of interaction” reduces run time
  LID-JESP is based on DBA
    Agents converge to a locally optimal joint policy
  GOA is based on variable elimination
  First distributed parallel algorithms for distributed POMDPs
  Complexity increases linearly with the number of agents, given a fixed number of neighbors
Future Work
  How can communication be incorporated?
    Will introducing communication cause agents to lose locality of interaction?
  Remove the assumption of transition independence
    May cause all agents to be dependent on each other
  Other globally optimal algorithms
    Increased parallelism
Backup slides
Global Optimal
  Considers only binary constraints; can be applied to n-ary constraints
  Runs a distributed cycle cutset algorithm in case the graph is not a tree
  Algorithm:
    Convert the graph into trees and a cycle cutset C
    For each possible joint policy πC of the agents in C
      Val[πC] = 0
      For each tree of agents
        Val[πC] += DP-Global(tree, πC)
    Choose the joint policy with the highest value
Global Optimal Algorithm (GOA)
  Similar to variable elimination
  Relies on a tree-structured interaction graph
    A cycle cutset algorithm is used to eliminate cycles
    Assumes only binary interactions
  Phase 1: values are propagated upwards from the leaves to the root
    From the deepest nodes in the tree to the root, do:
      1. For each of agent i's policies πi:
           eval(πi) ← ∑ci value[πi, ci], where value[πi, ci] is received from child ci
      2. For each policy πj of parent j:
           value[πj, i] ← −∞
           For each of agent i's policies πi:
             current-eval ← expected-reward(πj, πi) + eval(πi)
             If value[πj, i] < current-eval, then value[πj, i] ← current-eval
           Send value[πj, i] to parent j
  Phase 2: policies are propagated downwards from the root to the leaves