Multiagent Planning with Factored MDPs
Carlos Guestrin, Daphne Koller (Stanford University), Ronald Parr (Duke University)
Multiagent Coordination Examples
- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
Common challenges: multiple, simultaneous decisions; limited observability; limited communication.
Network Management Problem
Administrators must coordinate to maximize the global reward.
[Figure: ring of machines M1-M4 from time t to t+1; each machine i has status Si, load Li, action Ai, and reward Ri, received when a process terminates successfully; neighboring machines influence each other.]
Joint Decision Space
Represent as an MDP:
- Action space: joint action a = {a1, ..., an} of all agents
- State space: joint state x of the entire system
- Reward function: total reward r
But the action space is exponential in the number of agents, the state space is exponential in the number of variables, and a global decision requires complete observation.
Long-term Utilities
One-step utility: SysAdmin Ai receives a reward ($) when a process completes.
Total utility: sum of rewards; the optimal action requires long-term planning.
Long-term utility Q(x,a): expected long-term reward, given current state x and action a.
Optimal action at state x: argmax_a Q(x,a).
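In standard discounted-MDP notation (the discount factor γ and the trajectory expectation are assumptions of this restatement, not spelled out on the slide), the long-term utility and the greedy action choice are:

```latex
\[
Q(\mathbf{x},\mathbf{a}) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R_t \;\middle|\; \mathbf{x}_0=\mathbf{x},\ \mathbf{a}_0=\mathbf{a},\ \text{acting optimally thereafter}\right],
\qquad
\mathbf{a}^{*}(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}}\, Q(\mathbf{x},\mathbf{a}).
\]
```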
Local Q Function Approximation
Q(A1,...,A4, X1,...,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Q3 is associated with Agent 3.
Limited observability: agent i observes only the variables appearing in Qi; Agent 3 observes only X2 and X3.
The agents must choose the joint action that maximizes Σi Qi.
[Figure: ring of machines M1-M4; Q3 covers the part of the system observed by Agent 3.]
Maximizing Σi Qi: Coordination Graph
Use variable elimination for the maximization [Bertele & Brioschi '72]. Once each agent instantiates the state variables it observes, each Qi depends only on actions:
max_{A1,A2,A3,A4} [ Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4) ]
  = max_{A1,A2,A3} [ Q2(A1,A2) + Q3(A2,A3) + max_{A4} [ Q1(A1,A4) + Q4(A3,A4) ] ]
  = max_{A1,A2,A3} [ Q2(A1,A2) + Q3(A2,A3) + g1(A1,A3) ]
Here we need only 23, instead of 63, sum operations.
Limited communication suffices for the optimal action choice: the communication bandwidth equals the induced width of the coordination graph.
[Figure: coordination graph with nodes A1, A2, A3, A4.]
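As a concrete illustration, here is a minimal Python sketch of this max-sum variable elimination over the four local Q functions; the payoff numbers, the binary action domains, and the elimination order (A4, A3, A2) are my own assumptions for the example, not taken from the slides.

```python
import itertools

# Each local Q_i is a factor (scope, table) over a small subset of the binary
# agent actions; the numeric payoffs are made up purely for illustration.
def table(scope, fn):
    """Build a factor: (scope, {assignment of scope -> value})."""
    return (scope, {vals: fn(*vals)
                    for vals in itertools.product((0, 1), repeat=len(scope))})

factors = [
    table(("A1", "A4"), lambda a1, a4: 1.0 * a1 + 0.5 * a4),   # Q1(A1, A4)
    table(("A1", "A2"), lambda a1, a2: 0.3 * a1 * a2),          # Q2(A1, A2)
    table(("A2", "A3"), lambda a2, a3: 0.8 * a2 - 0.2 * a3),    # Q3(A2, A3)
    table(("A3", "A4"), lambda a3, a4: 0.4 * (a3 ^ a4)),        # Q4(A3, A4)
]

def eliminate(factors, var):
    """Max out `var`: combine every factor mentioning it into one new factor."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching
                              for v in scope if v != var}))
    g = {}
    for assign in itertools.product((0, 1), repeat=len(new_scope)):
        best = float("-inf")
        for val in (0, 1):
            full = dict(zip(new_scope, assign))
            full[var] = val
            best = max(best, sum(t[tuple(full[v] for v in scope)]
                                 for scope, t in touching))
        g[assign] = best
    return rest + [(new_scope, g)]

# Eliminate A4, A3, A2 in turn; the single remaining factor is over A1 only.
for var in ("A4", "A3", "A2"):
    factors = eliminate(factors, var)
(scope, g), = factors
print("max_a sum_i Q_i(a) =", max(g.values()))
```

The intermediate factor produced when A4 is eliminated plays the role of g1(A1, A3) in the derivation above.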
Where do the Qi come from?
Use function approximation to find the Qi:
Q(X1,...,X4, A1,...,A4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Long-term planning requires a Markov decision process, but the number of states and the number of actions are exponential.
Efficient approximation is possible by exploiting structure!
Dynamic Decision Diagram
[Figure: two-time-slice dynamic decision diagram for machines M1-M4, with state variables X1-X4, decision nodes A1-A4, reward nodes R1-R4, and next-state variables X1'-X4'.]
State dynamics, decisions, and rewards are all represented by local factors, e.g. P(X1' | X1, X4, A1).
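As a concrete sketch of the factored dynamics, the snippet below stores P(X1' | X1, X4, A1) as a small conditional probability table and samples a next status from it; the table name, the probability values, and the binary status/reboot encoding are all invented for illustration.

```python
import random

# Hypothetical CPT for P(X1' = working | X1, X4, A1) in the SysAdmin ring:
# machine 1's next status depends only on its own status, its neighbor's
# status, and whether agent 1 reboots it.  All probabilities are made up.
P_X1_WORKS = {
    # (X1 working, X4 working, A1 reboots) -> probability X1' is working
    (1, 1, 0): 0.95,
    (1, 0, 0): 0.70,   # a faulty neighbor can bring machine 1 down
    (0, 1, 0): 0.05,
    (0, 0, 0): 0.02,
    (1, 1, 1): 1.00,   # rebooting restores the machine
    (1, 0, 1): 1.00,
    (0, 1, 1): 1.00,
    (0, 0, 1): 1.00,
}

def sample_x1_next(x1, x4, a1, rng=random):
    """Sample X1' from the local conditional P(X1' | X1, X4, A1)."""
    return 1 if rng.random() < P_X1_WORKS[(x1, x4, a1)] else 0

# The full transition model is the product of such local factors, one per
# machine, so the exponential joint transition table is never built.
```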
Long-term Utility = Value of the MDP
Value computed by linear programming:
- one variable V(x) for each state x
- one constraint for each state x and action a
But the number of states and actions is exponential!
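Written out, this is the standard exact linear program for MDPs (the discount factor γ and the positive state-relevance weights α(x) are standard notation assumed here, not shown on the slide):

```latex
\[
\begin{aligned}
\min_{V}\quad & \sum_{\mathbf{x}} \alpha(\mathbf{x})\, V(\mathbf{x}) \\
\text{s.t.}\quad & V(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a})\, V(\mathbf{x}')
\qquad \forall\, \mathbf{x},\ \mathbf{a}.
\end{aligned}
\]
```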
Decomposable Value Functions
Linear combination of restricted-domain basis functions [Bellman et al. '63], [Tsitsiklis & Van Roy '96], [Koller & Parr '99, '00], [Guestrin et al. '01]:
V(x) ≈ Σi wi hi(x)
Each hi depends on the status of a small part of a complex system, e.g. the status of a machine and its neighbors, or the load on a machine.
We must find weights w giving a good approximate value function.
Single LP Solution for Factored MDPs [Schweitzer and Seidmann '85]
- One variable wi for each basis function: polynomially many LP variables.
- One constraint for every state and action: exponentially many LP constraints.
- The hi and Qi depend only on small sets of variables and actions.
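Substituting V(x) ≈ Σi wi hi(x) into the exact LP above gives the approximate LP of [Schweitzer and Seidmann '85]; a standard statement (again with assumed state-relevance weights α(x)) is:

```latex
\[
\begin{aligned}
\min_{w}\quad & \sum_{\mathbf{x}} \alpha(\mathbf{x}) \sum_i w_i\, h_i(\mathbf{x}) \\
\text{s.t.}\quad & \sum_i w_i\, h_i(\mathbf{x}) \;\ge\; R(\mathbf{x},\mathbf{a}) \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
\qquad \forall\, \mathbf{x},\ \mathbf{a}.
\end{aligned}
\]
```

The number of free variables drops from one per state to one per basis function, but the constraint set is still indexed by every state-action pair.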
Representing Exponentially Many Constraints [Guestrin et al. '01]
Exponentially many linear constraints are equivalent to one nonlinear constraint.
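Concretely, the exponentially many constraints of the approximate LP hold for every (x, a) exactly when the following single max constraint holds:

```latex
\[
0 \;\ge\; \max_{\mathbf{x},\mathbf{a}} \Big[\, R(\mathbf{x},\mathbf{a}) \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x},\mathbf{a}) \sum_i w_i\, h_i(\mathbf{x}')
\;-\; \sum_i w_i\, h_i(\mathbf{x}) \,\Big].
\]
```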
Representing the Constraints
Because the functions are factored, variable elimination can be used to represent the constraints compactly: the number of constraints becomes exponentially smaller.
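A worked two-factor illustration of this idea (my own example, in the spirit of [Guestrin et al. '01]): suppose the bracketed function in the max constraint decomposes as f1(x1, x2) + f2(x2, x3), each term linear in the weights w. Eliminating x1 introduces one new LP variable u(x2) per value of x2, and the single exponential constraint becomes two small families of linear constraints:

```latex
\[
0 \;\ge\; \max_{x_1,x_2,x_3}\big[\, f_1(x_1,x_2) + f_2(x_2,x_3) \,\big]
\quad\Longleftrightarrow\quad
\begin{cases}
u(x_2) \;\ge\; f_1(x_1,x_2) & \forall\, x_1, x_2, \\
0 \;\ge\; u(x_2) + f_2(x_2,x_3) & \forall\, x_2, x_3.
\end{cases}
\]
```

The equivalence holds because the LP is free to set u(x2) = max_{x1} f1(x1, x2); repeating the elimination over a good ordering yields a number of constraints exponential only in the induced width, not in the total number of variables.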
Summary of Algorithm
1. Pick local basis functions hi.
2. Solve a single LP to compute the local Qi's in the factored MDP.
3. Use the coordination graph to compute the maximizing joint action.
Network Management Problem
[Figure: network topologies used in the experiments: unidirectional ring, star (with a central server), and ring of rings.]
Single Agent Policy Quality: Single LP versus Approximate Policy Iteration
[Plot: discounted reward vs. number of machines; curves: LP single basis, PI single basis, LP pair basis, LP triple basis. PI = approximate policy iteration with max-norm projection [Guestrin et al. '01].]
Single Agent Running Time
[Plot: running time for LP single basis, PI single basis, LP pair basis, and LP triple basis. PI = approximate policy iteration with max-norm projection [Guestrin et al. '01].]
Multiagent Policy Quality
Comparison with the Distributed Reward and Distributed Value Function algorithms of [Schneider et al. '99].
[Plot: policy quality; curves: LP single basis, LP pair basis, distributed reward, distributed value.]
Multiagent Running Time
[Plot: running time; curves: star single basis, star pair basis, ring of rings.]
Conclusions
- Multiagent planning algorithm with limited communication and limited observability.
- Unified view of function approximation and multiagent communication.
- The single-LP solution is simple and very efficient.
- Exploit structure to reduce computation costs and solve very large MDPs efficiently!
Solve Very Large MDPs
Solved MDPs with 500 agents, whose joint state and action spaces are exponential in the number of agents.