Multiagent Planning with Factored MDPs Carlos Guestrin Stanford University
Collaborative Multiagent Planning
Examples: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control
Long-term goals, multiple agents, coordinated decisions
Exploiting Structure
Real-world problems have: hundreds of objects, googols of states
Real-world problems have structure!
Approach: exploit the structured representation to obtain an efficient approximate solution
Real-time Strategy Game
Peasants collect resources and build
Footmen attack enemies
Buildings train peasants and footmen
[Figure: game screenshot with a peasant, a footman, and a building labeled]
Joint Decision Space
Markov Decision Process (MDP) representation:
State space: joint state x of the entire system
Action space: joint action a = {a1, …, an} for all agents
Reward function: total reward R(x,a)
Transition model: dynamics of the entire system P(x'|x,a)
Policy
Policy: π(x) = a, i.e. at state x, take action a for all agents
π(x0) = both peasants get wood
π(x1) = one peasant gets gold, the other builds the barracks
π(x2) = peasants get gold, footmen attack
Value of Policy
Value V_π(x): expected long-term reward starting from x
Start from x0:
V_π(x0) = E[ R(x0) + γ R(x1) + γ^2 R(x2) + γ^3 R(x3) + γ^4 R(x4) + … ]
Future rewards discounted by γ ∈ [0,1)
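As a concrete illustration of the discounted value above, the following sketch estimates V_π(x0) by simulating rollouts of a fixed policy and averaging the discounted returns; the two-state chain, rewards, and policy are made up for illustration, not taken from the talk.

```python
# Minimal sketch: estimate V_pi(x0) = E[R(x0) + g*R(x1) + g^2*R(x2) + ...]
# by averaging discounted returns over simulated rollouts.  The toy chain,
# rewards, and policy below are illustrative.
import random

gamma = 0.9

def policy(x):                    # fixed policy pi(x) = a
    return "work" if x == "poor" else "rest"

def step(x, a):                   # toy dynamics and reward R(x)
    reward = 1.0 if x == "rich" else 0.0
    if a == "work":
        next_x = "rich" if random.random() < 0.5 else "poor"
    else:
        next_x = x
    return next_x, reward

def estimate_value(x0, n_rollouts=5000, horizon=100):
    total = 0.0
    for _ in range(n_rollouts):
        x, discount, ret = x0, 1.0, 0.0
        for _ in range(horizon):
            a = policy(x)
            x, r = step(x, a)
            ret += discount * r   # accumulate gamma^t * R(x_t)
            discount *= gamma
        total += ret
    return total / n_rollouts

print("V_pi(poor) ~", estimate_value("poor"))
```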
Optimal Long-term Plan
Optimal value function V*(x); optimal policy π*(x)
Bellman equations:
Q*(x,a) = R(x,a) + γ Σ_x' P(x'|x,a) V*(x')
V*(x) = max_a Q*(x,a)
Optimal policy: π*(x) = argmax_a Q*(x,a)
Solving an MDP
Many algorithms solve the Bellman equations:
Policy iteration [Howard '60, Bellman '57]
Value iteration [Bellman '57]
Linear programming [Manne '60]
…
Solving the Bellman equations yields the optimal value V*(x) and the optimal policy π*(x)
LP Solution to MDP
Value computed by linear programming [Manne '60]:
minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x,a), ∀ x, a
One variable V(x) for each state
One constraint for each state x and action a
Polynomial time solution
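A small runnable sketch of this LP, using scipy.optimize.linprog on a hypothetical two-state, two-action MDP (all numbers are made up); the constraints V(x) ≥ R(x,a) + γ Σ_x' P(x'|x,a) V(x') are rewritten in linprog's ≤ form.

```python
# Minimal sketch of the LP formulation of a (flat) MDP, solved with
# scipy.optimize.linprog on a made-up 2-state, 2-action example.
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 2, 2
R = np.array([[1.0, 0.0],            # R[x, a]: toy rewards
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, x, x']: toy transitions
              [[0.7, 0.3], [0.1, 0.9]]])

# One LP variable V(x) per state; minimize sum_x V(x).
c = np.ones(n_states)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        # V(x) >= R(x,a) + gamma * P(.|x,a) @ V   rewritten as
        # (gamma * P(.|x,a) - e_x) @ V <= -R(x,a)
        row = gamma * P[a, x].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
V = res.x
policy = [max(range(n_actions),
              key=lambda a: R[x, a] + gamma * P[a, x] @ V)
          for x in range(n_states)]
print("V* =", V, "policy =", policy)
```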
Planning under Bellman's "Curse"
Planning is polynomial in #states and #actions
#states is exponential in the number of variables
#actions is exponential in the number of agents
Efficient approximation by exploiting structure!
Structure in Representation: Factored MDP
[Figure: dynamic Bayesian network over time slices t and t+1 with state variables Peasant, Footman, Enemy, Gold, action variables A_Peasant, A_Build, A_Footman, and reward R]
State, dynamics, decisions and rewards are represented compactly, e.g. P(F'|F, G, A_B, A_F)
Complexity of representation: exponential in #parents (worst case) [Boutilier et al. '95]
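To make the factored representation concrete, here is a minimal sketch of one local conditional distribution in a DBN-style transition model; the variable names, parents, and probabilities are illustrative, not the exact model from the talk.

```python
# Minimal sketch of a factored (DBN-style) transition model: each next-state
# variable has a conditional distribution over a small set of parents rather
# than over the full joint state.  All names and numbers are illustrative.
import random

# P(F' | F, G, A_F): footman status depends only on its parents.
footman_cpt = {
    # (F, G, A_F) -> probability that F' = "alive"
    ("alive", "has_gold", "attack"): 0.8,
    ("alive", "has_gold", "wait"):   0.95,
    ("alive", "no_gold",  "attack"): 0.6,
    ("alive", "no_gold",  "wait"):   0.9,
    ("dead",  "has_gold", "attack"): 0.0,
    ("dead",  "has_gold", "wait"):   0.0,
    ("dead",  "no_gold",  "attack"): 0.0,
    ("dead",  "no_gold",  "wait"):   0.0,
}

def sample_footman_next(F, G, A_F):
    """Sample F' from the local CPT -- only the parents of F' are consulted."""
    p_alive = footman_cpt[(F, G, A_F)]
    return "alive" if random.random() < p_alive else "dead"

# The full joint transition P(x'|x,a) is the product of such local factors,
# so the representation size is exponential only in the number of parents.
print(sample_footman_next("alive", "no_gold", "attack"))
```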
Structured Value Function?
Does a factored MDP imply structure in V*? Almost!
A structured V yields a good approximate value function
[Figure: DBN unrolled over time steps t, t+1, t+2, t+3 with state variables X, Y, Z and rewards R]
Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00] [Guestrin et al. '01]
Structured V: V(x) ≈ Σ_i w_i h_i(x)
Each h_i is the status of a small part of the complex system:
state of footman and enemy
status of barracks
status of barracks and state of footman
Structured Q: Q ≈ Σ_i Q_i, where each Q_i depends on a small number of A_i's and X_j's
Must find w giving a good approximate value function
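A minimal sketch of such a linear value function: a handful of basis functions, each with a small scope, combined with weights (which the approximate LP below would compute); the state variables and numbers are illustrative.

```python
# Minimal sketch of a linear value-function approximation: V(x) is a weighted
# sum of basis functions h_i, each looking only at a small part of the state.
def h_footman(state):
    """Basis function over a small scope: footman and enemy health only."""
    return 1.0 if state["footman_h"] > state["enemy_h"] else 0.0

def h_barracks(state):
    """Basis function over a single variable: barracks status."""
    return 1.0 if state["barracks"] == "built" else 0.0

basis = [lambda s: 1.0, h_footman, h_barracks]   # include a constant basis
weights = [2.0, 5.0, 3.0]                        # w_i would come from the LP

def V_approx(state):
    """V~(x) = sum_i w_i * h_i(x)."""
    return sum(w * h(state) for w, h in zip(weights, basis))

print(V_approx({"footman_h": 3, "enemy_h": 1, "barracks": "built"}))
```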
Approximate LP Solution [Schweitzer and Seidmann '85]
minimize: Σ_x Σ_i w_i h_i(x)
subject to: Σ_i w_i h_i(x) ≥ Σ_i Q_i(x,a), ∀ x, a
One variable w_i for each basis function → polynomial number of LP variables
One constraint for every state and action → exponentially many LP constraints
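For comparison with the factored LP described next, here is a sketch of this approximate LP written naively, with the constraints enumerated over every (x, a) pair of a tiny toy MDP; this explicit enumeration is exactly what the factored LP avoids. The MDP, basis functions, and numbers are all illustrative.

```python
# Minimal sketch of the approximate LP with basis functions, constraints
# enumerated explicitly over all (x, a) pairs of a toy MDP (the naive,
# exponential version).  All numbers and basis functions are illustrative.
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
R = rng.random((n_states, n_actions))                    # toy rewards
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                        # toy transitions

# H[x, i] = h_i(x): a constant basis plus two small local indicators.
H = np.column_stack([np.ones(n_states),
                     (np.arange(n_states) % 2 == 0).astype(float),
                     (np.arange(n_states) >= 2).astype(float)])
n_basis = H.shape[1]

# Objective: minimize sum_x sum_i w_i h_i(x) = (sum_x H) @ w.
c = H.sum(axis=0)

# Constraint for each (x, a):
#   sum_i w_i [h_i(x) - gamma * sum_x' P(x'|x,a) h_i(x')] >= R(x, a)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = H[x] - gamma * P[a, x] @ H
        A_ub.append(-row)          # flip sign for linprog's <= convention
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_basis)
print("basis weights w =", res.x)
```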
Representing Exponentially Many Constraints [Guestrin, Koller, Parr '01]
Exponentially many linear constraints ⇔ one nonlinear constraint:
0 ≥ max_{x,a} [ Σ_i Q_i(x,a) − Σ_i w_i h_i(x) ]
But this is a maximization over an exponential space
Variable Elimination
[Figure: graph over variables A, B, C, D]
Variable elimination to maximize over the state space [Bertele & Brioschi '72]:
max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
= max_{A,B,C} f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ]
= max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C)
Here we need only 23, instead of 63 sum operations
Maximization is only exponential in the largest factor
Tree-width characterizes complexity: a graph-theoretic measure of "connectedness"
Arises in many settings: integer programming, Bayes nets, computational geometry, …
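A minimal sketch of variable elimination for maximizing a sum of local functions, following the f1…f4 example above with binary variables; the factor values are made up, and only the elimination mechanics are intended to match the slide.

```python
# Minimal sketch of variable elimination for
#   max_{A,B,C,D} f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)
# with binary variables.  Factor values are illustrative.
from itertools import product

def eliminate(factors, var, domain=(0, 1)):
    """Max out `var`: combine all factors mentioning it into one new factor."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {var}))
    new_table = {}
    for assignment in product(domain, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assignment))
        new_table[assignment] = max(
            sum(table[tuple({**ctx, var: val}[v] for v in scope)]
                for scope, table in touching)
            for val in domain)
    return rest + [(new_scope, new_table)]

# Factors as (scope, table) pairs; table maps value tuples to numbers.
f = lambda scope, fn: (scope, {vals: fn(*vals) for vals in product((0, 1), repeat=len(scope))})
factors = [f(("A", "B"), lambda a, b: a + b),
           f(("A", "C"), lambda a, c: 2 * a * c),
           f(("C", "D"), lambda c, d: c - d),
           f(("B", "D"), lambda b, d: 3 * b * d)]

for var in ("D", "C", "B", "A"):          # elimination order
    factors = eliminate(factors, var)
print("max value =", factors[0][1][()])   # single factor with empty scope
```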
Representing the Constraints Use Variable Elimination to represent constraints: Number of constraints exponentially smaller!
Understanding Scaling Properties
Explicit LP: 2^n variables and 2^n constraints
Factored LP: roughly (n+1−k)·2^k constraints, where k = tree-width
[Plot: number of LP constraints vs. number of variables n, for the explicit LP and the factored LP with k = 3, 5, 8, 10, 12]
Network Management Problem
Topologies: ring, star, ring of rings, k-grid
Computer status = {good, dead, faulty}
Dead neighbors increase the probability of dying
Each computer runs processes; reward for successful processes
Each SysAdmin takes a local action = {reboot, not reboot}
Problem with n machines: 9^n states, 2^n actions
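A minimal sketch of the SysAdmin dynamics described above, for a ring topology: a machine's chance of dying grows with the number of dead neighbors, and a local reboot action resets it. The specific probabilities and reward are made up for illustration.

```python
# Minimal sketch of SysAdmin-style ring dynamics; probabilities are made up.
import random

def step_ring(status, actions, p_fail=0.05, p_neighbor=0.3):
    """One transition of a ring of machines.

    status  -- list of "good"/"faulty"/"dead", one per machine
    actions -- list of "reboot"/"no_reboot", one per machine (local decisions)
    """
    n = len(status)
    new_status = []
    for i in range(n):
        if actions[i] == "reboot":
            new_status.append("good")
            continue
        dead_neighbors = sum(status[j] == "dead" for j in ((i - 1) % n, (i + 1) % n))
        p_die = min(1.0, p_fail + p_neighbor * dead_neighbors)
        if status[i] == "dead":
            new_status.append("dead")          # stays dead until rebooted
        elif random.random() < p_die:
            new_status.append("dead")
        elif status[i] == "good" and random.random() < p_die:
            new_status.append("faulty")        # may degrade without dying
        else:
            new_status.append(status[i])
    return new_status

state = ["good"] * 6
state = step_ring(state, ["no_reboot"] * 6)
reward = sum(s == "good" for s in state)       # e.g. reward for working machines
print(state, reward)
```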
Running Time
[Plot: running time vs. problem size; curves: ring exact solution, ring single basis (k=4), star single basis (k=4), 3-grid single basis (k=5), star pair basis (k=4), ring pair basis (k=8); k = tree-width]
Summary of Algorithm 1.Pick local basis functions h i 2.Factored LP computes value function 3.Policy is argmax a of Q
Large-scale Multiagent Coordination
Efficient algorithm computes V
The action at state x is argmax_a Q(x,a), but:
#actions is exponential
requires complete observability
requires full communication
Distributed Q Function [Guestrin, Koller, Parr '02]
Each agent maintains a part of the Q function (distributed Q function):
Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Multiagent Action Selection
Distributed Q function: Q1(A1,A4, X1,X4), Q2(A1,A2, X1,X2), Q3(A2,A3, X2,X3), Q4(A3,A4, X3,X4)
Steps: instantiate the current state x, then compute the maximal action argmax_a
Instantiate Current State x
Conditioning on the current state turns Q1(A1,A4, X1,X4), Q2(A1,A2, X1,X2), Q3(A2,A3, X2,X3), Q4(A3,A4, X3,X4) into Q1(A1,A4), Q2(A1,A2), Q3(A2,A3), Q4(A3,A4)
For example, agent 2 observes only X1 and X2
Limited observability: agent i only observes the variables in Q_i
Multiagent Action Selection
With the state instantiated, the maximal joint action is argmax_a of Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
Coordination Graph
Use variable elimination to maximize over the joint action:
max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ]
= max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4)
Eliminating A3 yields a new function g1(A2,A4): the value of the optimal A3 action for each combination of A2 and A4 (attack/defend), e.g. the values 5, 6, 8, 12 in the slide's table
Limited communication for optimal action choice
Communication bandwidth = tree-width of the coordination graph
[Figure: coordination graph with nodes A1, A2, A3, A4]
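The sketch below shows the data structure behind this: each local Qi, already conditioned on the current state, is a small table over the actions of a few agents, and the joint action is the argmax of their sum. For brevity it uses brute-force maximization over the four agents; the variable-elimination routine sketched earlier performs the same maximization with cost exponential only in the tree-width of the coordination graph. All table values are illustrative.

```python
# Minimal sketch of coordination-graph action selection with local Q tables
# already conditioned on the current state.  Table values are illustrative.
from itertools import product

ACTIONS = ("attack", "defend")

# Local Q_i(A_i, A_j) tables after the current state has been instantiated.
local_q = {
    ("A1", "A4"): {(a, b): 1.0 * (a == "attack") for a, b in product(ACTIONS, repeat=2)},
    ("A1", "A2"): {(a, b): 2.0 * (a == b) for a, b in product(ACTIONS, repeat=2)},
    ("A2", "A3"): {(a, b): 3.0 * (b == "attack") for a, b in product(ACTIONS, repeat=2)},
    ("A3", "A4"): {(a, b): 1.0 * (a != b) for a, b in product(ACTIONS, repeat=2)},
}

def best_joint_action(q_tables, agents):
    """Brute-force maximization, shown only to make the structure explicit;
    variable elimination over the agents does the same job with work
    exponential only in the tree-width of the coordination graph."""
    best, best_val = None, float("-inf")
    for joint in product(ACTIONS, repeat=len(agents)):
        assignment = dict(zip(agents, joint))
        val = sum(table[(assignment[i], assignment[j])]
                  for (i, j), table in q_tables.items())
        if val > best_val:
            best, best_val = assignment, val
    return best, best_val

print(best_joint_action(local_q, ("A1", "A2", "A3", "A4")))
```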
Coordination Graph Example
[Figure: coordination graph over agents A1,…,A11]
Trees don't increase communication requirements
Cycles require graph triangulation
Unified View: Function Approximation ⇔ Multiagent Coordination
Pairwise Q functions, e.g. Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4), induce a connected coordination graph
Per-agent Q functions, e.g. Q1(A1,X1) + Q2(A2,X2) + Q3(A3,X3) + Q4(A4,X4), require no coordination
The factored MDP and value function representations induce the communication and coordination structure
Tradeoff: communication / accuracy
How good are the policies? SysAdmin problem Power grid problem [Schneider et al. ‘99]
SysAdmin Ring - Quality of Policies
[Plot: policy quality vs. problem size; curves: utopic maximum value, exact solution, constraint sampling (single basis), constraint sampling (pair basis), factored LP (single basis)]
Power Grid – Factored Multiagent [Guestrin, Lagoudakis, Parr '02]
[Plot: cost on the power grid problem; lower is better!]
Summary of Algorithm 1.Pick local basis functions h i 2.Factored LP computes value function 3.Coordination graph computes argmax a of Q
Planning Complex Environments
When faced with a complex problem, exploit structure: for planning and for action selection
Given a new problem, we must replan from scratch: it is a different MDP, a new planning problem
Huge problems remain intractable, even with the factored LP
Generalizing to New Problems
Solve Problem 1, Problem 2, …, Problem n → good solution to Problem n+1
The MDPs are different! Different sets of states, actions, rewards, transitions, …
But many problems are "similar"
Generalization with Relational MDPs [Guestrin, Koller, Gearhart, Kanodia '03]
Avoid the need to replan; tackle larger problems
"Similar" domains have similar "types" of objects
Exploit similarities by computing generalizable value functions
[Diagram: Relational MDP → Generalization]
Relational Models and MDPs
Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
Relations: Collects, Builds, Trains, Attacks, …
Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Relational MDPs
Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
Class-level reward function
[Figure: class-level DBN for the Peasant class, with P' depending on P, A_P, and the Gold G it Collects]
Very compact representation! Does not depend on the # of objects
Tactical Freecraft: Relational Schema
[Figure: schema with a Footman class (Health, A_Footman, my_enemy link) and an Enemy class (Health, Count of attackers), plus reward R]
Enemy's health depends on the # of footmen attacking
Footman's health depends on its Enemy's health
World is a Large Factored MDP
Instantiation (world): # instances of each class, links between instances
Relational MDP + # of objects + links between objects → well-defined factored MDP
World with 2 Footmen and 2 Enemies
[Figure: ground DBN with variables F1.Health, F1.A, E1.Health, F2.Health, F2.A, E2.Health, their next-time-step copies F1.H', E1.H', F2.H', E2.H', and rewards R1, R2]
World is a Large Factored MDP
Instantiate the world → well-defined factored MDP → use the factored LP for planning
But if we replan in every world this way, we have gained nothing!
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H)
At state x, each footman has a different contribution to V
Units are interchangeable! V_F1 ≈ V_F2 ≈ V_F and V_E1 ≈ V_E2 ≈ V_E
Given the class-level value functions V_C, we can instantiate a value function for any world
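A minimal sketch of how a class-level value function is instantiated in a world: one shared function per class, evaluated once per object and summed. The health values, weights, and my_enemy links are illustrative.

```python
# Minimal sketch of instantiating class-level value functions in a new world:
# one local value function per class, summed over all objects of that class.

# Class-level value functions V_C, shared by all objects of the class.
def V_footman(footman_health, enemy_health):
    return 2.0 * footman_health - 1.0 * enemy_health

def V_enemy(enemy_health):
    return -3.0 * enemy_health

def world_value(footmen, enemies, links):
    """V(x) = sum over objects of the class-level value of that object.

    footmen, enemies -- dicts mapping object name to health
    links            -- maps each footman to its enemy (the my_enemy relation)
    """
    total = 0.0
    for f, h in footmen.items():
        total += V_footman(h, enemies[links[f]])
    for e, h in enemies.items():
        total += V_enemy(h)
    return total

# Works for any number of footmen/enemies -- e.g. a 2-vs-2 world:
print(world_value(footmen={"F1": 3, "F2": 2},
                  enemies={"E1": 1, "E2": 4},
                  links={"F1": "E1", "F2": "E2"}))
```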
Computing Class-level V_C
minimize: Σ_x Σ_C Σ_{o ∈ O[C]} V_C(x_o)
subject to: Σ_C Σ_{o ∈ O[C]} V_C(x_o) ≥ Σ_C Σ_{o ∈ O[C]} Q_C(x_o, a_o), ∀ worlds, ∀ x, a
Constraints for each world represented by the factored LP
Number of worlds is exponential or infinite
Sampling Worlds
Many worlds are similar
Sample a set I of worlds and enforce the constraints only for ω ∈ I (and all x, a) rather than for all worlds
Theorem
Exponentially (infinitely) many worlds → do we need exponentially many samples? NO!
With a number of samples that does not depend on the number of worlds, the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1−δ
R_max is the maximum class reward
Proof method related to [de Farias, Van Roy '02]
Learning Classes of Objects
Plan for sampled worlds separately
Find regularities between worlds: objects with similar value functions belong to the same class
Used decision tree regression in the experiments
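A simplified sketch of the idea (the actual experiments used decision-tree regression): after planning for the sampled worlds, objects whose local value functions are close are grouped into the same class. Object names, value vectors, and the tolerance are illustrative.

```python
# Minimal sketch of grouping objects into classes by the similarity of their
# local value functions from per-world solutions.  All values illustrative.
import numpy as np

# local_values[obj] = the object's local value function, tabulated over its
# (small) local state space, taken from the per-world solutions.
local_values = {
    "server": np.array([10.0, 2.0, 0.0]),
    "node_a": np.array([4.1, 1.0, 0.0]),
    "node_b": np.array([3.9, 1.1, 0.0]),
    "leaf_a": np.array([1.0, 0.2, 0.0]),
    "leaf_b": np.array([1.1, 0.3, 0.0]),
}

def group_objects(values, tol=1.0):
    """Greedily assign objects with nearby value vectors to the same class."""
    classes = []   # list of (representative vector, [object names])
    for name, vec in values.items():
        for rep, members in classes:
            if np.linalg.norm(vec - rep) <= tol:
                members.append(name)
                break
        else:
            classes.append((vec, [name]))
    return [members for _, members in classes]

print(group_objects(local_values))
# e.g. [['server'], ['node_a', 'node_b'], ['leaf_a', 'leaf_b']]
```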
Summary of Algorithm 1.Model domain as Relational MDPs 2.Sample set of worlds 3.Factored LP computes class-level value function for sampled worlds 4.Reuse class-level value function in new world 5.Coordination graph computes argmax a of Q
Experimental Results SysAdmin problem
Generalizing to New Problems
Learning Classes of Objects
Classes of Objects Discovered
Learned 3 classes: Server, Intermediate, Leaf
Strategic World: 2 Peasants, 2 Footmen, 1 Enemy, Gold, Wood, Barracks Reward for dead enemy About 1 million state/action pairs Algorithm: Solve with Factored LP Coordination graph for action selection
Strategic World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks Reward for dead enemy About 3 trillion state/action pairs Algorithm: Solve with factored LP Coordination graph for action selection grows exponentially in # agents
Strategic World: 9 Peasants, 3 Footmen, 1 Enemy, Gold, Wood, Barracks Reward for dead enemy About 3 trillion state/action pairs Algorithm: Use generalized class-based value function Coordination graph for action selection instantiated Q-functions grow polynomially in # agents
Tactical
Planned in 3 Footmen versus 3 Enemies
Generalized to 4 Footmen versus 4 Enemies
[Figure: screenshots of the 3 vs. 3 and 4 vs. 4 scenarios]
Contributions Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01] Multiagent action selection [Guestrin, Koller, Parr ’02] Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03] Variable coordination structure [Guestrin, Venkataraman, Koller ’02] Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02] Hierarchical decomposition [Guestrin, Gordon ’02]
Open Issues High tree-width problems Basis function selection Variable relational structure Partial observability
Daphne Koller
Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy
Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman
DAGS members, Kristina and friends, my family
Conclusions
Formal framework for multiagent planning that scales to very large problems (very large state and action spaces)
Exploit structure for the complex multiagent planning task:
In the planning problem – factored LP
In action selection – coordination graph
Between problems – generalization
Multiagent Policy Quality
Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. '99]
[Plot: policy quality; curves: LP single basis, LP pair basis, distributed reward, distributed value]
Comparing to Apricodd [Boutilier et al.] Apricodd: Exploits context-specific independence (CSI) Factored LP: Exploits CSI and linear independence
Apricodd Comparison
[Plots: results on the ring and star topologies]