Multiagent Planning with Factored MDPs. Carlos Guestrin, Stanford University.

Collaborative Multiagent Planning. Example domains: search and rescue, factory management, supply chains, firefighting, network routing, air traffic control. Common features: long-term goals, multiple agents, coordinated decisions.

Exploiting Structure. Real-world problems have hundreds of objects and googols of states, but real-world problems have structure! Approach: exploit a structured representation to obtain an efficient approximate solution.

Real-time Strategy Game. Peasants collect resources and build; footmen attack enemies; buildings train peasants and footmen.

Joint Decision Space: Markov Decision Process (MDP) representation. State space: joint state x of the entire system. Action space: joint action a = {a_1, …, a_n} for all agents. Reward function: total reward R(x,a). Transition model: dynamics of the entire system, P(x'|x,a).

Policy. A policy π(x) = a assigns, at state x, an action for all agents. Examples: π(x_0) = both peasants get wood; π(x_1) = one peasant gets gold, the other builds the barracks; π(x_2) = peasants get gold, footmen attack.

Value of Policy. The value V^π(x) is the expected long-term reward starting from x: V^π(x_0) = E[R(x_0) + γ R(x_1) + γ^2 R(x_2) + γ^3 R(x_3) + γ^4 R(x_4) + …], with future rewards discounted by γ ∈ [0,1).
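
As a concrete illustration, this expected discounted return can be estimated by sampling rollouts. A minimal sketch in Python, assuming the caller supplies hypothetical `policy`, `step`, and `reward` functions for the domain (these names are not from the talk):

```python
def estimate_value(x0, policy, step, reward, gamma=0.95,
                   n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(x0) = E[sum_t gamma^t R(x_t, a_t)].

    policy(x) returns a joint action, step(x, a) samples x' ~ P(x'|x, a),
    and reward(x, a) returns R(x, a); all three are assumed inputs.
    """
    total = 0.0
    for _ in range(n_rollouts):
        x, discount, ret = x0, 1.0, 0.0
        for _ in range(horizon):        # truncate the infinite horizon
            a = policy(x)
            ret += discount * reward(x, a)
            x = step(x, a)
            discount *= gamma
        total += ret
    return total / n_rollouts
```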

Optimal Long-term Plan. The optimal policy π*(x) and optimal value function V*(x) satisfy the Bellman equations: Q*(x,a) = R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x'), V*(x) = max_a Q*(x,a), and the optimal policy is π*(x) = argmax_a Q*(x,a).

Solving an MDP. Many algorithms solve the Bellman equations to obtain the optimal value V*(x) and optimal policy π*(x): policy iteration [Howard ’60, Bellman ‘57], value iteration [Bellman ‘57], linear programming [Manne ’60], …
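
For instance, value iteration repeatedly applies the Bellman backup until convergence. A tabular sketch for a small flat MDP; the enumerated `states`, `actions`, reward table `R`, and transition table `P` are assumed inputs for illustration, not artifacts of the talk:

```python
def value_iteration(states, actions, R, P, gamma=0.95, tol=1e-6):
    """Tabular value iteration: V(x) <- max_a [R(x,a) + gamma * sum_x' P(x'|x,a) V(x')].

    R[x][a] is the reward and P[x][a] is a dict {x': prob}; both explicitly
    enumerate the (flat) joint state and action spaces.
    """
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            q = [R[x][a] + gamma * sum(p * V[y] for y, p in P[x][a].items())
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[x]))
            V[x] = best
        if delta < tol:
            break
    # Greedy policy extraction: pi*(x) = argmax_a Q(x, a)
    pi = {x: max(actions,
                 key=lambda a: R[x][a] + gamma * sum(p * V[y]
                                                     for y, p in P[x][a].items()))
          for x in states}
    return V, pi
```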

LP Solution to MDP. The value can be computed by linear programming [Manne ’60]:
minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x,a), for all x, a
where Q(x,a) = R(x,a) + γ Σ_{x'} P(x'|x,a) V(x'). One variable V(x) for each state and one constraint for each state x and action a; polynomial-time solution.
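
A sketch of this LP with scipy.optimize.linprog, under the same assumed flat representation as above; it minimizes Σ_x V(x) subject to V(x) ≥ R(x,a) + γ Σ_x' P(x'|x,a) V(x') for every state-action pair:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(states, actions, R, P, gamma=0.95):
    """Exact LP solution of a flat MDP.

    Variables: one V(x) per state.  Constraints: for every (x, a),
    V(x) - gamma * sum_x' P(x'|x,a) V(x') >= R(x, a).
    linprog expects A_ub @ v <= b_ub, so both sides are negated.
    """
    idx = {x: i for i, x in enumerate(states)}
    n = len(states)
    A_ub, b_ub = [], []
    for x in states:
        for a in actions:
            row = np.zeros(n)
            row[idx[x]] -= 1.0
            for y, p in P[x][a].items():
                row[idx[y]] += gamma * p
            A_ub.append(row)          # -V(x) + gamma * sum_x' P V(x') <= -R(x,a)
            b_ub.append(-R[x][a])
    res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    return {x: res.x[idx[x]] for x in states}
```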

Planning under Bellman’s “Curse”. Planning is polynomial in the number of states and actions, but the number of states is exponential in the number of variables, and the number of actions is exponential in the number of agents. Efficient approximation by exploiting structure!

Structure in Representation: Factored MDP. [Dynamic Bayesian network over time slices t and t+1: state variables Peasant, Footman, Enemy, Gold; decision variables A_Peasant, A_Build, A_Footman; reward R; e.g., P(F'|F, G, A_B, A_F).] Complexity of the representation: exponential in the number of parents (worst case) [Boutilier et al. ’95].

Structured Value Function? Does a factored MDP imply structure in V*? Almost! A structured V yields a good approximate value function.

Structured Value Functions. Use a linear combination of restricted-domain basis functions [Bellman et al. ‘63] [Tsitsiklis & Van Roy ’96] [Koller & Parr ’99,’00] [Guestrin et al. ’01]: Ṽ(x) = Σ_i w_i h_i(x). Each h_i is the status of a small part of a complex system, e.g., the state of a footman and its enemy, the status of the barracks, or both. A structured V implies a structured Q: Q̃ = Σ_i Q_i, where each Q_i involves only a small number of A_i’s and X_j’s. We must find weights w giving a good approximate value function.
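
A minimal sketch of this linear architecture; the particular basis functions below (footman/enemy status, barracks status) and the state encoding are illustrative placeholders, and the weights w would come from the LP described next:

```python
# Each basis function depends only on a small subset of state variables.
# A state x is a dict mapping variable names to values (illustrative encoding).

def h_footman_enemy(x):          # status of one footman and its enemy
    return 1.0 if x["footman1_health"] > 0 and x["enemy1_health"] > 0 else 0.0

def h_barracks(x):               # status of the barracks
    return 1.0 if x["barracks_built"] else 0.0

basis = [lambda x: 1.0, h_footman_enemy, h_barracks]   # h_0 is a constant bias

def approx_value(x, w):
    """V~(x) = sum_i w_i * h_i(x); the weights w come from the factored LP."""
    return sum(w_i * h_i(x) for w_i, h_i in zip(w, basis))
```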

Approximate LP Solution [Schweitzer and Seidmann ‘85]:
minimize: Σ_x Σ_i w_i h_i(x)
subject to: Σ_i w_i h_i(x) ≥ Q(x,a), for all x, a
with Q(x,a) = Σ_i Q_i(x,a). One variable w_i for each basis function, so only a polynomial number of LP variables; but one constraint for every state and action, so exponentially many LP constraints.

Representing Exponentially Many Constraints [Guestrin, Koller, Parr ’01]. The exponentially many linear constraints are equivalent to one nonlinear constraint: 0 ≥ max_{x,a} [Q(x,a) − Σ_i w_i h_i(x)], which requires a maximization over an exponential space.

Variable Elimination. Use variable elimination [Bertele & Brioschi ‘72] to maximize over the state space:
max_{A,B,C,D} f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D)
= max_{A,B,C} f_1(A,B) + f_2(A,C) + max_D [f_3(C,D) + f_4(B,D)]
= max_{A,B,C} f_1(A,B) + f_2(A,C) + g_1(B,C)
Here we need only 23 instead of 63 sum operations. Because each term of the structured value function depends only on a small number of A_i’s and X_j’s, this applies directly. The maximization is exponential only in the largest factor; tree-width characterizes the complexity, a graph-theoretic measure of “connectedness” that arises in many settings: integer programming, Bayes nets, computational geometry, …
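
A generic sketch of this maximization: factors are small tables over a few variables, and variables are maximized out one at a time, so the cost is exponential only in the size of the largest intermediate factor. The table-based factor encoding and the caller-supplied elimination order are assumptions of this sketch:

```python
from itertools import product

def eliminate_max(factors, domains, order):
    """Compute max over all variables of sum_i f_i by variable elimination.

    Each factor is a (scope, table) pair: scope is a tuple of variable names,
    table maps a tuple of values (in scope order) to a number.  The cost is
    exponential only in the size of the largest intermediate factor, which is
    bounded by the tree-width of the interaction graph plus one.
    """
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        new_scope = tuple(sorted({v for s, _ in touching for v in s} - {var}))
        new_table = {}
        for assign in product(*(domains[v] for v in new_scope)):
            ctx = dict(zip(new_scope, assign))
            best = float("-inf")
            for val in domains[var]:           # maximize out `var`
                full = {**ctx, var: val}
                score = sum(t[tuple(full[v] for v in s)] for s, t in touching)
                best = max(best, score)
            new_table[assign] = best
        factors = rest + [(new_scope, new_table)]
    return sum(t[()] for _, t in factors)      # all scopes are now empty
```

For the slide's f_1, …, f_4 example over (A, B, C, D), eliminating D first produces the g_1(B, C) table, and the remaining variables can then be eliminated in any order.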

Representing the Constraints Use Variable Elimination to represent constraints: Number of constraints exponentially smaller!

Understanding Scaling Properties. Explicit LP: the number of variables and constraints grows as 2^n. Factored LP: grows as (n+1−k)·2^k, where k is the tree-width. [Plot compares the explicit LP with the factored LP for k = 3, 5, 8, 10, 12.]

Network Management Problem. Topologies: ring, star, ring of rings, k-grid. Each computer runs processes and has status ∈ {good, dead, faulty}; dead neighbors increase the probability of dying; reward for successful processes. Each SysAdmin takes a local action ∈ {reboot, not reboot}. A problem with n machines has 9^n states and 2^n actions.

Running Time. [Plot: running time vs. problem size for Ring with exact solution; Ring, Star, and 3-grid with single basis (k = 4, 4, 5); and Star and Ring with pair basis (k = 4, 8); k denotes tree-width.]

Summary of Algorithm: 1. Pick local basis functions h_i. 2. Factored LP computes the value function. 3. The policy is the argmax_a of Q.

Large-scale Multiagent Coordination. The efficient algorithm computes V, and the action at state x is argmax_a Q(x,a). Problems: the number of actions is exponential, and this requires complete observability and full communication.

Distributed Q Function [Guestrin, Koller, Parr ’02]: Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4). Each agent maintains a part of the Q function.

Multiagent Action Selection: start from the distributed Q function Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4), instantiate the current state x, then find the maximal action argmax_a.

Instantiate Current State x. Conditioning each Q_i on the observed state variables yields Q_1(A_1,A_4), Q_2(A_1,A_2), Q_3(A_2,A_3), Q_4(A_3,A_4). Limited observability: agent i only observes the variables in Q_i (e.g., agent 2 observes only X_1 and X_2).
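
A small sketch of this instantiation step, using the same (scope, table) factor encoding as the elimination sketch above; the variable names and observation are illustrative:

```python
def instantiate(q_factor, observation):
    """Condition a local Q_i factor on the state variables agent i observes.

    q_factor is a (scope, table) pair; scope mixes action and state variable
    names.  Returns a new factor whose scope contains only the remaining
    (action) variables.
    """
    scope, table = q_factor
    act_scope = tuple(v for v in scope if v not in observation)
    new_table = {}
    for values, q in table.items():
        full = dict(zip(scope, values))
        if all(full[v] == val for v, val in observation.items() if v in scope):
            new_table[tuple(full[v] for v in act_scope)] = q
    return act_scope, new_table
```

For example, conditioning a factor over ("A1", "A2", "X1", "X2") on an observation of X1 and X2 leaves a factor over (A1, A2) only.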

Multiagent Action Selection (continued): with the instantiated Q_1(A_1,A_4), Q_2(A_1,A_2), Q_3(A_2,A_3), Q_4(A_3,A_4), compute the maximal joint action argmax_a.

Coordination Graph. Use variable elimination for the maximization over joint actions:
max_{A_1,A_2,A_3,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + max_{A_3} [Q_3(A_2,A_3) + Q_4(A_3,A_4)]
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)
where g_1(A_2,A_4) tabulates the value of the optimal A_3 action for each Attack/Defend combination of A_2 and A_4 (the values 5, 6, 8, 12 in the slide's table). Limited communication suffices for the optimal action choice; the communication bandwidth equals the tree-width of the coordination graph.
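
A sketch of action selection on the coordination graph: the same elimination as before, now over action variables, plus a backward pass that recovers the maximizing joint action. The factor encoding and caller-supplied elimination order are again assumptions of the sketch:

```python
from itertools import product

def coordinate(q_factors, domains, order):
    """argmax over joint actions of sum_i Q_i via variable elimination.

    q_factors use the same (scope, table) encoding as eliminate_max above.
    The forward pass eliminates agents in `order`, recording each agent's
    best response as a function of its remaining neighbors; the backward
    pass reads off the maximizing joint action.
    """
    factors = list(q_factors)
    best_response = []                          # (agent, scope, argmax table)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        scope = tuple(sorted({v for s, _ in touching for v in s} - {var}))
        max_table, arg_table = {}, {}
        for assign in product(*(domains[v] for v in scope)):
            ctx = dict(zip(scope, assign))
            best_val, best_score = None, float("-inf")
            for val in domains[var]:
                full = {**ctx, var: val}
                score = sum(t[tuple(full[v] for v in s)] for s, t in touching)
                if score > best_score:
                    best_val, best_score = val, score
            max_table[assign], arg_table[assign] = best_score, best_val
        factors = rest + [(scope, max_table)]
        best_response.append((var, scope, arg_table))
    # Backward pass: agents eliminated later are assigned first.
    action = {}
    for var, scope, arg_table in reversed(best_response):
        action[var] = arg_table[tuple(action[v] for v in scope)]
    return action
```

With the four instantiated Q_i factors and A_3 eliminated first, the intermediate max/argmax tables correspond to the "value of the optimal A_3 action" table over (A_2, A_4) from the slide.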

Coordination Graph Example. [Graph over agents A_1 through A_11.] Trees don’t increase communication requirements; cycles require graph triangulation.

Unified View: Function Approximation ↔ Multiagent Coordination. Pairwise basis: Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4). Single-agent basis: Q_1(A_1,X_1) + Q_2(A_2,X_2) + Q_3(A_3,X_3) + Q_4(A_4,X_4). The factored MDP and value function representations induce the communication and coordination structure: a tradeoff between communication and accuracy.

How good are the policies? SysAdmin problem Power grid problem [Schneider et al. ‘99]

SysAdmin Ring - Quality of Policies. [Plot comparing: utopic maximum value, exact solution, constraint sampling with single basis, constraint sampling with pair basis, and factored LP with single basis.]

Power Grid – Factored Multiagent Lower is better! [Guestrin, Lagoudakis, Parr ‘02]

Summary of Algorithm: 1. Pick local basis functions h_i. 2. Factored LP computes the value function. 3. The coordination graph computes the argmax_a of Q.

Planning Complex Environments. When faced with a complex problem, exploit structure: for planning and for action selection. Given a new problem, must we replan from scratch? A different MDP means a new planning problem, and huge problems remain intractable even with the factored LP.

Generalizing to New Problems. Solve Problems 1, 2, …, n, and obtain a good solution to Problem n+1. The MDPs are different (different sets of states, actions, rewards, transitions, …), but many problems are “similar”.

Generalization with Relational MDPs [Guestrin, Koller, Gearhart, Kanodia ’03]. “Similar” domains have similar “types” of objects; exploit these similarities by computing generalizable value functions. A relational MDP enables generalization, which avoids the need to replan and tackles larger problems.

Relational Models and MDPs. Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, … Relations: Collects, Builds, Trains, Attacks, … Instances: Peasant1, Peasant2, Footman1, Enemy1, …

Relational MDPs. Class-level transition probabilities depend on attributes, actions, and the attributes of related objects, together with a class-level reward function. [Schema: Peasant with attribute P and action A_P, linked to Gold G via Collects.] This is a very compact representation that does not depend on the number of objects.

Tactical Freecraft: Relational Schema. [Schema: Footman with attributes Health and A_Footman, linked via my_enemy to an Enemy with attributes Health and Count; reward R.] The enemy’s health depends on the number of footmen attacking; a footman’s health depends on its enemy’s health.

World is a Large Factored MDP. An instantiation (world) specifies the number of instances of each class and the links between instances. A relational MDP together with this information yields a well-defined factored MDP.

World with 2 Footmen and 2 Enemies. [Instantiated DBN over F1.Health, F1.A, E1.Health, F2.Health, F2.A, E2.Health, their next-time-step copies, and rewards R_1, R_2 for the Footman1/Enemy1 and Footman2/Enemy2 pairs.]

World is a Large Factored MDP. Instantiate the world to obtain a well-defined factored MDP and use the factored LP for planning; but by itself, we have gained nothing!

Class-level Value Functions. V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H). Units are interchangeable: V_F1 ≈ V_F2 ≈ V_F and V_E1 ≈ V_E2 ≈ V_E. At a given state x, each footman still makes a different contribution to V. Given the class-level functions V_C, we can instantiate the value function for any world.
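
A sketch of instantiating a class-level value function in a given world: one function per class, summed over that class's objects with the objects' (and linked objects') attributes plugged in. The class functions, object names, and numbers below are illustrative placeholders, not values from the talk:

```python
# Class-level value functions, shared by all objects of a class (illustrative).
def V_footman(f_health, linked_enemy_health):
    return 2.0 * f_health - 1.0 * linked_enemy_health

def V_enemy(e_health):
    return -3.0 * e_health

def world_value(x, footmen, enemies, links):
    """V(x) = sum over footmen of V_F(...) + sum over enemies of V_E(...).

    x maps attribute names like "F1.health" to values, and links maps each
    footman to its enemy; units of the same class share one class function.
    """
    total = 0.0
    for f in footmen:
        total += V_footman(x[f + ".health"], x[links[f] + ".health"])
    for e in enemies:
        total += V_enemy(x[e + ".health"])
    return total
```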

Computing Class-level V_C:
minimize: Σ_x V(x)
subject to: V(x) ≥ Q(x,a), for all x, a
where V(x) = Σ_C Σ_{o ∈ O[C]} V_C(x[o]) and Q(x,a) = Σ_C Σ_{o ∈ O[C]} Q_C(x[o], a[o]). The constraints for each world are represented by the factored LP, but the number of worlds is exponential or infinite.

Sampling Worlds. Many worlds are similar, so sample a set I of worlds: replace the constraints “for all worlds ω and all x, a” with “for all ω ∈ I and all x, a”.

Theorem. With exponentially (infinitely) many worlds, do we need exponentially many samples? No: the number of samples needed does not grow with the number of worlds. With that many samples, the value function is within ε of the class-level solution optimized for all worlds, with probability at least 1 − δ, where R_max is the maximum class reward. The proof method is related to [de Farias, Van Roy ‘02].

Learning Classes of Objects. Plan for the sampled worlds separately and find regularities between worlds: objects with similar values belong to the same class. Decision-tree regression was used in the experiments.

Summary of Algorithm: 1. Model the domain as a relational MDP. 2. Sample a set of worlds. 3. Factored LP computes a class-level value function for the sampled worlds. 4. Reuse the class-level value function in the new world. 5. The coordination graph computes the argmax_a of Q.

Experimental Results SysAdmin problem

Generalizing to New Problems

Learning Classes of Objects

Classes of Objects Discovered. Learned 3 classes: Server, Intermediate, Leaf.

Strategic. World: 2 peasants, 2 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 1 million state/action pairs. Algorithm: solve with the factored LP, with the coordination graph for action selection.

Strategic. World: 9 peasants, 3 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 3 trillion state/action pairs. Algorithm: solve with the factored LP, with the coordination graph for action selection; but this approach grows exponentially in the number of agents.

Strategic. World: 9 peasants, 3 footmen, 1 enemy, gold, wood, barracks; reward for a dead enemy; about 3 trillion state/action pairs. Algorithm: use the generalized class-based value function, with the coordination graph for action selection; the instantiated Q-functions grow only polynomially in the number of agents.

Tactical. Planned in 3 footmen versus 3 enemies; generalized to 4 footmen versus 4 enemies.

Contributions Efficient planning with LP decomposition [Guestrin, Koller, Parr ’01] Multiagent action selection [Guestrin, Koller, Parr ’02] Generalization to new environments [Guestrin, Koller, Gearhart, Kanodia ’03] Variable coordination structure [Guestrin, Venkataraman, Koller ’02] Multiagent reinforcement learning [Guestrin, Lagoudakis, Parr ’02] [Guestrin, Patrascu, Schuurmans ’02] Hierarchical decomposition [Guestrin, Gordon ’02]

Open Issues High tree-width problems Basis function selection Variable relational structure Partial observability

Acknowledgments. Daphne Koller. Committee: Leslie Kaelbling, Yoav Shoham, Claire Tomlin, Ben Van Roy. Co-authors: M.S. Apaydin, D. Brutlag, F. Cozman, C. Gearhart, G. Gordon, D. Hsu, N. Kanodia, D. Koller, E. Krotkov, M. Lagoudakis, J.C. Latombe, D. Ormoneit, R. Parr, R. Patrascu, D. Schuurmans, C. Varma, S. Venkataraman. DAGS members, Kristina and friends, and my family.

Conclusions. A formal framework for multiagent planning that scales to very large problems and very large state spaces. Exploit structure in the complex multiagent planning task: in the planning problem (factored LP), in action selection (coordination graph), and between problems (generalization).

Multiagent Policy Quality. Comparing to the Distributed Reward and Distributed Value Function algorithms [Schneider et al. ‘99]. [Plot legend: LP single basis, LP pair basis, distributed reward, distributed value.]

Comparing to Apricodd [Boutilier et al.] Apricodd: Exploits context-specific independence (CSI) Factored LP: Exploits CSI and linear independence

Apricodd. [Plots: results on Ring and Star topologies.]