Multi-Agent Planning in Complex Uncertain Environments Daphne Koller Stanford University Joint work with: Carlos Guestrin (CMU) Ronald Parr (Duke)

©2004 – Carlos Guestrin, Daphne Koller Collaborative Multiagent Planning. Applications: search and rescue, firefighting, factory management, multi-robot tasks (RoboSoccer), network routing, air traffic control, computer game playing. Common features: long-term goals, multiple agents, coordinated decisions.

©2004 – Carlos Guestrin, Daphne Koller Joint Planning Space. Joint action space: each agent i takes action a_i at each step; the joint action of all agents is a = (a_1, …, a_n). Joint state space: an assignment x_1, …, x_n to a set of variables X_1, …, X_n; the joint state of the entire system is x = (x_1, …, x_n). Joint system: payoffs and state dynamics depend on the joint state and joint action. Cooperative agents want to maximize the total payoff.

©2004 – Carlos Guestrin, Daphne Koller Exploiting Structure. Real-world problems have hundreds of objects and googols of states, but real-world problems also have structure! Approach: exploit the structured representation to obtain an efficient approximate solution.

©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

©2004 – Carlos Guestrin, Daphne Koller One-Shot Optimization Task. The Q-function Q(x,a) encodes the agents’ payoff for joint action a in joint state x. The agents’ task is to compute argmax_a Q(x,a). Obstacles: the number of joint actions is exponential, and the computation seems to require complete state observability and full agent communication.

©2004 – Carlos Guestrin, Daphne Koller Factored Payoff Function. Approximate the Q function as a sum of Q sub-functions, each depending on a local part of the system: two interacting agents, an agent and an important resource, or two inter-dependent pieces of machinery. [K. & Parr ’99,’00] [Guestrin, K., Parr ’01] Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4).
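A minimal Python sketch of this decomposition, assuming each local term is stored with an explicit scope and a lookup table; the names and numeric values are illustrative, not from the talk:

```python
# Factored Q-function: a list of (scope, table) local terms.  Each table is
# keyed by the values of exactly the variables in its scope, so evaluating
# Q(x, a) only ever reads a few variables per term.
def evaluate_q(factors, state, actions):
    """Q(x, a) as the sum of local Q_i terms."""
    assignment = {**state, **actions}
    return sum(table[tuple(assignment[v] for v in scope)]
               for scope, table in factors)

# Example local term Q_2(A_1, A_2, X_1, X_2) (only two entries shown):
q2 = (("A1", "A2", "X1", "X2"),
      {("fwd", "fwd", True, True): -100.0,
       ("fwd", "stay", True, True): 0.0})
```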

©2004 – Carlos Guestrin, Daphne Koller Distributed Q Function. Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4), with each Q sub-function assigned to a relevant agent. [Guestrin, K., Parr ’01]

©2004 – Carlos Guestrin, Daphne Koller Multiagent Action Selection. Starting from the distributed Q function Q_1(A_1,A_4,X_1,X_4) + Q_2(A_1,A_2,X_1,X_2) + Q_3(A_2,A_3,X_2,X_3) + Q_4(A_3,A_4,X_3,X_4): instantiate the current state x, then compute the maximal joint action argmax_a.

©2004 – Carlos Guestrin, Daphne Koller Instantiating State x. To instantiate Q_2(A_1,A_2,X_1,X_2), its agent observes only X_1 and X_2. Limited observability: agent i only observes the variables appearing in Q_i.

©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x. After instantiating the current state x, the local terms Q_1(A_1,A_4,X_1,X_4), …, Q_4(A_3,A_4,X_3,X_4) reduce to functions of the actions alone: Q_1(A_1,A_4), Q_2(A_1,A_2), Q_3(A_2,A_3), Q_4(A_3,A_4). The agents then compute the maximal joint action max_a.

©2004 – Carlos Guestrin, Daphne Koller Variable Elimination. Use variable elimination for the maximization: limited communication suffices for the optimal action choice, and the communication bandwidth equals the tree-width of the coordination graph.
max_{A_1,A_2,A_3,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + max_{A_3} [Q_3(A_2,A_3) + Q_4(A_3,A_4)]
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)
Value of the optimal A_3 action, g_1(A_2,A_4): (A_2 = Attack, A_4 = Attack) = 5; (Attack, Defend) = 6; (Defend, Attack) = 8; (Defend, Defend) = 12.
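This max-sum pass, followed by back-substitution to recover each agent's best action, can be written compactly. Below is a minimal Python sketch, not the authors' implementation; the factor format, the binary Attack/Defend action set, and the helper names (eliminate, select_joint_action) are illustrative assumptions.

```python
from itertools import product

ACTIONS = ("Attack", "Defend")   # binary action set from the slide's example

def eliminate(factors, agent):
    """Max out `agent`: combine the factors that mention it into a new factor
    over the remaining variables, remembering the best response for each case."""
    touching = [f for f in factors if agent in f["scope"]]
    rest = [f for f in factors if agent not in f["scope"]]
    new_scope = sorted({v for f in touching for v in f["scope"]} - {agent})
    new_table, best = {}, {}
    for ctx in product(ACTIONS, repeat=len(new_scope)):
        assignment = dict(zip(new_scope, ctx))
        scores = {}
        for a in ACTIONS:
            assignment[agent] = a
            scores[a] = sum(f["table"][tuple(assignment[v] for v in f["scope"])]
                            for f in touching)
        best_a = max(scores, key=scores.get)
        new_table[ctx], best[ctx] = scores[best_a], best_a
    rest.append({"scope": new_scope, "table": new_table})
    return rest, {"agent": agent, "scope": new_scope, "best": best}

def select_joint_action(factors, order):
    """Forward pass eliminates agents in `order`; the backward pass recovers
    the maximizing joint action by back-substitution."""
    tapes = []
    for agent in order:
        factors, tape = eliminate(factors, agent)
        tapes.append(tape)
    joint = {}
    for tape in reversed(tapes):
        ctx = tuple(joint[v] for v in tape["scope"])
        joint[tape["agent"]] = tape["best"][ctx]
    return joint
```

For the example above, factors would hold the four instantiated tables (e.g. a factor with scope ["A2", "A3"] for Q_3), and the order ["A3", "A4", "A2", "A1"] reproduces the elimination of A_3 first; the elimination order determines the intermediate functions, such as g_1, and hence the communication pattern.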

©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x.
max_{A_1,A_2,A_3,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4)
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + max_{A_3} [Q_3(A_2,A_3) + Q_4(A_3,A_4)]
= max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4)

©2004 – Carlos Guestrin, Daphne Koller Choosing Action at State x. Eliminating A_3: max_{A_3} [Q_3(A_2,A_3) + Q_4(A_3,A_4)] = g_1(A_2,A_4), leaving max_{A_1,A_2,A_4} Q_1(A_1,A_4) + Q_2(A_1,A_2) + g_1(A_2,A_4).

©2004 – Carlos Guestrin, Daphne Koller Coordination Graphs. Communication follows the triangulated graph, and computation grows exponentially in the tree width, a graph-theoretic measure of “connectedness” that also arises in BNs, CSPs, and elsewhere. The cost is exponential in the worst case but fairly low for many real graphs. [Figure: example coordination graph over agents A_1–A_11.]

©2004 – Carlos Guestrin, Daphne Koller Context-Specific Interactions. The payoff structure can vary by context: for example, agents A1 and A2 both trying to pass through the same narrow corridor X. We can use context-specific “value rules” such as <At(X,A1), At(X,A2), A1 = fwd ∧ A2 = fwd : -100>. Hope: context-specific payoffs will induce context-specific coordination.
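A small sketch of such a rule in Python; this is my own illustrative encoding mirroring the corridor example, and the variable names are assumptions:

```python
# A context-specific value rule: (context, value).  The rule contributes its
# value only when every variable in the context has the stated assignment.
def rule_value(rule, assignment):
    context, value = rule
    return value if all(assignment.get(var) == val
                        for var, val in context.items()) else 0.0

corridor_rule = ({"At(X,A1)": True, "At(X,A2)": True,
                  "A1": "fwd", "A2": "fwd"}, -100.0)

# Both agents push forward in the corridor -> the -100 penalty fires.
state_action = {"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}
penalty = rule_value(corridor_rule, state_action)   # -100.0
```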

©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination. Instantiate the current state: x = true. [Figure: coordination graph over agents A_1–A_6.]

©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination. [Figure: the same graph after instantiation.] The coordination structure varies based on the context.

©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination. Maximizing out A_1 with rule-based variable elimination [Zhang & Poole ’99]. The coordination structure varies based on communication.

©2004 – Carlos Guestrin, Daphne Koller Context-Specific Coordination. Eliminate A_1 from the graph with rule-based variable elimination [Zhang & Poole ’99]. The coordination structure varies based on agent decisions.

©2004 – Carlos Guestrin, Daphne Koller Robot Soccer. UvA Trilearn 2002 won the German Open 2002, but placed fourth in RoboCup. “… the improvements introduced in UvA Trilearn 2003 … include an extension of the intercept skill, improved passing behavior and especially the usage of coordination graphs to specify the coordination requirements between the different agents.” (Kok, Vlassis & Groen, University of Amsterdam)

©2004 – Carlos Guestrin, Daphne Koller RoboSoccer Value Rules. Coordination-graph rules include conditions on the player's role and on aspects of the global system state. Example rules for player i, in the role of passer, depend on the distance of player j to the goal after the move.

©2004 – Carlos Guestrin, Daphne Koller UvA Trilearn 2003 Results.
Round 1: Mainz Rolling Brains (Germany) 4-0; Iranians (Iran) 31-0; Sahand (Iran) 39-0; a4ty (Latvia) 25-0.
Round 2: Helios (Iran) 2-1; AT-Humboldt (Germany) 5-0; ZJUBase (China) 6-0; Aria (Iran) 6-0; Hana (Japan) 26-0.
Round 3: Zenit-NewERA (Russia) 4-0; RoboSina (Iran) 6-0; Wright Eagle (China) 3-1; Everest (China) 7-1; Aria (Iran) 5-0.
Semi-final: Brainstormers (Germany) 4-1.
Final: TsinghuAeolus (China).
UvA Trilearn won the German Open 2003, US Open 2003, RoboCup 2003, and German Open 2004.

©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

©2004 – Carlos Guestrin, Daphne Koller Real-time Strategy Game. Peasants collect resources and build; footmen attack enemies; buildings train peasants and footmen.

©2004 – Carlos Guestrin, Daphne Koller Planning Over Time. Markov Decision Process (MDP) representation: action space of joint agent actions a = (a_1,…,a_n); state space of joint state descriptions x = (x_1,…,x_n); momentary reward function R(x,a); probabilistic system dynamics P(x'|x,a).
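These ingredients can be bundled into a small Python type; this is an illustrative container of my own, not code from the talk, and it serves as the interface for the sketches below:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]    # joint state  x = (x_1, ..., x_n)
Action = Tuple[str, ...]   # joint action a = (a_1, ..., a_n)

@dataclass
class MDP:
    states: List[State]                                         # enumerable joint states
    actions: List[Action]                                       # enumerable joint actions
    reward: Callable[[State, Action], float]                    # R(x, a)
    transition: Callable[[State, Action], Dict[State, float]]   # P(x' | x, a)
    discount: float = 0.95                                      # gamma in [0, 1)
```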

©2004 – Carlos Guestrin, Daphne Koller Policy. A policy π(x) = a prescribes, at state x, an action for all agents. For example: π(x_0) = both peasants get wood; π(x_1) = one peasant gets gold, the other builds the barracks; π(x_2) = peasants get gold, footmen attack.

©2004 – Carlos Guestrin, Daphne Koller Value of Policy. The value V^π(x) is the expected long-term reward starting from x. Starting from x_0: V^π(x_0) = E[R(x_0) + γ R(x_1) + γ² R(x_2) + γ³ R(x_3) + γ⁴ R(x_4) + …], with future rewards discounted by γ ∈ [0,1).

©2004 – Carlos Guestrin, Daphne Koller Optimal Long-term Plan. The optimal policy π*(x) and the optimal Q-function Q*(x,a) are related by the Bellman equations; the optimal policy acts greedily with respect to Q*.
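The equations on the slide appear only as images; in their standard textbook form (stated here, not transcribed from the slide) they read:

```latex
\begin{align*}
  Q^*(x,a) &= R(x,a) + \gamma \sum_{x'} P(x' \mid x,a)\, V^*(x'), \\
  V^*(x)   &= \max_{a}\, Q^*(x,a),
  \qquad
  \pi^*(x) = \arg\max_{a}\, Q^*(x,a).
\end{align*}
```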

©2004 – Carlos Guestrin, Daphne Koller Solving an MDP. Many algorithms solve the Bellman equations and yield the optimal value V*(x) and optimal policy π*(x): policy iteration [Howard ’60, Bellman ’57], value iteration [Bellman ’57], linear programming [Manne ’60], and others.
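As one concrete example, here is a minimal value-iteration sketch for the illustrative MDP type introduced above; it is not the talk's planner, just the textbook Bellman backup applied until convergence:

```python
def value_iteration(mdp, tol=1e-6):
    """Repeatedly apply the Bellman backup until values change by < tol."""
    V = {x: 0.0 for x in mdp.states}
    while True:
        delta = 0.0
        for x in mdp.states:
            q_values = [
                mdp.reward(x, a) + mdp.discount *
                sum(p * V[x2] for x2, p in mdp.transition(x, a).items())
                for a in mdp.actions
            ]
            best = max(q_values)                 # V(x) <- max_a Q(x, a)
            delta = max(delta, abs(best - V[x]))
            V[x] = best
        if delta < tol:
            return V
```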

©2004 – Carlos Guestrin, Daphne Koller LP Solution to MDP. One variable V(x) for each state; one constraint for each state x and action a; a polynomial-time solution in the number of states and actions.
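The LP the slide refers to is the standard exact formulation, stated here from the literature rather than transcribed from the slide, with state-relevance weights α(x) > 0:

```latex
\begin{align*}
  \text{minimize:}\quad   & \sum_{x} \alpha(x)\, V(x) \\
  \text{subject to:}\quad & V(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x,a)\, V(x')
                            \qquad \forall\, x, a.
\end{align*}
```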

©2004 – Carlos Guestrin, Daphne Koller Are We Done? Planning is polynomial in the number of states and actions, but the number of states is exponential in the number of variables and the number of actions is exponential in the number of agents. The way out: efficient approximation by exploiting structure!

©2004 – Carlos Guestrin, Daphne Koller Structured Representation: Factored MDP. [Figure: dynamic Bayesian network over time slices t and t+1 with state variables Peasant, Footman, Enemy, Gold, decision variables A_Peasant, A_Build, A_Footman, and reward R; for example, P(F'|F,G,A_B,A_F).] The complexity of the representation is exponential in the number of parents (worst case). [Boutilier et al. ’95]

©2004 – Carlos Guestrin, Daphne Koller Structured Value Function? Does a factored MDP imply structure in V*? [Figure: DBN unrolled over time slices t through t+3 with variables X, Y, Z and reward R.] Unfortunately, a factored MDP does not guarantee structure in V*. But almost: a factored V often provides a good approximate value function.

©2004 – Carlos Guestrin, Daphne Koller Structured Value Functions [Bellman et al. ’63], [Tsitsiklis & Van Roy ’96], [K. & Parr ’99,’00]. Approximate V* as a factored value function V = Σ_i w_i h_i, where each basis function h_i concerns a small part of the system; in the rule-based case, h_i is a rule and w_i is the value associated with the rule. Goal: find w giving a good approximation V to V*. The resulting factored Q function Q = Σ_i Q_i can then be used with the coordination graph.

©2004 – Carlos Guestrin, Daphne Koller Approximate LP Solution. Substitute V(x) = Σ_i w_i h_i(x) and Q(x,a) = Σ_i Q_i(x,a) into the exact LP: minimize Σ_x α(x) Σ_i w_i h_i(x) subject to Σ_i w_i h_i(x) ≥ Σ_i Q_i(x,a) for all x, a. One variable w_i per basis function gives a polynomial number of LP variables, but one constraint for every state and action gives exponentially many LP constraints.

©2004 – Carlos Guestrin, Daphne Koller So What Now? Exponentially many linear constraints are equivalent to one nonlinear constraint: requiring Σ_i w_i h_i(x) ≥ Σ_i Q_i(x,a) for all x, a is the same as requiring 0 ≥ max_{x,a} Σ_i [Q_i(x,a) - w_i h_i(x)]. [Guestrin, K., Parr ’01]

©2004 – Carlos Guestrin, Daphne Koller Variable Elimination Revisited Use Variable Elimination to represent constraints: Exponentially fewer constraints [Guestrin, K., Parr ’01] Polynomial LP for finding good factored approximation to V*
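A sketch in LaTeX of how that single nonlinear constraint is turned back into a polynomial number of linear constraints, following the factored-LP construction described in the talk; the notation and the A_3 example are my simplified illustration, not the paper's full derivation:

```latex
% Simplified illustration of the factored-LP constraint construction.
\begin{align*}
  % single nonlinear constraint from the previous slide:
  0 \;&\ge\; \max_{x,a}\; \sum_i \bigl[\, Q_i(x,a) - w_i\, h_i(x) \,\bigr] \\
  % eliminating A_3 (as in the coordination-graph example) introduces new LP
  % variables u_{g_1}(a_2,a_4), one per assignment to the small scope {A_2,A_4},
  % together with the linear constraints
  u_{g_1}(a_2,a_4) \;&\ge\; Q_3(a_2,a_3) + Q_4(a_3,a_4)
  \qquad \forall\, a_2, a_3, a_4,
\end{align*}
% after which g_1 stands in for the eliminated terms in the outer max.  Each
% elimination step adds constraints exponential only in the scope of the new
% intermediate function, so the LP stays polynomial when the tree-width is small.
```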

©2004 – Carlos Guestrin, Daphne Koller Network Management Problem (topologies: ring, star, ring of rings, k-grid). Each computer runs processes and has status ∈ {good, faulty, dead}; dead neighbors increase the probability of dying; reward is received for successful processes; each SysAdmin takes a local action ∈ {reboot, not reboot}.
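A sketch of the local dynamics in Python; the status values and the reboot/no-reboot action come from the slide, while the numeric probabilities and function names are placeholders of mine, not the parameters used in the experiments:

```python
import random

STATUSES = ("good", "faulty", "dead")

def next_status(status, num_dead_neighbors, reboot,
                base_fail=0.05, neighbor_fail=0.1):
    """Sample one machine's next status given its neighbors and local action."""
    if reboot:
        return "good"                      # rebooting restores the machine
    p_degrade = min(1.0, base_fail + neighbor_fail * num_dead_neighbors)
    if random.random() < p_degrade:        # dead neighbors increase dying prob.
        return {"good": "faulty", "faulty": "dead", "dead": "dead"}[status]
    return status

def reward(statuses):
    """Reward for successful processes: here, one unit per working machine."""
    return sum(1.0 for s in statuses if s == "good")
```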

©2004 – Carlos Guestrin, Daphne Koller Scaling of Factored LP. [Plot: number of LP variables and constraints versus problem size n. The explicit LP grows as 2^n, while the factored LP grows as (n+1-k)·2^k, where k is the tree-width; curves are shown for k = 3, 5, 8, 10, 12.]

©2004 – Carlos Guestrin, Daphne Koller Multiagent Running Time. [Plot: running time versus problem size for the star topology with single and pair bases, and for the ring of rings topology.]

©2004 – Carlos Guestrin, Daphne Koller Strategic 2x2. Factored MDP model: 2 peasants, 2 footmen, enemy, gold, wood, barracks; ~1 million state/action pairs. Offline: the factored LP computes the value function Q. Online: given the world state x, the coordination graph computes argmax_a Q(x,a).

©2004 – Carlos Guestrin, Daphne Koller Demo: Strategic 2x2 Guestrin, Koller, Gearhart & Kanodia

©2004 – Carlos Guestrin, Daphne Koller Limited Interaction MDPs. Some MDPs have additional structure: agents are largely autonomous and interact in limited ways, e.g., competing for resources. Such an MDP can be decomposed into a set of agent-based MDPs M_1, M_2, … with a limited interface. [Figure: a two-agent DBN split into two agent-based MDPs sharing variables.] [Guestrin & Gordon, ’02]

©2004 – Carlos Guestrin, Daphne Koller Limited Interaction MDPs In such MDPs, our LP matrix is highly structured Can use Dantzig-Wolfe LP decomposition to solve LP optimally, in a decentralized way Gives rise to a market-like algorithm with multiple agents and a centralized “auctioneer” [Guestrin & Gordon, ’02]

©2004 – Carlos Guestrin, Daphne Koller Auction-style Planning. Each agent solves a local (stand-alone) MDP. Agents send ‘constraint messages’ to the auctioneer: they must agree on a “policy” for the shared variables. The auctioneer sends ‘pricing messages’ back to the agents; the pricing reflects penalties for constraint violations and influences the agents’ rewards in their MDPs. [Figure: the auctioneer sets pricing based on conflicts while the agents plan.] [Guestrin & Gordon, ’02]
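A highly simplified sketch of that message loop in Python; this is my own caricature of a price-coordination scheme, not the Dantzig-Wolfe-based algorithm of Guestrin & Gordon, and solve_local_mdp, the numeric shared-variable proposals, and the price update rule are all illustrative assumptions:

```python
def auction_planning(agents, shared_vars, solve_local_mdp, rounds=50, step=0.1):
    """Agents replan under prices on shared variables; the auctioneer raises
    prices where their proposals conflict, until they agree or rounds run out."""
    prices = {v: 0.0 for v in shared_vars}              # pricing messages
    for _ in range(rounds):
        # Each agent solves its stand-alone MDP with prices folded into its
        # reward and reports a numeric setting for every shared variable.
        proposals = {a: solve_local_mdp(a, prices) for a in agents}
        disagreement = {
            v: max(p[v] for p in proposals.values()) -
               min(p[v] for p in proposals.values())
            for v in shared_vars
        }
        if all(d == 0 for d in disagreement.values()):  # agents agree
            break
        for v in shared_vars:                           # penalize conflicts
            prices[v] += step * disagreement[v]
    return proposals, prices
```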

©2004 – Carlos Guestrin, Daphne Koller Fuel Allocation Problem (Bererton, Gordon, Thrun & Khosla). UAVs share a pot of fuel; targets have varying priority; target interference is ignored. [Figure: map with UAV start locations and targets.]

©2004 – Carlos Guestrin, Daphne Koller [Bererton, Gordon, Thrun, & Khosla, ’03] Fuel Allocation Problem

©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball Bererton, Gordon & Thrun

©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball. [Figure: maps for game variant 1 and game variant 2, showing coordination points and sensor placement; x = start location, + = goal location.]

©2004 – Carlos Guestrin, Daphne Koller High-Speed Robot Paintball Bererton, Gordon & Thrun

©2004 – Carlos Guestrin, Daphne Koller Outline Action Coordination Factored Value Functions Coordination Graphs Context-Specific Coordination Joint Planning Multi-Agent Markov Decision Processes Efficient Linear Programming Solution Decentralized Market-Based Solution Generalizing to New Environments Relational MDPs Generalizing Value Functions

©2004 – Carlos Guestrin, Daphne Koller Generalizing to New Problems. Solve Problem 1, Problem 2, …, Problem n, and obtain a good solution to Problem n+1. The MDPs are different, with different sets of states, actions, rewards, and transitions, yet many problems are “similar”.

©2004 – Carlos Guestrin, Daphne Koller Generalizing with Relational MDPs. Goals: avoid the need to replan and tackle larger problems. “Similar” domains have similar “types” of objects; exploit these similarities by computing generalizable value functions from a relational MDP.

©2004 – Carlos Guestrin, Daphne Koller Relational Models and MDPs. Classes: Peasant, Footman, Gold, Barracks, Enemy, … Relations: Collects, Builds, Trains, Attacks, … Instances: Peasant1, Peasant2, Footman1, Enemy1, … Builds on Probabilistic Relational Models [K. & Pfeffer ’98]. [Guestrin, K., Gearhart & Kanodia ’03]

©2004 – Carlos Guestrin, Daphne Koller Relational MDPs. A very compact representation that does not depend on the number of objects. [Figure: class-level DBN fragments for Footman (Health, H', A_Footman, my_enemy) and Enemy (Health, H', R, Count).] Class-level transition probabilities depend on attributes, actions, and attributes of related objects; the reward function is also class-level. [Guestrin, K., Gearhart & Kanodia ’03]
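A sketch of what a class-level template might look like in Python; the schema, attribute names, and dynamics below are my illustrative assumptions rather than the paper's representation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class RMDPClass:
    name: str
    attributes: List[str]                            # e.g. ["Health"]
    relations: List[str]                             # e.g. ["my_enemy"]
    transition: Callable[[Dict, Dict, str], Dict]    # (own attrs, related attrs, action) -> new attrs
    reward: Callable[[Dict], float] = lambda attrs: 0.0

def footman_transition(attrs, related, action):
    """Class-level dynamics: a footman's health depends on its enemy's health."""
    hit = related["my_enemy"]["Health"] > 0 and action != "retreat"
    return {"Health": max(0, attrs["Health"] - (1 if hit else 0))}

Footman = RMDPClass(
    name="Footman",
    attributes=["Health"],
    relations=["my_enemy"],
    transition=footman_transition,
)
```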

©2004 – Carlos Guestrin, Daphne Koller The World is a Large Factored MDP. An instantiation (world) specifies the number of instances of each class and the links between instances; together with the relational MDP, this yields a well-defined factored MDP.

©2004 – Carlos Guestrin, Daphne Koller MDP with 2 Footmen and 2 Enemies. [Figure: instantiated DBN with variables F1.Health, F1.A, F1.H', E1.Health, E1.H', F2.Health, F2.A, F2.H', E2.Health, E2.H', and rewards R1, R2 for the pairs Footman1/Enemy1 and Footman2/Enemy2.]

©2004 – Carlos Guestrin, Daphne Koller The World is a Large Factored MDP. Instantiate the world to obtain a well-defined factored MDP and use the factored LP for planning. So far, we have gained nothing!

©2004 – Carlos Guestrin, Daphne Koller Class-level Value Functions. V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H). Units are interchangeable: V_F1 and V_F2 are tied to a single class-level V_F, and V_E1 and V_E2 to a single V_E. At state x, each footman still makes a different contribution to V. Given the class-level weights w_C, we can instantiate the value function for any world.
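A small Python sketch of how a class-level value function instantiates for any world, summing one sub-function per object; the world layout and function shapes are assumptions of mine:

```python
def world_value(world, V_F, V_E):
    """V(x) = sum over footmen of V_F(f.H, f.enemy.H) + sum over enemies of V_E(e.H)."""
    total = 0.0
    for f in world["footmen"]:
        total += V_F(f["Health"], f["enemy"]["Health"])
    for e in world["enemies"]:
        total += V_E(e["Health"])
    return total

# The same V_F, V_E obtained from a 2-footman world can score a 3-footman world:
# world_value(three_footman_world, V_F, V_E)
```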

©2004 – Carlos Guestrin, Daphne Koller Factored LP-based Generalization. Sample a set of small worlds, run the class-level factored LP to obtain V_F and V_E, and generalize to larger worlds (e.g., from 2 footmen vs. 2 enemies to 3 vs. 3). How many samples are needed?

©2004 – Carlos Guestrin, Daphne Koller Sampling Complexity. Exponentially many worlds ⇒ do we need exponentially many samples? The number of objects in a world is unbounded ⇒ must we sample very large worlds? No!

©2004 – Carlos Guestrin, Daphne Koller Theorem. Sampling m small worlds, each with up to O(ln 1/δ) objects, yields a value function within O(ε) of the class-level value function optimized for all worlds, with probability at least 1-δ; the required sample size m depends on ε, δ, and the maximum class reward R_c^max.

©2004 – Carlos Guestrin, Daphne Koller Strategic 2x2. Relational MDP model: 2 peasants, 2 footmen, enemy, gold, wood, barracks; ~1 million state/action pairs. Offline: the factored LP computes the value function Q. Online: given the world state x, the coordination graph computes argmax_a Q(x,a).

©2004 – Carlos Guestrin, Daphne Koller Strategic 9x3. Relational MDP model: 9 peasants, 3 footmen, enemy, gold, wood, barracks; ~3 trillion state/action pairs, growing exponentially in the number of agents. Offline: the factored LP computes the value function Q. Online: given the world state x, the coordination graph computes argmax_a Q(x,a).

©2004 – Carlos Guestrin, Daphne Koller Strategic Generalization. Relational MDP model: 2 peasants, 2 footmen, enemy, gold, wood, barracks (~1 million state/action pairs). Offline: the factored LP computes the class-level value function w_C. Online: in a world with 9 peasants, 3 footmen, enemy, gold, wood, barracks (~3 trillion state/action pairs), the coordination graph computes argmax_a Q(x,a); the instantiated Q-functions grow only polynomially in the number of agents.

©2004 – Carlos Guestrin, Daphne Koller Demo: Generalized 9x3 Guestrin, Koller, Gearhart & Kanodia

©2004 – Carlos Guestrin, Daphne Koller Tactical Generalization. Planned in the 3 footmen versus 3 enemies scenario; generalized to 4 footmen versus 4 enemies.

©2004 – Carlos Guestrin, Daphne Koller Demo: Planned Tactical 3x3 Guestrin, Koller, Gearhart & Kanodia

©2004 – Carlos Guestrin, Daphne Koller Demo: Generalized Tactical 4x4 [Guestrin, K., Gearhart & Kanodia ‘03] Guestrin, Koller, Gearhart & Kanodia

©2004 – Carlos Guestrin, Daphne Koller Summary Structured Multi-Agent MDPs Effective planning under uncertainty Distributed coordinated action selection Generalization to new problems

©2004 – Carlos Guestrin, Daphne Koller Important Questions Continuous spaces Partial observability Complex actions Learning to act How far can we go??

Carlos Guestrin Ronald Parr Chris Gearhart Neal Kanodia Shobha Venkataraman Curt Bererton Geoff Gordon Sebastian Thrun Jelle Kok Matthijs Spaan Nikos Vlassis