Generalizing Plans to New Environments in Multiagent Relational MDPs Carlos Guestrin Daphne Koller Stanford University.

Presentation transcript:

Generalizing Plans to New Environments in Multiagent Relational MDPs Carlos Guestrin Daphne Koller Stanford University

Multiagent Coordination Examples: search and rescue, factory management, supply chains, firefighting, network routing, air traffic control. These settings involve multiple, simultaneous decisions, exponentially large spaces, limited observability, and limited communication.

Real-time Strategy Game: peasants collect resources and build; footmen attack enemies; buildings train peasants and footmen. (Slide shows labeled screenshots of a peasant, a footman, and a building.)

Scaling up by Generalization: exploit similarities between world elements and generalize plans from a set of worlds to a new, unseen world, avoiding the need to replan and allowing larger problems to be tackled. This requires formalizing the notion of "similar" elements and computing generalizable plans.

Relational Models and MDPs. Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy… Relations: Collects, Builds, Trains, Attacks… Instances: Peasant1, Peasant2, Footman1, Enemy1… Value functions are defined at the class level: objects of the same class make the same contribution to the value function. These are the factored-MDP equivalents of PRMs [Koller, Pfeffer '98].

Relational MDPs. Class-level transition probabilities depend on attributes, actions, and the attributes of related objects; there is also a class-level reward function. An instantiation (a world) specifies the number of objects and the relations among them, and yields a well-defined MDP. (Figure: class-level transition model linking the Peasant attributes P, P' and action A_P to the Gold attributes G, G' through the Collects relation.)
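
A hedged sketch of what the class-level transition model looks like; the generic form follows the bullet above, while the Peasant instantiation and its exact conditioning set are illustrative assumptions based on the figure, not taken from the paper:

```latex
% Generic class-level transition model: the next attribute of an object of class C
% depends on its own attributes, its action, and the attributes of related objects.
P_C\big(S'_C \;\big|\; S_C,\ A_C,\ S_{\mathrm{Rel}(C)}\big)
% Illustrative instantiation for the figure: the Peasant's next attribute depends on
% its current attribute, its action, and the Gold object it Collects from.
P_{\text{Peasant}}\big(P' \;\big|\; P,\ A_P,\ G\big)
```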

Planning in a World: long-term planning by solving the MDP. The number of states is exponential in the number of objects, and the number of actions is exponential as well. An RMDP world is a factored MDP, so efficient approximation is possible by exploiting structure!

Roadmap to Generalization: solve one world; compute a generalizable value function; tackle a new world.

World is a Factored MDP: state, dynamics, decisions, rewards. The transition model is factored, e.g. P(F' | F, G, H, A_F). (Figure: two-time-slice dynamic Bayesian network over state variables P, F, E, G, H with actions A_P, A_F, reward R, and next-state variables P', F', E', G'.)

Long-term Utility = Value of MDP. The value can be computed by linear programming: one variable V(x) for each state, and one constraint for each state x and action a. The number of states and actions is exponential! [Manne '60]
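
For concreteness, here is the standard exact-LP formulation this slide refers to (the state-relevance weights α(x) > 0 are arbitrary positive constants, and γ is the discount factor; the notation is mine):

```latex
% Exact LP formulation of MDP value determination [Manne '60]
% Variables: V(x) for every state x; \alpha(x) > 0 are state-relevance weights.
\begin{align*}
\min_{V}\quad & \sum_{x} \alpha(x)\, V(x) \\
\text{s.t.}\quad & V(x) \;\ge\; R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x')
  \qquad \forall\, x,\ \forall\, a
\end{align*}
```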

Approximate Value Functions: a linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00] [Guestrin et al. '01]. Each V_o depends on the state of one object and its related objects (e.g., the state of a footman, the status of the barracks). We must find V_o's that give a good approximate value function.
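
In symbols, a hedged sketch of this linear architecture (the weight notation w_o and the scope notation are mine; the LP searches over these weights):

```latex
% Linear value function approximation over per-object basis functions
% (illustrative notation; each V_o depends only on object o and its related objects)
V(\mathbf{x}) \;\approx\; \sum_{o} w_o \, V_o\big(\mathbf{x}[\,o,\ \mathrm{Rel}(o)\,]\big)
```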

Single LP Solution for Factored MDPs: one variable for each V_o, for each object, so there are polynomially many LP variables; but one constraint for every state and action ⇒ exponentially many LP constraints. Since V_o and Q_o depend on small sets of variables/actions, we can exploit structure as in variable elimination. [Guestrin, Koller, Parr '01] [Schweitzer and Seidmann '85]

Representing Exponentially Many Constraints: exponentially many linear constraints collapse into one nonlinear constraint.
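
Concretely, using the notation of the exact LP above, the collapse works like this (a standard identity rather than anything specific to this talk):

```latex
% The family of linear constraints
%   V(x) >= R(x,a) + gamma * sum_{x'} P(x'|x,a) V(x')   for all x and a
% holds if and only if the single nonlinear constraint below holds:
0 \;\ge\; \max_{x,\,a}\ \Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V(x') - V(x) \,\Big]
```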

Variable Elimination: we can use variable elimination to maximize over the state space [Bertele & Brioschi '72]. For functions f_1(A,B), f_2(A,C), f_3(C,D), f_4(B,D) over variables A, B, C, D:

max_{A,B,C,D} [ f_1(A,B) + f_2(A,C) + f_3(C,D) + f_4(B,D) ]
= max_{A,B,C} [ f_1(A,B) + f_2(A,C) + max_D ( f_3(C,D) + f_4(B,D) ) ]
= max_{A,B,C} [ f_1(A,B) + f_2(A,C) + g_1(B,C) ]

As in Bayes nets, maximization is exponential in the tree-width. Here we need only 23 instead of 63 sum operations.
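
A small Python check of this identity on the slide's example; the binary domains and random payoff tables are illustrative assumptions:

```python
from itertools import product
import random

# Verify the elimination identity on the four-function example with random tables.
D = (0, 1)                      # illustrative binary domain for A, B, C, D
rnd = random.Random(0)
f1 = {k: rnd.random() for k in product(D, repeat=2)}  # f1(A, B)
f2 = {k: rnd.random() for k in product(D, repeat=2)}  # f2(A, C)
f3 = {k: rnd.random() for k in product(D, repeat=2)}  # f3(C, D)
f4 = {k: rnd.random() for k in product(D, repeat=2)}  # f4(B, D)

# Brute force: enumerate all joint assignments of A, B, C, D.
brute = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
            for a, b, c, d in product(D, repeat=4))

# Variable elimination: max out D first, producing g1(B, C), then finish.
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in D) for b in D for c in D}
ve = max(f1[a, b] + f2[a, c] + g1[b, c] for a, b, c in product(D, repeat=3))

assert abs(brute - ve) < 1e-12
print(brute, ve)
```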

Representing the Constraints: since the functions are factored, variable elimination can be used to represent the constraints compactly, making the number of constraints exponentially smaller.

Roadmap to Generalization: solve one world; compute a generalizable value function; tackle a new world.

Generalization: sample a set of worlds and solve a linear program for these worlds to obtain class value functions. When faced with a new problem, use the class value function; no re-planning is needed.

Worlds and RMDPs Meta-level MDP: Meta-level LP:

Class-level Value Functions: an approximate solution to the meta-level MDP using a linear approximation. The value function is defined at the class level, and all instances of a class use the same local value function.
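
A hedged sketch of what this gives when instantiated in a particular world ω (the notation is mine; O[C, ω] denotes the objects of class C present in ω):

```latex
% Class-level linear value function instantiated in a world \omega
% (illustrative notation; w_C and V_C are shared by all objects of class C)
V_{\omega}(\mathbf{x}) \;\approx\; \sum_{C}\ \sum_{o \,\in\, O[C,\omega]}
  w_C \, V_C\big(\mathbf{x}[\,o,\ \mathrm{Rel}(o)\,]\big)
```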

Class-level LP: the constraints for each world are represented by the factored LP. The number of worlds is exponential or infinite, so we sample worlds from P(ω).

Theorem: exponentially (infinitely) many worlds ⇒ do we need exponentially many samples? NO! With a bounded number of sampled worlds, the value function is within ε, with probability at least 1 - δ; R_max is the maximum class reward. The proof method is related to [de Farias, Van Roy '02].

LP with sampled worlds: solve the LP for the sampled worlds, using the factored LP for each world, and obtain the class-level value function. In a new world, instantiate the value function and act.

Learning Classes of Objects: which classes of objects have the same value function? Plan for the sampled worlds individually and use the resulting value functions as "training data": find objects with similar values, also including features of the world. Decision tree regression was used in the experiments.
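
As an illustration of this regression step, a minimal sketch using scikit-learn's DecisionTreeRegressor; the features, data, and class labels below are hypothetical placeholders, not the features used in the experiments:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical per-object training data from individually solved worlds: each row
# describes one object by simple world features, and the target is the local value
# that object contributed in the solved instance.
# Illustrative features: [number of neighbors, is_hub_of_topology, degree_of_parent]
X = np.array([
    [3, 1, 0],   # a "server"-like machine
    [2, 0, 3],   # an "intermediate" machine
    [1, 0, 2],   # a "leaf" machine
    [1, 0, 3],
    [3, 1, 0],
])
y = np.array([4.2, 2.1, 1.0, 1.1, 4.0])  # local values from solved instances

# A shallow tree groups objects with similar values; each leaf can be read off as a
# candidate object class that shares one class-level value function.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1).fit(X, y)
leaf_ids = tree.apply(X)   # leaf index = discovered class label per object
print(dict(zip(range(len(y)), leaf_ids)))
```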

Summary of Generalization Algorithm: 1. Model the domain as a Relational MDP. 2. Pick local object value functions V_o. 3. Learn classes by solving some instances. 4. Sample a set of worlds. 5. The factored LP computes the class-level value function.

A New World: when faced with a new world ω, the value function is instantiated from the class-level value functions, and the Q function becomes a sum of local class terms. At each state, choose the action maximizing Q(x,a). The number of actions is exponential, but each Q_C depends only on a few objects!
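
A hedged sketch of the Q function being described (the one-step lookahead form is standard; the per-class decomposition notation is mine):

```latex
% One-step lookahead Q function for the instantiated value function
Q(\mathbf{x}, \mathbf{a}) \;=\; R(\mathbf{x}, \mathbf{a})
  \;+\; \gamma \sum_{\mathbf{x}'} P(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})\, V_{\omega}(\mathbf{x}')
% Because V_\omega is a sum of local class terms and the transition model is factored,
% Q decomposes into local terms, each over a few objects and their actions:
Q(\mathbf{x}, \mathbf{a}) \;\approx\; \sum_{C}\ \sum_{o \,\in\, O[C,\omega]}
  Q_{C}\big(\mathbf{x}[\,o,\ \mathrm{Rel}(o)\,],\ \mathbf{a}[\,o,\ \mathrm{Rel}(o)\,]\big),
\qquad
\mathbf{a}^{*}(\mathbf{x}) \;=\; \arg\max_{\mathbf{a}} Q(\mathbf{x}, \mathbf{a})
```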

Local Q Function Approximation: Q(A_1,…,A_4, X_1,…,X_4) ≈ Q_1(A_1,A_4, X_1,X_4) + Q_2(A_1,A_2, X_1,X_2) + Q_3(A_2,A_3, X_2,X_3) + Q_4(A_3,A_4, X_3,X_4). Q_3 is associated with agent 3. Limited observability: agent i only observes the variables in Q_i, so agent 3 observes only X_2 and X_3. The agents must choose a joint action to maximize Σ_i Q_i. (Figure: four machines M_1 to M_4, each with its local Q function.)

Maximizing Σ_i Q_i: Coordination Graph. Use variable elimination over the agents' actions for the maximization [Bertele & Brioschi '72] (for a fixed observed state, writing each Q_i over actions only):

max_{A_1,A_2,A_3,A_4} [ Q_1(A_1,A_4) + Q_2(A_1,A_2) + Q_3(A_2,A_3) + Q_4(A_3,A_4) ]
= max_{A_1,A_2,A_3} [ Q_2(A_1,A_2) + Q_3(A_2,A_3) + max_{A_4} ( Q_1(A_1,A_4) + Q_4(A_3,A_4) ) ]
= max_{A_1,A_2,A_3} [ Q_2(A_1,A_2) + Q_3(A_2,A_3) + g_1(A_1,A_3) ]

Each elimination step yields a conditional strategy for the eliminated agent, e.g. "If A_2 attacks and A_3 defends, then A_4 gets $10." This gives limited communication for the optimal action choice: the communication bandwidth equals the induced width of the coordination graph. (Figure: coordination graph over agents A_1, A_2, A_3, A_4.)
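
A minimal Python sketch of this coordination step: agents are eliminated one at a time, each elimination records a conditional best-response table, and the optimal joint action is recovered by back-substitution. The binary action domains, toy payoff tables, and elimination order are illustrative assumptions.

```python
from itertools import product

ACTIONS = (0, 1)  # illustrative binary action domain per agent

def coordinate(q_factors, order):
    """q_factors: list of (agents_tuple, table) with table[(a_i, a_j, ...)] = payoff.
    Returns (max_value, joint_action) via agent elimination and back-substitution."""
    strategies = []  # (agent, conditioning_agents, best_response_table)
    for agent in order:
        touching = [f for f in q_factors if agent in f[0]]
        rest = [f for f in q_factors if agent not in f[0]]
        scope = tuple(sorted({x for s, _ in touching for x in s} - {agent}))
        new_table, best_resp = {}, {}
        for assign in product(ACTIONS, repeat=len(scope)):
            ctx = dict(zip(scope, assign))
            scored = [(sum(t[tuple(({**ctx, agent: a})[v] for v in s)]
                           for s, t in touching), a)
                      for a in ACTIONS]
            new_table[assign], best_resp[assign] = max(scored)
        strategies.append((agent, scope, best_resp))
        q_factors = rest + [(scope, new_table)]
    max_value = sum(t[()] for _, t in q_factors)
    # Back-substitution: recover each eliminated agent's action in reverse order.
    joint = {}
    for agent, scope, best_resp in reversed(strategies):
        joint[agent] = best_resp[tuple(joint[x] for x in scope)]
    return max_value, joint

# Example matching the slide's structure (toy payoffs that reward coordinated 1s):
q = lambda: {k: k[0] * k[1] for k in product(ACTIONS, repeat=2)}
factors = [(('A1', 'A4'), q()), (('A1', 'A2'), q()),
           (('A2', 'A3'), q()), (('A3', 'A4'), q())]
print(coordinate(factors, order=['A4', 'A3', 'A2', 'A1']))
```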

Summary of Algorithm: 1. Model the domain as a Relational MDP. 2. The factored LP computes the class-level value function. 3. Reuse the class-level value function in the new world.

Experimental Results: the SysAdmin problem, on unidirectional ring, server star, and ring-of-rings network topologies.

Generalizing to New Problems

Classes of Objects Discovered: learned 3 classes: Server, Intermediate, and Leaf.

Learning Classes of Objects

Results: 2 Peasants, Gold, Wood, Barracks, 2 Footmen, Enemy, with a reward for a dead enemy. About 1 million state/action pairs. Solved with the factored LP; some factors are exponential. A coordination graph is used for action selection. [with Gearhart and Kanodia]

Generalization: 9 Peasants, Gold, Wood, Barracks, 3 Footmen, Enemy, with a reward for a dead enemy. About 3 trillion state/action pairs. Instantiate the generalizable value function; at run-time, the factors are polynomial. A coordination graph is used for action selection.

The 3 aspects of this talk: scaling up collaborative multiagent planning by exploiting structure and generalization; factored representations and algorithms (Relational MDPs, the factored LP, the coordination graph); and Freecraft as a benchmark domain.

Conclusions: the RMDP is a compact representation for a set of similar planning problems. Solve a single instance with factored MDP algorithms; tackle sets of problems with class-level value functions, efficient sampling of worlds, and learned classes of value functions. Generalization to new domains avoids replanning and lets us solve larger, more complex MDPs.