Generalizing Plans to New Environments in Relational MDPs. Carlos Guestrin, Daphne Koller, Chris Gearhart, Neal Kanodia. Stanford University.
Collaborative Multiagent Planning: long-term goals, multiple agents, coordinated decisions. Example domains: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
Real-time Strategy Game: Peasants collect resources and build; Footmen attack enemies; Buildings train peasants and footmen.
Structure in Representation: Factored MDP [Boutilier et al. '95]. [Figure: two-slice DBN over time steps t and t+1, with state variables Peasant, Footman, Enemy, Gold, decision variables A_Peasant and A_Footman, reward R, and transition factors such as P(F'|F, G, A_F).] The number of states and the number of actions are exponential, so exact solution is intractable; the complexity of the factored representation is exponential only in the number of parents of each variable (worst case).
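A minimal sketch of one factored transition component, assuming binary state variables; the names follow the slide (F = footman, G = gold, A_F = footman's action) and the probabilities are illustrative:

```python
# P(F' | F, G, A_F): the next value of F depends only on its parents (F, G, A_F),
# so the table is exponential in the number of parents, not in the full state.
P_F_next = {
    # (F, G, A_F) -> probability that F' = 1
    (1, 1, "attack"): 0.8,
    (1, 0, "attack"): 0.6,
    (1, 1, "wait"):   0.9,
    (1, 0, "wait"):   0.9,
    (0, 1, "attack"): 0.0,
    (0, 0, "attack"): 0.0,
    (0, 1, "wait"):   0.0,
    (0, 0, "wait"):   0.0,
}
```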
Structured Value Functions. Use a linear combination of restricted-scope functions [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]: the structured value function is Ṽ(x) = Σ_i w_i h_i(x), where each basis function h_i depends on the status of only a small part of the complex system, e.g., the state of a footman and its enemy, the status of the barracks, or the status of the barracks together with the state of a footman (V_i(Footman)). Likewise the structured Q-function is Q̃(x,a) = Σ_o w_o Q_o(x,a), where each Q_o depends on a small number of A_i's and X_j's (e.g., Q_i(Footman, Gold, A_Footman)). The task is to find weights w that give a good approximate value function.
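A minimal sketch of such a structured value function, assuming the state is a dict of variables; the basis functions and weights below are illustrative, not the paper's:

```python
def h_footman(x):
    # Restricted scope: depends only on one footman/enemy pair.
    return 1.0 if x["F1.Health"] == "healthy" and x["E1.Health"] == "dead" else 0.0

def h_barracks(x):
    # Restricted scope: depends only on the barracks.
    return 1.0 if x["Barracks"] == "built" else 0.0

basis = [h_footman, h_barracks]
w = [2.0, 0.5]  # weights normally produced by the planner; made up here

def V_tilde(x):
    """Approximate value: weighted sum of restricted-scope basis functions."""
    return sum(w_i * h_i(x) for w_i, h_i in zip(w, basis))
```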
Approximate LP Solution [Schweitzer & Seidmann '85]. Minimize Σ_x Σ_o w_o V_o(x), subject to Σ_o w_o V_o(x) ≥ Σ_o w_o Q_o(x,a) for all x, a. There is one variable w_o per basis function, so the number of LP variables is polynomial; but there is one constraint for every state and action, so there are exponentially many LP constraints. Because the functions depend only on small sets of variables, an efficient LP decomposition [Guestrin et al. '01] gives a polynomial-time solution.
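A sketch of solving such an approximate LP on a tiny, fully enumerated two-state MDP with scipy's linprog, writing the constraints as explicit Bellman inequalities; the chain, rewards, and basis functions are illustrative assumptions and do not use the factored decomposition:

```python
import itertools
from scipy.optimize import linprog

states, actions, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],   # P(x' | x, a)
     (1, 0): [0.1, 0.9], (1, 1): [0.5, 0.5]}
basis = [lambda x: 1.0, lambda x: float(x)]    # h_0 (constant), h_1

# Objective: minimize sum_x sum_i w_i h_i(x).
c = [sum(h(x) for x in states) for h in basis]

# Constraints V_w(x) >= R(x,a) + gamma * E[V_w(x') | x, a] for all x, a,
# rewritten as A_ub @ w <= b_ub for linprog.
A_ub, b_ub = [], []
for x, a in itertools.product(states, actions):
    row = [-(h(x) - gamma * sum(p * h(xn) for xn, p in zip(states, P[(x, a)])))
           for h in basis]
    A_ub.append(row)
    b_ub.append(-R[(x, a)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(basis))
print("basis weights w:", res.x)
```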
Summary of Multiagent Algorithm [Guestrin et al. '01, '02]. Offline: model the world as a factored MDP; select basis functions h_o; the factored LP computes the value function weights w_o and local Q-functions Q_o. Online: observe the state x from the real world; the coordination graph computes the joint action a = argmax_a Q(x,a).
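A sketch of the online coordination step: maximizing a sum of local Q-functions by variable elimination over agents, which is the coordination-graph computation. The data layout and the two-footmen example are illustrative assumptions, not the paper's code:

```python
import itertools

def coordination_graph_argmax(q_funcs, action_domains, elim_order):
    """Maximize a sum of local Q-functions by variable elimination over agents.

    q_funcs: list of (scope, table); scope is a tuple of agent names, table maps
    a tuple of actions (one per agent in scope, in scope order) to a value."""
    factors = [(tuple(scope), dict(table)) for scope, table in q_funcs]
    choice_rules = []  # (agent, conditioning scope, best-response table)

    for agent in elim_order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        new_scope = tuple(sorted({a for s, _ in involved for a in s if a != agent}))
        new_table, best = {}, {}
        for assignment in itertools.product(*(action_domains[a] for a in new_scope)):
            ctx = dict(zip(new_scope, assignment))
            best_val, best_act = None, None
            for act in action_domains[agent]:
                ctx[agent] = act
                val = sum(t[tuple(ctx[a] for a in s)] for s, t in involved)
                if best_val is None or val > best_val:
                    best_val, best_act = val, act
            new_table[assignment], best[assignment] = best_val, best_act
        factors.append((new_scope, new_table))
        choice_rules.append((agent, new_scope, best))

    joint = {}  # backtrack in reverse elimination order to recover the argmax
    for agent, scope, best in reversed(choice_rules):
        joint[agent] = best[tuple(joint[a] for a in scope)]
    return joint

# Two footmen choosing targets; the pairwise term penalizes attacking the same enemy.
domains = {"F1": ["E1", "E2"], "F2": ["E1", "E2"]}
q_funcs = [(("F1",), {("E1",): 1.0, ("E2",): 0.2}),
           (("F1", "F2"), {("E1", "E1"): -0.5, ("E1", "E2"): 0.0,
                           ("E2", "E1"): 0.0, ("E2", "E2"): -0.5})]
print(coordination_graph_argmax(q_funcs, domains, elim_order=["F2", "F1"]))
```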
Planning in Complex Environments. When faced with a complex problem, exploit structure, both for planning and for action selection. But given a new problem we must replan from scratch: it is a different MDP and a new planning problem, and huge problems remain intractable even with the factored LP.
Generalizing to New Problems. Many problems are "similar": solve Problem 1, Problem 2, ..., Problem n, and obtain a good solution to Problem n+1. But the MDPs are different: different sets of states, actions, rewards, transitions, ...
Generalization with Relational MDPs. "Similar" domains have similar "types" of objects. A relational MDP exploits these similarities by computing generalizable value functions. Generalization avoids the need to replan and lets us tackle larger problems.
Relational Models and MDPs. Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ... Relations: Collects, Builds, Trains, Attacks, ... Instances: Peasant1, Peasant2, Footman1, Enemy1, ... Builds on Probabilistic Relational Models [Koller & Pfeffer '98].
Relational MDPs. [Figure: class-level DBN fragments for the Footman and Enemy classes, with Health and Health' attributes, the my_enemy relation, the action A_Footman, a Count aggregate, and reward R.] Class-level transition probabilities depend on the object's attributes, its actions, and the attributes of related objects; the reward function is also defined at the class level. This is a very compact representation: its size does not depend on the number of objects.
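A sketch of a class-level transition template and its grounding, assuming a Footman class whose health transition depends on its own health, its action, and its linked enemy's health; the attribute names and probabilities are illustrative:

```python
FOOTMAN_TEMPLATE = {
    # (F.Health, A_F, my_enemy.Health) -> P(F.Health' = "dead")
    ("healthy", "attack", "healthy"): 0.2,
    ("healthy", "wait",   "healthy"): 0.1,
    ("healthy", "attack", "dead"):    0.0,
    ("healthy", "wait",   "dead"):    0.0,
    ("dead",    "attack", "healthy"): 1.0,
    ("dead",    "wait",   "healthy"): 1.0,
    ("dead",    "attack", "dead"):    1.0,
    ("dead",    "wait",   "dead"):    1.0,
}

def ground(links):
    """Instantiate one object-level CPT per footman; the class template never grows."""
    return {f: {"parents": (f"{f}.Health", f"A_{f}", f"{e}.Health"),
                "cpt": FOOTMAN_TEMPLATE}
            for f, e in links.items()}

cpts = ground({"F1": "E1", "F2": "E2"})  # 2 footmen and 2 enemies share one template
```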
The World is a Large Factored MDP. A relational MDP together with an instantiation (world), i.e., the number of instances of each class and the links between instances, defines a well-defined factored MDP.
[Figure: world with 2 Footmen and 2 Enemies, instantiated as a DBN with variables F1.Health, F1.A, F1.H', E1.Health, E1.H' and reward R1, and symmetrically F2/E2 with reward R2.]
The World is a Large Factored MDP. Instantiating a world gives a well-defined factored MDP, and we can use the factored LP for planning; but so far we have gained nothing over planning from scratch.
Class-level Value Functions. In the 2-footmen/2-enemies world, the object-level decomposition is V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H). Units are interchangeable, so the object value functions are tied to class-level value functions V_F and V_E: V = V_F(F1.H, E1.H) + V_F(F2.H, E2.H) + V_E(E1.H) + V_E(E2.H). At a state x, each footman still makes a different contribution to V, but given the class-level weights w_C we can instantiate the value function for any world.
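A sketch of instantiating class-level weights in a concrete world, assuming one basis function per class; the class names, scopes, and weights below are illustrative:

```python
w_C = {"Footman": 2.0, "Enemy": -1.5}  # class-level weights (normally from the LP)

def h_footman(x, o):
    # Restricted scope: this footman's own health (the slide's V_F also depends
    # on the linked enemy's health).
    return 1.0 if x[f"{o}.Health"] == "healthy" else 0.0

def h_enemy(x, o):
    return 1.0 if x[f"{o}.Health"] == "healthy" else 0.0

class_basis = {"Footman": h_footman, "Enemy": h_enemy}

def V(x, world):
    """V(x) = sum over classes C and objects o in C[world] of w_C * h_C(x, o)."""
    return sum(w_C[c] * class_basis[c](x, o)
               for c, objects in world.items() for o in objects)

world = {"Footman": ["F1", "F2"], "Enemy": ["E1", "E2"]}
x = {"F1.Health": "healthy", "F2.Health": "wounded",
     "E1.Health": "healthy", "E2.Health": "dead"}
print(V(x, world))  # the same class weights work for any number of objects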
Computing the Class-level Value Function V_C. Minimize Σ_x Σ_C Σ_{o ∈ C[ω]} w_C V_o(x), subject to Σ_C Σ_{o ∈ C[ω]} w_C V_o(x) ≥ Σ_C Σ_{o ∈ C[ω]} w_C Q_o(x,a) for every world ω and all x, a. The constraints for each individual world are represented efficiently by the factored LP, but the number of worlds is exponential or infinite.
Sampling Worlds. Many worlds are similar, so instead of constraining over all worlds, sample a set I of worlds and enforce the LP constraints only for the worlds ω ∈ I (for all x, a).
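A sketch of drawing the sample set I, assuming a world is described simply by how many footman/enemy pairs it contains; the distribution and sizes are illustrative:

```python
import random

def sample_world(rng, max_pairs=4):
    n = rng.randint(1, max_pairs)
    return {"Footman": [f"F{i}" for i in range(1, n + 1)],
            "Enemy":   [f"E{i}" for i in range(1, n + 1)]}

rng = random.Random(0)
I = [sample_world(rng) for _ in range(10)]
# The class-level LP then adds one block of constraints (over all x, a) per world
# in I, each block represented compactly via the factored LP decomposition.
```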
Factored LP-based Generalization. [Figure: a sample set I of small worlds (e.g., {F1, E1, F2, E2} and {F1, E1, F2, E2, F3, E3}) is fed to the class-level factored LP, which produces the class value functions V_F and V_E used for generalization.] How many samples are needed?
Complexity of Sampling Worlds. There are exponentially many worlds; do we need exponentially many samples? The number of objects in a world is unbounded; must we apply the LP decomposition to very large worlds? NO!
(Improved) Theorem. Sample m small worlds, each with up to O(ln 1/δ) objects; the resulting value function is within O(ε) of the class-level solution optimized for all worlds, with probability at least 1−δ. The required number of samples m is polynomial and depends on R_Cmax, the maximum class reward.
Learning Subclasses of Objects. [Figure: two sampled worlds with per-object values V1 and V2.] Find regularities between worlds: objects with similar values belong to the same subclass. Plan for the sampled worlds separately, then group the objects; the experiments used decision tree regression.
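A sketch of subclass discovery with decision tree regression, assuming we have per-object features and per-object value contributions obtained by planning for each sampled world separately; scikit-learn, the feature names, and the numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object per sampled world: [degree, is_server, load] (made-up features).
features = np.array([[1, 0, 0.2], [1, 0, 0.3], [3, 0, 0.5],
                     [3, 0, 0.6], [5, 1, 0.9], [5, 1, 0.8]])
values = np.array([1.0, 1.1, 2.4, 2.5, 4.0, 3.9])  # object's contribution to V

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(features, values)
subclasses = tree.apply(features)  # leaf index = learned subclass of each object
print(subclasses)
```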
Summary of Generalization Algorithm. Offline: from the relational MDP model, the sampled worlds I, and the class definitions C, the factored LP computes the class-level value function weights w_C. Online: given a new world, instantiate the value function; observe the state x from the real world; the coordination graph computes a = argmax_a Q(x,a).
Experimental Results: the SysAdmin problem.
Generalizing to New Problems
Classes of Objects Discovered: the algorithm learned 3 classes: Leaf, Intermediate, and Server.
Learning Classes of Objects
Two game scenarios: Strategic and Tactical.
Strategic 2x2. Offline: relational MDP model with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the value function Q_o. Online: the coordination graph computes argmax_a Q(x,a) given the state x from the world.
Strategic 9x3. Offline: relational MDP model with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs); the problem grows exponentially in the number of agents. The factored LP computes the value function Q_o. Online: the coordination graph computes argmax_a Q(x,a).
Strategic: Generalization. Offline: plan with the relational MDP model for 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the class-level value function weights w_C. The instantiated Q-functions grow only polynomially in the number of agents, so the same w_C is applied to 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs). Online: the coordination graph computes argmax_a Q(x,a).
Tactical. Planned in the 3-footmen-versus-3-enemies scenario, then generalized to 4 footmen versus 4 enemies.
Conclusions. Relational MDP representation; class-level value functions; an efficient linear program optimizes over sampled environments (Theorem: polynomial sample complexity), generalizing from small to large problems; learning of subclass definitions. Generalization of value functions to new worlds avoids replanning and lets us tackle larger worlds.