1
Generalizing Plans to New Environments in Relational MDPs
Carlos Guestrin Daphne Koller Chris Gearhart Neal Kanodia Stanford University
2
Collaborative Multiagent Planning
Long-term goals. Multiple agents. Coordinated decisions.
Examples: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
3
Real-time Strategy Game
Peasants collect resources and build. Footmen attack enemies. Buildings train peasants and footmen.
4
Structure in Representation: Factored MDP
[Boutilier et al. '95]
[Figure: dynamic Bayesian network over time slices t and t+1. State variables (Peasant, Footman, Enemy, Gold) and decision variables (APeasant, AFootman) at time t determine the next-slice variables P', F', E', G', e.g. P(F'|F, G, AF), plus the reward R.]
State dynamics, decisions, and rewards are all factored.
# states exponential; # actions exponential; exact solution is intractable.
Complexity of representation: exponential in # parents (worst case).
5
Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]:
Structured V: V(x) ≈ Σi wi hi(x)
Structured Q: Q(x,a) ≈ Σi wi Qi(x,a), where each Qi depends on a small number of Ai's and Xj's, e.g. Qi(Footman, Gold, AFootman)
Each hi is the status of a small part of a complex system: the state of a footman and its enemy, the status of the barracks, or the status of the barracks together with the state of a footman, e.g. Vi(Footman).
Must find w giving a good approximate value function.
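To make the decomposition concrete, here is a minimal Python sketch of a structured value function as a weighted sum of restricted-scope basis functions. The state layout, basis functions, and weights are invented for illustration; they are not from the talk.

```python
# A minimal sketch: V(x) is approximated as a weighted sum of basis
# functions, each looking only at a small part of the state.

def v_footman(state):
    # Restricted scope: depends only on one footman and its enemy.
    return 1.0 if state["F1.health"] > state["E1.health"] else -1.0

def v_barracks(state):
    # Restricted scope: depends only on the barracks status.
    return 0.5 if state["barracks_built"] else 0.0

basis = [v_footman, v_barracks]
weights = [2.0, 1.0]  # the w_i a planner would find (here: made up)

def v_approx(state):
    """V(x) ~ sum_i w_i * h_i(x) over restricted-scope basis functions."""
    return sum(w * h(state) for w, h in zip(weights, basis))

state = {"F1.health": 3, "E1.health": 1, "barracks_built": True}
print(v_approx(state))  # 2.0 * 1.0 + 1.0 * 0.5 = 2.5
```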
6
Approximate LP Solution
[Schweitzer and Seidmann '85]
minimize: Σx Σo wo Vo(x)
subject to: Σo wo Vo(x) ≥ Σo wo Qo(x,a)  ∀ x, a   (i.e. V(x) ≥ Q(x,a))
One variable wo for each object basis function → polynomial number of LP variables.
One constraint for every state and action → exponentially many LP constraints.
Efficient LP decomposition [Guestrin et al. '01]: functions depend on small sets of variables → polynomial-time solution.
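As an illustration of the approximate LP, the sketch below solves it on a tiny explicit MDP with scipy, writing Q(x,a) = R(x,a) + gamma * E[V(x')|x,a]. All numbers are made up, and the states are enumerated for clarity; the point of the factored LP is precisely to avoid this enumeration.

```python
# Approximate LP on a tiny explicit MDP (2 states, 2 actions, made-up numbers).
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[0.0, 1.0],                      # R[x, a]
              [2.0, 0.0]])
P = np.zeros((n_states, n_actions, n_states))  # P[x, a, x']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.9, 0.1]

# Basis functions h_i(x); indicator bases make V_w exact on this toy problem.
H = np.eye(n_states)  # H[x, i] = h_i(x)

# minimize sum_x V_w(x)  s.t.  V_w(x) >= R(x,a) + gamma * E[V_w(x') | x, a]
c = H.sum(axis=0)  # objective coefficients on the weights w
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        coeff = H[x] - gamma * P[x, a] @ H  # h_i(x) - gamma * E[h_i(x')]
        A_ub.append(-coeff)                 # flip sign: linprog wants A_ub w <= b_ub
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("w =", res.x)  # weights of the approximate value function
```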
7
Summary of Multiagent Algorithm
[Guestrin et al. '01, '02]
Offline: model the world as a factored MDP; select basis functions hi; the factored LP computes the value function (weights w and local Qo's).
Online: observe state x from the real world; the coordination graph computes argmaxa Q(x,a) and the agents act.
8
Planning Complex Environments
When faced with a complex problem, exploit structure: for planning and for action selection.
Given a new problem, we must replan from scratch: it is a different MDP, hence a new planning problem.
Huge problems are intractable, even with the factored LP.
9
Generalizing to New Problems
Many problems are “similar”: solve Problem 1, Problem 2, …, Problem n, and obtain a good solution to Problem n+1.
But the MDPs are different! Different sets of states, actions, rewards, transitions, …
10
Generalization with Relational MDPs
“Similar” domains have similar “types” of objects → Relational MDP.
Exploit similarities by computing generalizable value functions → Generalization: avoid the need to replan; tackle larger problems.
11
Relational Models and MDPs
Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
Relations: Collects, Builds, Trains, Attacks, …
Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Builds on Probabilistic Relational Models [Koller, Pfeffer '98].
12
Relational MDPs
[Figure: class-level DBN fragment. A Footman's next Health H' depends on its own Health, its action AFootman, and the Health of its linked my_enemy (an Enemy instance); rewards R are aggregated with a Count.]
Class-level transition probabilities depend on: attributes, actions, and attributes of related objects.
Class-level reward function.
Very compact representation! Does not depend on # of objects.
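A possible way to picture class-level dynamics in code: one transition and reward template per class, shared by all instances, so the representation's size is independent of the number of objects. Function names, probabilities, and the retreat action below are illustrative assumptions, not the paper's model.

```python
# Class-level templates: every instance of a class shares these functions.

def footman_health_transition(health, action, enemy_health):
    """P(F.H' | F.H, A_F, F.my_enemy.H): depends on the footman's own
    attributes, its action, and attributes of the linked enemy object."""
    if health == 0:
        return {0: 1.0}                      # already down, stays down
    p_hurt = 0.3 if enemy_health > 0 else 0.0
    if action == "retreat":                  # made-up action effect
        p_hurt /= 2
    return {health - 1: p_hurt, health: 1.0 - p_hurt}

def enemy_reward(enemy_health):
    """Class-level reward, paid once per Enemy instance (Count aggregator)."""
    return 1.0 if enemy_health == 0 else 0.0

CLASS_TEMPLATES = {
    "Footman": {"transition": footman_health_transition},
    "Enemy": {"reward": enemy_reward},
}
```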
13
World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP.
Instantiation (world): the # of instances of each class and the links between instances yield a well-defined factored MDP.
14
World with 2 Footmen and 2 Enemies
[Figure: ground DBN for this world. Footman1's next health F1.H' depends on F1.Health, its action F1.A, and E1.Health; Enemy1's E1.H' likewise, with reward R1. Footman2 and Enemy2 form a symmetric second pair with reward R2.]
15
World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP.
Instantiate the world → a well-defined factored MDP → use the factored LP for planning.
But then we have gained nothing!
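A small sketch of the instantiation step, under the same illustrative class names as above: given instance counts and links, stamp out one ground object per instance to obtain the factored MDP.

```python
# Instantiating a world: the ground model grows with the number of objects,
# but the class templates do not. The pairing of footman i with enemy i is
# an arbitrary illustrative linking.

def instantiate_world(n_footmen, n_enemies):
    objects, links = [], {}
    for i in range(1, n_footmen + 1):
        objects.append(("Footman", f"F{i}"))
        links[f"F{i}.my_enemy"] = f"E{i}"
    for i in range(1, n_enemies + 1):
        objects.append(("Enemy", f"E{i}"))
    # Each object contributes its class's (small) CPD and reward factor
    # to the resulting factored MDP.
    return {"objects": objects, "links": links}

world = instantiate_world(n_footmen=2, n_enemies=2)
print(world["objects"])
```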
16
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H).
Units are interchangeable! So VF1 = VF2 = VF and VE1 = VE2 = VE: one class-level function per class.
At state x, each footman still has a different contribution to V (its own arguments differ).
Given the class weights wC, we can instantiate the value function for any world.
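The interchangeability idea can be sketched directly: a single VF shared by all footmen and a single VE shared by all enemies, summed over the instances of any world. The value tables here are invented placeholders.

```python
# Class-level value functions instantiated for an arbitrary world.

def V_F(f_health, e_health):   # shared by all Footman instances (made up)
    return f_health - e_health

def V_E(e_health):             # shared by all Enemy instances (made up)
    return -e_health

def world_value(footmen, enemies):
    """V(x) = sum_f V_F(f.H, f.enemy.H) + sum_e V_E(e.H)."""
    total = sum(V_F(f["H"], enemies[f["enemy"]]["H"]) for f in footmen)
    total += sum(V_E(e["H"]) for e in enemies.values())
    return total

enemies = {"E1": {"H": 1}, "E2": {"H": 2}}
footmen = [{"H": 3, "enemy": "E1"}, {"H": 2, "enemy": "E2"}]
print(world_value(footmen, enemies))  # works for any number of units
```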
17
Computing Class-level VC
minimize: Σx Vw(x), where Vw(x) = ΣC wC Σo∈C VC(x[o])
subject to: ΣC wC Σo∈C VC(x[o]) ≥ ΣC wC Σo∈C QC(x[o], a)   (i.e. V(x) ≥ Q(x,a))  ∀ x, a, in every world
The constraints for each world are represented efficiently by the factored LP.
But the number of worlds is exponential, or infinite!
18
Sampling Worlds
Many worlds are similar. Sample a set I of worlds, and enforce the LP constraints only ∀ x, a in the sampled worlds ω ∈ I.
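A toy sketch of the sampling step; the uniform instance counts below are an arbitrary stand-in for whatever distribution over environments the agent expects to face.

```python
# Sample a set I of worlds; the class-level LP only gets constraints
# for the sampled worlds.
import random

random.seed(0)

def sample_world():
    return {"n_footmen": random.randint(1, 3),
            "n_enemies": random.randint(1, 3)}

I = [sample_world() for _ in range(5)]  # the sample set I
for omega in I:
    # In the full algorithm, the factored LP adds the V(x) >= Q(x, a)
    # constraints of this world, expressed compactly over the weights w_C.
    print("constraints for world:", omega)
```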
19
Factored LP-based Generalization
Sample set I of worlds (e.g. one world with E1, F1, E2, F2, E3, F3; another with E1, F1, E2, F2) → class-level factored LP → class value functions VF, VE → generalize to new worlds.
How many samples are needed?
20
Complexity of Sampling Worlds
Exponentially many worlds → need exponentially many samples?
# objects in a world is unbounded → must apply LP decomposition to very large worlds?
NO!
21
(Improved) Theorem
Sample m small worlds of up to O(ln 1/δ) objects each. With m samples, the value function is within O(ε) of the class-level solution optimized for all worlds, with probability at least 1-δ.
RCmax is the maximum class reward.
22
Learning Subclasses of Objects
[Figure: two sampled worlds with per-object value contributions V1 and V2.]
Find regularities between worlds: objects with similar values belong to the same class.
Plan for the sampled worlds separately.
Used decision tree regression in experiments.
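One plausible rendering of the subclass-discovery step, using scikit-learn's decision tree regression as the talk mentions; the object features and per-object values are fabricated for the example, and leaves of the tree play the role of subclasses.

```python
# Discover subclasses: fit a regression tree from object features to the
# per-object value contributions found by planning in sampled worlds, and
# treat the tree's leaves as subclasses.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object across sampled worlds: [degree, is_root] (made up).
features = np.array([[1, 0], [1, 0], [2, 0], [2, 0], [3, 1], [3, 1]])
values = np.array([1.0, 1.1, 2.0, 2.1, 4.0, 4.2])  # per-object contributions

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(features, values)
subclass = tree.apply(features)  # leaf index = discovered subclass
print(subclass)  # objects with similar values land in the same leaf
```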
23
Summary of Generalization Algorithm
Offline: from the relational MDP model, sample worlds I and learn class definitions C; the factored LP computes the class-level value function wC.
Online: given a new world, observe state x; the coordination graph computes argmaxa Q(x,a) and acts in the real world.
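For the online step, here is a brute-force sketch of coordinated action selection over local Q factors; the real coordination graph computes the same argmax by variable elimination rather than enumerating joint actions. Factor values are made up.

```python
# Online action selection: argmax_a sum_j Q_j(x, a), where each Q_j touches
# only a few agents. Enumerating joint actions is exponential; shown here
# only for clarity on a 2-agent example.
from itertools import product

agents = ["F1", "F2"]
actions = ["attack", "defend"]

def q1(a):   # local factor: depends on F1 only
    return {"attack": 2.0, "defend": 1.0}[a["F1"]]

def q2(a):   # local factor: depends on both footmen (a coordination edge)
    return 1.5 if a["F1"] == a["F2"] else 0.0

best = max((dict(zip(agents, joint))
            for joint in product(actions, repeat=len(agents))),
           key=lambda a: q1(a) + q2(a))
print(best)  # the coordinated joint action
```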
24
Experimental Results: SysAdmin problem
25
Generalizing to New Problems
26
Classes of Objects Discovered
Learned 3 classes: Leaf, Intermediate, Server.
27
Learning Classes of Objects
28
Strategic / Tactical
29
Strategic 2x2
Offline: relational MDP model with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the value function Qo.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
30
Strategic 9x3
Offline: relational MDP model with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs; grows exponentially in # agents); the factored LP computes the value function Qo.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
31
Strategic - Generalization
Offline: relational MDP model; the factored LP computes the class-level value function wC on the small world (2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks; ~1 million state/action pairs) and generalizes to the large one (9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks; ~3 trillion state/action pairs); the instantiated Q-functions grow only polynomially in # agents.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
32
Tactical
Planned with 3 Footmen versus 3 Enemies (3 vs. 3); generalized to 4 Footmen versus 4 Enemies (4 vs. 4).
33
Conclusions
Relational MDP representation.
Class-level value functions: an efficient linear program optimizes over sampled environments (Theorem: polynomial sample complexity); generalizes from small to large problems.
Learning subclass definitions.
Generalization of value functions to new worlds: avoid replanning; tackle larger worlds.