Generalizing Plans to New Environments in Relational MDPs. Carlos Guestrin, Daphne Koller, Chris Gearhart, Neal Kanodia. Stanford University.
Collaborative Multiagent Planning: long-term goals, multiple agents, coordinated decisions. Example domains: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
Real-time Strategy Game: Peasants collect resources and build; Footmen attack enemies; Buildings train peasants and footmen.
Structure in Representation: Factored MDP [Boutilier et al. '95]. [Figure: two-slice DBN over time steps t and t+1, with state variables Peasant, Footman, Enemy, Gold, decision variables A_Peasant and A_Footman, reward R, and transition factors such as P(F'|F, G, A_F).] The number of states and the number of actions are exponential, so exact solution is intractable; the complexity of the factored representation is exponential only in the number of parents of each variable (worst case).
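A minimal sketch of one factored transition component, assuming binary state variables; the names follow the slide (F = footman, G = gold, A_F = footman's action) and the probabilities are illustrative:

```python
# P(F' | F, G, A_F): the next value of F depends only on its parents (F, G, A_F),
# so the table is exponential in the number of parents, not in the full state.
P_F_next = {
    # (F, G, A_F) -> probability that F' = 1
    (1, 1, "attack"): 0.8,
    (1, 0, "attack"): 0.6,
    (1, 1, "wait"):   0.9,
    (1, 0, "wait"):   0.9,
    (0, 1, "attack"): 0.0,
    (0, 0, "attack"): 0.0,
    (0, 1, "wait"):   0.0,
    (0, 0, "wait"):   0.0,
}
```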
Structured Value Functions. Use a linear combination of restricted-scope functions [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]: the structured value function is Ṽ(x) = Σ_i w_i h_i(x), where each basis function h_i depends on the status of only a small part of the complex system, e.g., the state of a footman and its enemy, the status of the barracks, or the status of the barracks together with the state of a footman (V_i(Footman)). Likewise the structured Q-function is Q̃(x,a) = Σ_o w_o Q_o(x,a), where each Q_o depends on a small number of A_i's and X_j's (e.g., Q_i(Footman, Gold, A_Footman)). The task is to find weights w that give a good approximate value function.
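A minimal sketch of such a structured value function, assuming the state is a dict of variables; the basis functions and weights below are illustrative, not the paper's:

```python
def h_footman(x):
    # Restricted scope: depends only on one footman/enemy pair.
    return 1.0 if x["F1.Health"] == "healthy" and x["E1.Health"] == "dead" else 0.0

def h_barracks(x):
    # Restricted scope: depends only on the barracks.
    return 1.0 if x["Barracks"] == "built" else 0.0

basis = [h_footman, h_barracks]
w = [2.0, 0.5]  # weights normally produced by the planner; made up here

def V_tilde(x):
    """Approximate value: weighted sum of restricted-scope basis functions."""
    return sum(w_i * h_i(x) for w_i, h_i in zip(w, basis))
```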
Approximate LP Solution [Schweitzer & Seidmann '85]. Minimize Σ_x Σ_o w_o V_o(x), subject to Σ_o w_o V_o(x) ≥ Σ_o w_o Q_o(x,a) for all x, a. There is one variable w_o per basis function, so the number of LP variables is polynomial; but there is one constraint for every state and action, so there are exponentially many LP constraints. Because the functions depend only on small sets of variables, an efficient LP decomposition [Guestrin et al. '01] gives a polynomial-time solution.
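A sketch of solving such an approximate LP on a tiny, fully enumerated two-state MDP with scipy's linprog, writing the constraints as explicit Bellman inequalities; the chain, rewards, and basis functions are illustrative assumptions and do not use the factored decomposition:

```python
import itertools
from scipy.optimize import linprog

states, actions, gamma = [0, 1], [0, 1], 0.9
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.5}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],   # P(x' | x, a)
     (1, 0): [0.1, 0.9], (1, 1): [0.5, 0.5]}
basis = [lambda x: 1.0, lambda x: float(x)]    # h_0 (constant), h_1

# Objective: minimize sum_x sum_i w_i h_i(x).
c = [sum(h(x) for x in states) for h in basis]

# Constraints V_w(x) >= R(x,a) + gamma * E[V_w(x') | x, a] for all x, a,
# rewritten as A_ub @ w <= b_ub for linprog.
A_ub, b_ub = [], []
for x, a in itertools.product(states, actions):
    row = [-(h(x) - gamma * sum(p * h(xn) for xn, p in zip(states, P[(x, a)])))
           for h in basis]
    A_ub.append(row)
    b_ub.append(-R[(x, a)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(basis))
print("basis weights w:", res.x)
```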
Summary of Multiagent Algorithm [Guestrin et al. '01, '02]. Offline: model the world as a factored MDP; select basis functions h_o; the factored LP computes the value function weights w_o and local Q-functions Q_o. Online: observe the state x from the real world; the coordination graph computes the joint action a = argmax_a Q(x,a).
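A sketch of the online coordination step: maximizing a sum of local Q-functions by variable elimination over agents, which is the coordination-graph computation. The data layout and the two-footmen example are illustrative assumptions, not the paper's code:

```python
import itertools

def coordination_graph_argmax(q_funcs, action_domains, elim_order):
    """Maximize a sum of local Q-functions by variable elimination over agents.

    q_funcs: list of (scope, table); scope is a tuple of agent names, table maps
    a tuple of actions (one per agent in scope, in scope order) to a value."""
    factors = [(tuple(scope), dict(table)) for scope, table in q_funcs]
    choice_rules = []  # (agent, conditioning scope, best-response table)

    for agent in elim_order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        new_scope = tuple(sorted({a for s, _ in involved for a in s if a != agent}))
        new_table, best = {}, {}
        for assignment in itertools.product(*(action_domains[a] for a in new_scope)):
            ctx = dict(zip(new_scope, assignment))
            best_val, best_act = None, None
            for act in action_domains[agent]:
                ctx[agent] = act
                val = sum(t[tuple(ctx[a] for a in s)] for s, t in involved)
                if best_val is None or val > best_val:
                    best_val, best_act = val, act
            new_table[assignment], best[assignment] = best_val, best_act
        factors.append((new_scope, new_table))
        choice_rules.append((agent, new_scope, best))

    joint = {}  # backtrack in reverse elimination order to recover the argmax
    for agent, scope, best in reversed(choice_rules):
        joint[agent] = best[tuple(joint[a] for a in scope)]
    return joint

# Two footmen choosing targets; the pairwise term penalizes attacking the same enemy.
domains = {"F1": ["E1", "E2"], "F2": ["E1", "E2"]}
q_funcs = [(("F1",), {("E1",): 1.0, ("E2",): 0.2}),
           (("F1", "F2"), {("E1", "E1"): -0.5, ("E1", "E2"): 0.0,
                           ("E2", "E1"): 0.0, ("E2", "E2"): -0.5})]
print(coordination_graph_argmax(q_funcs, domains, elim_order=["F2", "F1"]))
```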
Planning in Complex Environments. When faced with a complex problem, exploit structure, both for planning and for action selection. But given a new problem we must replan from scratch: it is a different MDP and a new planning problem, and huge problems remain intractable even with the factored LP.
Generalizing to New Problems. Many problems are "similar": solve Problem 1, Problem 2, ..., Problem n, and obtain a good solution to Problem n+1. But the MDPs are different: different sets of states, actions, rewards, transitions, ...
Generalization with Relational MDPs. "Similar" domains have similar "types" of objects. A relational MDP exploits these similarities by computing generalizable value functions. Generalization avoids the need to replan and lets us tackle larger problems.
Relational Models and MDPs. Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, ... Relations: Collects, Builds, Trains, Attacks, ... Instances: Peasant1, Peasant2, Footman1, Enemy1, ... Builds on Probabilistic Relational Models [Koller & Pfeffer '98].
Relational MDPs. [Figure: class-level DBN fragments for the Footman and Enemy classes, with Health and Health' attributes, the my_enemy relation, the action A_Footman, a Count aggregate, and reward R.] Class-level transition probabilities depend on the object's attributes, its actions, and the attributes of related objects; the reward function is also defined at the class level. This is a very compact representation: its size does not depend on the number of objects.
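A sketch of a class-level transition template and its grounding, assuming a Footman class whose health transition depends on its own health, its action, and its linked enemy's health; the attribute names and probabilities are illustrative:

```python
FOOTMAN_TEMPLATE = {
    # (F.Health, A_F, my_enemy.Health) -> P(F.Health' = "dead")
    ("healthy", "attack", "healthy"): 0.2,
    ("healthy", "wait",   "healthy"): 0.1,
    ("healthy", "attack", "dead"):    0.0,
    ("healthy", "wait",   "dead"):    0.0,
    ("dead",    "attack", "healthy"): 1.0,
    ("dead",    "wait",   "healthy"): 1.0,
    ("dead",    "attack", "dead"):    1.0,
    ("dead",    "wait",   "dead"):    1.0,
}

def ground(links):
    """Instantiate one object-level CPT per footman; the class template never grows."""
    return {f: {"parents": (f"{f}.Health", f"A_{f}", f"{e}.Health"),
                "cpt": FOOTMAN_TEMPLATE}
            for f, e in links.items()}

cpts = ground({"F1": "E1", "F2": "E2"})  # 2 footmen and 2 enemies share one template
```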
The World is a Large Factored MDP. A relational MDP together with an instantiation (world), i.e., the number of instances of each class and the links between instances, defines a well-defined factored MDP.
[Figure: world with 2 Footmen and 2 Enemies, instantiated as a DBN with variables F1.Health, F1.A, F1.H', E1.Health, E1.H' and reward R1, and symmetrically F2/E2 with reward R2.]
The World is a Large Factored MDP. Instantiating a world gives a well-defined factored MDP, and we can use the factored LP for planning; but so far we have gained nothing over planning from scratch.
Class-level Value Functions. In the 2-footmen/2-enemies world, the object-level decomposition is V(F1.H, E1.H, F2.H, E2.H) = V_F1(F1.H, E1.H) + V_E1(E1.H) + V_F2(F2.H, E2.H) + V_E2(E2.H). Units are interchangeable, so the object value functions are tied to class-level value functions V_F and V_E: V = V_F(F1.H, E1.H) + V_F(F2.H, E2.H) + V_E(E1.H) + V_E(E2.H). At a state x, each footman still makes a different contribution to V, but given the class-level weights w_C we can instantiate the value function for any world.
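A sketch of instantiating class-level weights in a concrete world, assuming one basis function per class; the class names, scopes, and weights below are illustrative:

```python
w_C = {"Footman": 2.0, "Enemy": -1.5}  # class-level weights (normally from the LP)

def h_footman(x, o):
    # Restricted scope: this footman's own health (the slide's V_F also depends
    # on the linked enemy's health).
    return 1.0 if x[f"{o}.Health"] == "healthy" else 0.0

def h_enemy(x, o):
    return 1.0 if x[f"{o}.Health"] == "healthy" else 0.0

class_basis = {"Footman": h_footman, "Enemy": h_enemy}

def V(x, world):
    """V(x) = sum over classes C and objects o in C[world] of w_C * h_C(x, o)."""
    return sum(w_C[c] * class_basis[c](x, o)
               for c, objects in world.items() for o in objects)

world = {"Footman": ["F1", "F2"], "Enemy": ["E1", "E2"]}
x = {"F1.Health": "healthy", "F2.Health": "wounded",
     "E1.Health": "healthy", "E2.Health": "dead"}
print(V(x, world))  # the same class weights work for any number of objects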
Computing the Class-level Value Function V_C. Minimize Σ_x Σ_C Σ_{o ∈ C[ω]} w_C V_o(x), subject to Σ_C Σ_{o ∈ C[ω]} w_C V_o(x) ≥ Σ_C Σ_{o ∈ C[ω]} w_C Q_o(x,a) for every world ω and all x, a. The constraints for each individual world are represented efficiently by the factored LP, but the number of worlds is exponential or infinite.
Sampling Worlds. Many worlds are similar, so instead of constraining over all worlds, sample a set I of worlds and enforce the LP constraints only for the worlds ω ∈ I (for all x, a).
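A sketch of drawing the sample set I, assuming a world is described simply by how many footman/enemy pairs it contains; the distribution and sizes are illustrative:

```python
import random

def sample_world(rng, max_pairs=4):
    n = rng.randint(1, max_pairs)
    return {"Footman": [f"F{i}" for i in range(1, n + 1)],
            "Enemy":   [f"E{i}" for i in range(1, n + 1)]}

rng = random.Random(0)
I = [sample_world(rng) for _ in range(10)]
# The class-level LP then adds one block of constraints (over all x, a) per world
# in I, each block represented compactly via the factored LP decomposition.
```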
Factored LP-based Generalization. [Figure: a sample set I of small worlds (e.g., {F1, E1, F2, E2} and {F1, E1, F2, E2, F3, E3}) is fed to the class-level factored LP, which produces the class value functions V_F and V_E used for generalization.] How many samples are needed?
Complexity of Sampling Worlds. There are exponentially many worlds; do we need exponentially many samples? The number of objects in a world is unbounded; must we apply the LP decomposition to very large worlds? NO!
(Improved) Theorem. Sample m small worlds, each with up to O(ln 1/δ) objects; the resulting value function is within O(ε) of the class-level solution optimized for all worlds, with probability at least 1−δ. The required number of samples m is polynomial and depends on R_Cmax, the maximum class reward.
Learning Subclasses of Objects. [Figure: two sampled worlds with per-object values V1 and V2.] Find regularities between worlds: objects with similar values belong to the same subclass. Plan for the sampled worlds separately, then group the objects; the experiments used decision tree regression.
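A sketch of subclass discovery with decision tree regression, assuming we have per-object features and per-object value contributions obtained by planning for each sampled world separately; scikit-learn, the feature names, and the numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object per sampled world: [degree, is_server, load] (made-up features).
features = np.array([[1, 0, 0.2], [1, 0, 0.3], [3, 0, 0.5],
                     [3, 0, 0.6], [5, 1, 0.9], [5, 1, 0.8]])
values = np.array([1.0, 1.1, 2.4, 2.5, 4.0, 3.9])  # object's contribution to V

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(features, values)
subclasses = tree.apply(features)  # leaf index = learned subclass of each object
print(subclasses)
```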
Summary of Generalization Algorithm. Offline: from the relational MDP model, the sampled worlds I, and the class definitions C, the factored LP computes the class-level value function weights w_C. Online: given a new world, instantiate the value function; observe the state x from the real world; the coordination graph computes a = argmax_a Q(x,a).
Experimental Results: the SysAdmin problem.
Generalizing to New Problems
Classes of Objects Discovered: the algorithm learned 3 classes: Leaf, Intermediate, and Server.
Learning Classes of Objects
Two game scenarios: Strategic and Tactical.
Strategic 2x2. Offline: relational MDP model with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the value function Q_o. Online: the coordination graph computes argmax_a Q(x,a) given the state x from the world.
Strategic 9x3. Offline: relational MDP model with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs); the problem grows exponentially in the number of agents. The factored LP computes the value function Q_o. Online: the coordination graph computes argmax_a Q(x,a).
Strategic: Generalization. Offline: plan with the relational MDP model for 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the class-level value function weights w_C. The instantiated Q-functions grow only polynomially in the number of agents, so the same w_C is applied to 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs). Online: the coordination graph computes argmax_a Q(x,a).
Tactical. Planned in the 3-footmen-versus-3-enemies scenario, then generalized to 4 footmen versus 4 enemies.
Conclusions. Relational MDP representation; class-level value functions; an efficient linear program optimizes over sampled environments (Theorem: polynomial sample complexity), generalizing from small to large problems; learning of subclass definitions. Generalization of value functions to new worlds avoids replanning and lets us tackle larger worlds.