1
Generalizing Plans to New Environments in Relational MDPs
Carlos Guestrin Daphne Koller Chris Gearhart Neal Kanodia Stanford University
2
Collaborative Multiagent Planning
Long-term goals. Multiple agents. Coordinated decisions.
Examples: search and rescue, factory management, supply chain, firefighting, network routing, air traffic control.
3
Real-time Strategy Game
Peasants collect resources and build. Footmen attack enemies. Buildings train peasants and footmen.
4
Structure in Representation: Factored MDP
[Boutilier et al. '95]
[Figure: dynamic Bayesian network over time slices t and t+1. State variables (Peasant, Footman, Enemy, Gold) and decision variables (APeasant, AFootman) at time t determine the next-slice variables P', F', E', G', e.g. P(F'|F, G, AF), plus the reward R.]
State dynamics, decisions, and rewards are all factored.
# states exponential; # actions exponential; exact solution is intractable.
Complexity of representation: exponential in # parents (worst case).
5
Structured Value Functions
Linear combination of restricted-domain functions [Bellman et al. '63; Tsitsiklis & Van Roy '96; Koller & Parr '99, '00; Guestrin et al. '01]:
Structured V: V(x) ≈ Σi wi hi(x)
Structured Q: Q(x,a) ≈ Σi wi Qi(x,a), where each Qi depends on a small number of Ai's and Xj's, e.g. Qi(Footman, Gold, AFootman)
Each hi is the status of a small part of a complex system: the state of a footman and its enemy, the status of the barracks, or the status of the barracks together with the state of a footman, e.g. Vi(Footman).
Must find w giving a good approximate value function.
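To make the decomposition concrete, here is a minimal Python sketch of a structured value function as a weighted sum of restricted-scope basis functions. The state layout, basis functions, and weights are invented for illustration; they are not from the talk.

```python
# A minimal sketch: V(x) is approximated as a weighted sum of basis
# functions, each looking only at a small part of the state.

def v_footman(state):
    # Restricted scope: depends only on one footman and its enemy.
    return 1.0 if state["F1.health"] > state["E1.health"] else -1.0

def v_barracks(state):
    # Restricted scope: depends only on the barracks status.
    return 0.5 if state["barracks_built"] else 0.0

basis = [v_footman, v_barracks]
weights = [2.0, 1.0]  # the w_i a planner would find (here: made up)

def v_approx(state):
    """V(x) ~ sum_i w_i * h_i(x) over restricted-scope basis functions."""
    return sum(w * h(state) for w, h in zip(weights, basis))

state = {"F1.health": 3, "E1.health": 1, "barracks_built": True}
print(v_approx(state))  # 2.0 * 1.0 + 1.0 * 0.5 = 2.5
```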
6
Approximate LP Solution
[Schweitzer and Seidmann '85]
minimize: Σx Σo wo Vo(x)
subject to: Σo wo Vo(x) ≥ Σo wo Qo(x,a)  ∀ x, a   (i.e. V(x) ≥ Q(x,a))
One variable wo for each object basis function → polynomial number of LP variables.
One constraint for every state and action → exponentially many LP constraints.
Efficient LP decomposition [Guestrin et al. '01]: functions depend on small sets of variables → polynomial-time solution.
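As an illustration of the approximate LP, the sketch below solves it on a tiny explicit MDP with scipy, writing Q(x,a) = R(x,a) + gamma * E[V(x')|x,a]. All numbers are made up, and the states are enumerated for clarity; the point of the factored LP is precisely to avoid this enumeration.

```python
# Approximate LP on a tiny explicit MDP (2 states, 2 actions, made-up numbers).
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[0.0, 1.0],                      # R[x, a]
              [2.0, 0.0]])
P = np.zeros((n_states, n_actions, n_states))  # P[x, a, x']
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.9, 0.1]

# Basis functions h_i(x); indicator bases make V_w exact on this toy problem.
H = np.eye(n_states)  # H[x, i] = h_i(x)

# minimize sum_x V_w(x)  s.t.  V_w(x) >= R(x,a) + gamma * E[V_w(x') | x, a]
c = H.sum(axis=0)  # objective coefficients on the weights w
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        coeff = H[x] - gamma * P[x, a] @ H  # h_i(x) - gamma * E[h_i(x')]
        A_ub.append(-coeff)                 # flip sign: linprog wants A_ub w <= b_ub
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("w =", res.x)  # weights of the approximate value function
```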
7
Summary of Multiagent Algorithm
[Guestrin et al. '01, '02]
Offline: model the world as a factored MDP; select basis functions hi; the factored LP computes the value function (weights w and local Qo's).
Online: observe state x from the real world; the coordination graph computes argmaxa Q(x,a) and the agents act.
8
Planning Complex Environments
When faced with a complex problem, exploit structure: for planning and for action selection.
Given a new problem, we must replan from scratch: it is a different MDP, hence a new planning problem.
Huge problems are intractable, even with the factored LP.
9
Generalizing to New Problems
Many problems are “similar”: solve Problem 1, Problem 2, …, Problem n, and obtain a good solution to Problem n+1.
But the MDPs are different! Different sets of states, actions, rewards, transitions, …
10
Generalization with Relational MDPs
“Similar” domains have similar “types” of objects → Relational MDP.
Exploit similarities by computing generalizable value functions → Generalization: avoid the need to replan; tackle larger problems.
11
Relational Models and MDPs
Classes: Peasant, Gold, Wood, Barracks, Footman, Enemy, …
Relations: Collects, Builds, Trains, Attacks, …
Instances: Peasant1, Peasant2, Footman1, Enemy1, …
Builds on Probabilistic Relational Models [Koller, Pfeffer '98].
12
Relational MDPs
[Figure: class-level DBN fragment. A Footman's next Health H' depends on its own Health, its action AFootman, and the Health of its linked my_enemy (an Enemy instance); rewards R are aggregated with a Count.]
Class-level transition probabilities depend on: attributes, actions, and attributes of related objects.
Class-level reward function.
Very compact representation! Does not depend on # of objects.
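A possible way to picture class-level dynamics in code: one transition and reward template per class, shared by all instances, so the representation's size is independent of the number of objects. Function names, probabilities, and the retreat action below are illustrative assumptions, not the paper's model.

```python
# Class-level templates: every instance of a class shares these functions.

def footman_health_transition(health, action, enemy_health):
    """P(F.H' | F.H, A_F, F.my_enemy.H): depends on the footman's own
    attributes, its action, and attributes of the linked enemy object."""
    if health == 0:
        return {0: 1.0}                      # already down, stays down
    p_hurt = 0.3 if enemy_health > 0 else 0.0
    if action == "retreat":                  # made-up action effect
        p_hurt /= 2
    return {health - 1: p_hurt, health: 1.0 - p_hurt}

def enemy_reward(enemy_health):
    """Class-level reward, paid once per Enemy instance (Count aggregator)."""
    return 1.0 if enemy_health == 0 else 0.0

CLASS_TEMPLATES = {
    "Footman": {"transition": footman_health_transition},
    "Enemy": {"reward": enemy_reward},
}
```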
13
World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP.
Instantiation (world): the # of instances of each class and the links between instances yield a well-defined factored MDP.
14
World with 2 Footmen and 2 Enemies
[Figure: ground DBN for this world. Footman1's next health F1.H' depends on F1.Health, its action F1.A, and E1.Health; Enemy1's E1.H' likewise, with reward R1. Footman2 and Enemy2 form a symmetric second pair with reward R2.]
15
World is a Large Factored MDP
Relational MDP + # of objects + links between objects → Factored MDP.
Instantiate the world → a well-defined factored MDP → use the factored LP for planning.
But then we have gained nothing!
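A small sketch of the instantiation step, under the same illustrative class names as above: given instance counts and links, stamp out one ground object per instance to obtain the factored MDP.

```python
# Instantiating a world: the ground model grows with the number of objects,
# but the class templates do not. The pairing of footman i with enemy i is
# an arbitrary illustrative linking.

def instantiate_world(n_footmen, n_enemies):
    objects, links = [], {}
    for i in range(1, n_footmen + 1):
        objects.append(("Footman", f"F{i}"))
        links[f"F{i}.my_enemy"] = f"E{i}"
    for i in range(1, n_enemies + 1):
        objects.append(("Enemy", f"E{i}"))
    # Each object contributes its class's (small) CPD and reward factor
    # to the resulting factored MDP.
    return {"objects": objects, "links": links}

world = instantiate_world(n_footmen=2, n_enemies=2)
print(world["objects"])
```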
16
Class-level Value Functions
V(F1.H, E1.H, F2.H, E2.H) = VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H).
Units are interchangeable! So VF1 = VF2 = VF and VE1 = VE2 = VE: one class-level function per class.
At state x, each footman still has a different contribution to V (its own arguments differ).
Given the class weights wC, we can instantiate the value function for any world.
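The interchangeability idea can be sketched directly: a single VF shared by all footmen and a single VE shared by all enemies, summed over the instances of any world. The value tables here are invented placeholders.

```python
# Class-level value functions instantiated for an arbitrary world.

def V_F(f_health, e_health):   # shared by all Footman instances (made up)
    return f_health - e_health

def V_E(e_health):             # shared by all Enemy instances (made up)
    return -e_health

def world_value(footmen, enemies):
    """V(x) = sum_f V_F(f.H, f.enemy.H) + sum_e V_E(e.H)."""
    total = sum(V_F(f["H"], enemies[f["enemy"]]["H"]) for f in footmen)
    total += sum(V_E(e["H"]) for e in enemies.values())
    return total

enemies = {"E1": {"H": 1}, "E2": {"H": 2}}
footmen = [{"H": 3, "enemy": "E1"}, {"H": 2, "enemy": "E2"}]
print(world_value(footmen, enemies))  # works for any number of units
```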
17
Computing Class-level VC
minimize: Σx Vw(x), where Vw(x) = ΣC wC Σo∈C VC(x[o])
subject to: ΣC wC Σo∈C VC(x[o]) ≥ ΣC wC Σo∈C QC(x[o], a)   (i.e. V(x) ≥ Q(x,a))  ∀ x, a, in every world
The constraints for each world are represented efficiently by the factored LP.
But the number of worlds is exponential, or infinite!
18
Sampling Worlds
Many worlds are similar. Sample a set I of worlds, and enforce the LP constraints only ∀ x, a in the sampled worlds ω ∈ I.
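A toy sketch of the sampling step; the uniform instance counts below are an arbitrary stand-in for whatever distribution over environments the agent expects to face.

```python
# Sample a set I of worlds; the class-level LP only gets constraints
# for the sampled worlds.
import random

random.seed(0)

def sample_world():
    return {"n_footmen": random.randint(1, 3),
            "n_enemies": random.randint(1, 3)}

I = [sample_world() for _ in range(5)]  # the sample set I
for omega in I:
    # In the full algorithm, the factored LP adds the V(x) >= Q(x, a)
    # constraints of this world, expressed compactly over the weights w_C.
    print("constraints for world:", omega)
```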
19
Factored LP-based Generalization
Sample set I of worlds (e.g. one world with E1, F1, E2, F2, E3, F3; another with E1, F1, E2, F2) → class-level factored LP → class value functions VF, VE → generalize to new worlds.
How many samples are needed?
20
Complexity of Sampling Worlds
Exponentially many worlds → need exponentially many samples?
# objects in a world is unbounded → must apply LP decomposition to very large worlds?
NO!
21
(Improved) Theorem
Sample m small worlds of up to O(ln 1/δ) objects each. With m samples, the value function is within O(ε) of the class-level solution optimized for all worlds, with probability at least 1-δ.
RCmax is the maximum class reward.
22
Learning Subclasses of Objects
[Figure: two sampled worlds with per-object value contributions V1 and V2.]
Find regularities between worlds: objects with similar values belong to the same class.
Plan for the sampled worlds separately.
Used decision tree regression in experiments.
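One plausible rendering of the subclass-discovery step, using scikit-learn's decision tree regression as the talk mentions; the object features and per-object values are fabricated for the example, and leaves of the tree play the role of subclasses.

```python
# Discover subclasses: fit a regression tree from object features to the
# per-object value contributions found by planning in sampled worlds, and
# treat the tree's leaves as subclasses.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per object across sampled worlds: [degree, is_root] (made up).
features = np.array([[1, 0], [1, 0], [2, 0], [2, 0], [3, 1], [3, 1]])
values = np.array([1.0, 1.1, 2.0, 2.1, 4.0, 4.2])  # per-object contributions

tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(features, values)
subclass = tree.apply(features)  # leaf index = discovered subclass
print(subclass)  # objects with similar values land in the same leaf
```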
23
Summary of Generalization Algorithm
Offline: from the relational MDP model, sample worlds I and learn class definitions C; the factored LP computes the class-level value function wC.
Online: given a new world, observe state x; the coordination graph computes argmaxa Q(x,a) and acts in the real world.
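For the online step, here is a brute-force sketch of coordinated action selection over local Q factors; the real coordination graph computes the same argmax by variable elimination rather than enumerating joint actions. Factor values are made up.

```python
# Online action selection: argmax_a sum_j Q_j(x, a), where each Q_j touches
# only a few agents. Enumerating joint actions is exponential; shown here
# only for clarity on a 2-agent example.
from itertools import product

agents = ["F1", "F2"]
actions = ["attack", "defend"]

def q1(a):   # local factor: depends on F1 only
    return {"attack": 2.0, "defend": 1.0}[a["F1"]]

def q2(a):   # local factor: depends on both footmen (a coordination edge)
    return 1.5 if a["F1"] == a["F2"] else 0.0

best = max((dict(zip(agents, joint))
            for joint in product(actions, repeat=len(agents))),
           key=lambda a: q1(a) + q2(a))
print(best)  # the coordinated joint action
```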
24
Experimental Results: SysAdmin problem
25
Generalizing to New Problems
26
Classes of Objects Discovered
Learned 3 classes: Leaf, Intermediate, Server.
27
Learning Classes of Objects
28
Strategic / Tactical
29
Strategic 2x2
Offline: relational MDP model with 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs); the factored LP computes the value function Qo.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
30
Strategic 9x3
Offline: relational MDP model with 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs; grows exponentially in # agents); the factored LP computes the value function Qo.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
31
Strategic - Generalization
Offline: relational MDP model; the factored LP computes the class-level value function wC on the small world (2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks; ~1 million state/action pairs) and generalizes to the large one (9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks; ~3 trillion state/action pairs); the instantiated Q-functions grow only polynomially in # agents.
Online: observe state x from the world; the coordination graph computes argmaxa Q(x,a).
32
Tactical
Planned with 3 Footmen versus 3 Enemies (3 vs. 3); generalized to 4 Footmen versus 4 Enemies (4 vs. 4).
33
Conclusions
Relational MDP representation.
Class-level value functions: an efficient linear program optimizes over sampled environments (Theorem: polynomial sample complexity); generalizes from small to large problems.
Learning subclass definitions.
Generalization of value functions to new worlds: avoid replanning; tackle larger worlds.