
1 Max-norm Projections for Factored MDPs Carlos Guestrin Daphne Koller Stanford University Ronald Parr Duke University

2 Motivation MDPs plan over atomic system states; a policy specifies an action at every state; polynomial-time algorithms exist for finding the optimal policy. The catch: the number of states is exponential in the number of state variables.

3 Motivation: BNs meet MDPs Real-world MDPs have: hundreds of variables; googols of states. Can we exploit problem-specific structure? For representation; for planning. Goal: merge BNs and MDPs for efficient computation.

4 Factored MDPs [Boutilier et al.] The transition model is a dynamic Bayesian network over state variables (X, Y, Z at time t; X', Y', Z' at time t+1) [DBN figure omitted]. Total reward is a sum of sub-rewards, each attached to a small part of the model: R = R1 + R2. Actions only change small parts of the model. Value function: the value of a policy starting at state s.

5 Exploiting Structure Model structure may imply a structured value function. Structured value function approach [Boutilier et al. '95]: collapse the value function using a tree [tree figure omitted]; works well only when many states have the same value.

6 Decomposable Value Functions Linear combination of restricted-domain functions [Bellman et al. '63] [Tsitsiklis & Van Roy '96] [Koller & Parr '99,'00]: V(s) ≈ w1 h1(s) + … + wk hk(s) = (Aw)(s), where A is the 2^n × k matrix of basis-function values, one row per state: row s is (h1(s), h2(s), …, hk(s)). Each hi is the status of some small part(s) of a complex system: the status of a machine; the inventory of a store.

7 Our Approach Embed structure into the value function space a priori: project into a structured vector space of factored value functions (linear combinations of structured features); efficiently find the closest approximation to the "true" value.

8 Policy Iteration Guess V; π = greedy(V); V = value of acting on π. Value determination: value = reward + discounted expected value, with a 2^n × 2^n transition matrix and 2^n × 1 reward and value vectors.
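
The slide's equation is an image lost in the transcript; the following is the standard value-determination identity it describes, matching the "value = reward + discounted expected value" annotation:

```latex
V_\pi \;=\; R \;+\; \gamma\, P_\pi V_\pi
\qquad\Longrightarrow\qquad
V_\pi \;=\; (I - \gamma P_\pi)^{-1} R
```

Here P_π is the 2^n × 2^n transition matrix under policy π, and R, V_π are 2^n × 1 vectors, which is exactly why the exact computation is infeasible for large n.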

9 Approximate Policy Iteration Guess w^0; π_t = greedy(A w^t); approximate value determination: find w^{t+1} with A w^{t+1} ≈ value of acting on π_t.

10 Approximate Value Determination Need a projection of the value function into the space spanned by the basis functions: w = argmin_w ‖Aw − V_π‖_d (an L_d projection). Previous work uses L2 and weighted-L2 projections. [Koller & Parr '99, '00]

11 Analysis of Approx. PI Theorem: the loss of the policies produced by approximate policy iteration is bounded in terms of the max-norm projection error [statement shown as a figure, omitted here]. We should be doing projections in max-norm!
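
The theorem's statement was an image; a standard approximate-policy-iteration bound of this form (cf. Bertsekas & Tsitsiklis) is the likely content, and reads:

```latex
\text{If } \;\|A w^{(t)} - V_{\pi_t}\|_\infty \le \beta \;\text{ for all } t,
\qquad\text{then}\qquad
\limsup_{t \to \infty} \,\|V^* - V_{\pi_t}\|_\infty \;\le\; \frac{2\gamma\beta}{(1-\gamma)^2}.
```

That is, the policy loss is controlled by the worst-case (max-norm) projection error β, not by an L2 error — hence the argument for projecting in max-norm.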

12 Approximate PI: Revisited Guess w^0; π_t = greedy(A w^t); approximate value determination: A w^{t+1} ≈ value of acting on π_t. The analysis motivates projections in max-norm; what remains is an efficient algorithm for max-norm projection.

13 Efficient Max-norm Projection Computing max-norm for fixed weights; Cost networks; Efficient max-norm projection.

14 Efficient Max-norm Projection (outline, next up: computing max-norm for fixed weights)

15 Max over Large State Spaces For fixed weights w, compute the max-norm error ‖Aw − v‖∞ (reconstructed below). Naively this means enumerating all states. However, if the basis and target are functions of only a few variables each, we can do it efficiently: cost networks can maximize over large state spaces efficiently when the function is factored.
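
The slide's formula is lost in the transcript; reconstructed from the surrounding definitions, the quantity to compute for fixed w is

```latex
\|A w - v\|_\infty \;=\; \max_{s}\; \Bigl|\, \sum_{i=1}^{k} w_i\, h_i(s) \;-\; v(s) \Bigr|
```

When each h_i and v depend only on a few variables, the expression inside the absolute value is a sum of small factors, so the maximization over s decomposes by variable elimination, as the next slide shows.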

16 Efficient Max-norm Projection (outline, next up: cost networks)

17 Cost Networks Can use variable elimination to maximize over the state space [Bertele & Brioschi '72]. [Figure: a cost network over variables A, B, C, D.] As in Bayesian networks, maximization is exponential only in the size of the largest factor. Here we need only 16 operations instead of 64.
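
As a concrete illustration of the cost-network idea, here is a minimal Python sketch of maximization by variable elimination. The factor structure (f1 over {A,B}, f2 over {A,C}, f3 over {C,D}) and all numbers are invented for the example, not taken from the slide's figure:

```python
# Maximization by variable elimination over a factored function,
# in the spirit of cost networks [Bertele & Brioschi '72].
from itertools import product

def max_out(factors, var):
    """Combine all factors that mention `var`, maximize `var` out, and
    return the remaining factors plus the newly created one.
    Each factor is (vars, table): a tuple of variable names and a dict
    mapping assignments (tuples of 0/1, ordered as `vars`) to values."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_vars = tuple(sorted({v for vs, _ in touching for v in vs} - {var}))
    new_table = {}
    for assign in product([0, 1], repeat=len(new_vars)):
        ctx = dict(zip(new_vars, assign))
        new_table[assign] = max(
            sum(t[tuple({**ctx, var: val}[v] for v in vs)]
                for vs, t in touching)
            for val in (0, 1)
        )
    return rest + [(new_vars, new_table)]

def maximize(factors, order):
    """Max of the sum of all factors over all joint assignments,
    eliminating variables in `order`. Cost is exponential only in the
    size of the largest intermediate factor, not in the state space."""
    for var in order:
        factors = max_out(factors, var)
    return sum(t[()] for _, t in factors)  # only constant factors remain

# Example: max over 4 binary variables (16 states) without enumerating them.
f1 = (("A", "B"), {(a, b): 2 * a - b for a, b in product([0, 1], repeat=2)})
f2 = (("A", "C"), {(a, c): a * c for a, c in product([0, 1], repeat=2)})
f3 = (("C", "D"), {(c, d): c + 3 * d for c, d in product([0, 1], repeat=2)})
print(maximize([f1, f2, f3], order=["B", "D", "A", "C"]))  # -> 7
```

Each elimination touches only the factors mentioning the current variable, which is why the cost depends on the largest induced factor rather than on the full joint state space.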

18 Efficient Max-norm Projection (outline, next up: efficient max-norm projection)

19 Max-norm Projection Algorithm for finding w* = argmin_w ‖Aw − v‖∞. Solve by linear programming [Cheney '82]: minimize φ subject to (Aw)(s) − v(s) ≤ φ and v(s) − (Aw)(s) ≤ φ for every state s.

20 Representing the Constraints The explicit representation is exponential (|S| = 2^n): one pair of constraints per state. If the basis and target are factored, we can use cost networks to represent the constraints compactly.
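
To make the LP concrete, here is a small Python sketch using scipy's linprog, with the constraint set enumerated state by state. The explicit enumeration is exactly what the cost-network representation avoids, so this version only scales to small examples; the function name and the random data are ours:

```python
# Max-norm (Chebyshev) projection as a linear program [Cheney '82].
import numpy as np
from scipy.optimize import linprog

def max_norm_projection(A, v):
    """Return w minimizing ||A w - v||_inf, plus the achieved error.
    A: (num_states, k) matrix of basis values; v: (num_states,) target."""
    n, k = A.shape
    c = np.zeros(k + 1)
    c[-1] = 1.0                                  # minimize phi (last variable)
    ones = np.ones((n, 1))
    A_ub = np.vstack([np.hstack([A, -ones]),     #  (A w)(s) - v(s) <= phi
                      np.hstack([-A, -ones])])   #  v(s) - (A w)(s) <= phi
    b_ub = np.concatenate([v, -v])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (k + 1))
    return res.x[:k], res.x[-1]

# Tiny usage example with random basis values over 32 "states".
rng = np.random.default_rng(0)
A = rng.normal(size=(32, 3))
v = rng.normal(size=32)
w, err = max_norm_projection(A, v)
```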

21 Approximate Policy Iteration Guess w^0; π_t = greedy(A w^t); A w^{t+1} ≈ value of acting on π_t. Policy improvement: how do we represent the policy? How do we update it efficiently?

22 What about the Policy? Contextual action model: each action's DBN differs from the default model in only a few places [DBN figures for default, action 1, and action 2 omitted]. Factored value functions and a factored model yield a compact policy description. The policy forms a decision list: if ⟨context 1⟩ then action 1; else if ⟨context 2⟩ then action 2; else if ⟨context 3⟩ then action 1; … (the contexts were shown in the figure). Theorem [Koller & Parr '00]: the greedy policy for a factored value function under a contextual action model is representable as a decision list.
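
As a sketch of what executing such a policy looks like, the snippet below scans the branches in order and takes the first action whose context matches the state. The contexts here are invented examples; the real lists come out of the policy-improvement step:

```python
# A decision-list policy: the first branch whose context matches fires.
def decision_list_action(state, branches):
    """branches: ordered list of (context, action); context maps variable
    name -> required value, and an empty context matches every state."""
    for context, action in branches:
        if all(state.get(var) == val for var, val in context.items()):
            return action
    raise ValueError("a decision list should end with a default branch")

policy = [({"X": 0, "Y": 0}, "action 1"),
          ({"Y": 0}, "action 2"),
          ({}, "action 1")]            # empty context = default branch
print(decision_list_action({"X": 1, "Y": 0, "Z": 1}, policy))  # -> action 2
```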

23 Factored Policy Iteration: Summary Guess V; π = greedy(V); V = value of acting on π. Structure induces a decision-list policy, and the key operations are isomorphic to Bayesian network inference. Time per iteration is reduced from O((2^n)^3) to O(poly(k, n, C)), where C = the largest factor in the cost network (a function of the structure), k = the number of basis functions (k ≪ 2^n), and poly is the complexity of the LP solver, in practice close to linear.

24 Network Management Problem Computers are connected in a network; each computer can fail with some probability; when a computer fails, it increases the probability that its neighbors fail; at every time step, the sysadmin must decide which computer to fix. Topologies considered: bidirectional ring; ring and star (with a server); star; three legs; ring of rings (with a server).
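
For concreteness, here is a toy Python sketch of these dynamics on a bidirectional ring. All probabilities, the topology size, and the function names are assumptions for illustration, not the paper's experimental parameters:

```python
# Toy one-step dynamics for the network-management problem.
import random

def step(status, neighbors, fix, p_fail=0.05, p_spread=0.3):
    """status[i] = 1 if machine i is up; `fix` is the machine the sysadmin
    repairs this step. A machine's failure probability grows with the
    number of failed neighbors."""
    new = {}
    for i, up in status.items():
        if i == fix:
            new[i] = 1   # the repaired machine comes back up
        elif up:
            p = p_fail + p_spread * sum(1 - status[j] for j in neighbors[i])
            new[i] = 0 if random.random() < min(p, 1.0) else 1
        else:
            new[i] = 0   # failed machines stay down until fixed
    return new

n = 8  # bidirectional ring of n machines
neighbors = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
status = {i: 1 for i in range(n)}
status = step(status, neighbors, fix=3)
```

Note that each machine's next state depends only on itself and its neighbors, which is precisely the factored-transition structure the algorithm exploits.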

25 Comparing Projections in L2 to L∞ [Plot: approximation error for L2 vs. L∞ projections, with single-variable and pair bases.] Max-norm projection is also much more efficient: a single cost network rather than many BN inferences, and use of a very efficient LP package (CPLEX).

26 Results on Larger Problems: Running Time Runs in time O(n^3), not O((2^n)^3).

27 Results on Larger Problems: Error Bounds Error remains bounded

28 Conclusions Max-norm projection directly minimizes the error bounds; the closed-form projection operation provides an exponential reduction in complexity; exploit structure to reduce computation costs; solve very large MDPs efficiently.

29 Future Work POMDPs (IJCAI '01 workshop paper); additional structure: factored actions, relational representations, context-specific independence (CSI); multi-agent systems; linear programming solution for MDPs.

