Stochastic Planning using Decision Diagrams
Sumit Sanghai
Stochastic Planning: the MDP model
- A finite set of states S
- A set of actions A, each with a transition probability matrix
- Fully observable states
- A reward function R associated with each state
Stochastic Planning…
- Goal? A policy that maximizes the expected total discounted reward in an infinite-horizon model. A policy is a mapping from states to actions.
- Problem? The total reward can be infinite.
- Solution? Associate a discount factor β.
Expected Reward Model
Vπ(s) = R(s) + β Σ_t Pr(s, π(s), t) Vπ(t)
Optimality: a policy π is optimal if Vπ(s) ≥ Vπ′(s) for all states s and policies π′.
Theorem: there exists an optimal policy; its value function is denoted V*.
Value Iteration
Vn+1(s) = R(s) + max_a β Σ_t Pr(s, a, t) Vn(t)
V0(s) = R(s)
Stopping condition: if max_s |Vn+1(s) − Vn(s)| < ε(1 − β) / 2β, then Vn+1 is ε/2-close to V*.
Theorem: value iteration yields an optimal policy.
Problem? It can be slow if the state space is too large.
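As a concrete illustration, a minimal tabular value-iteration sketch in Python. The three states, two actions, transition matrices, and rewards are invented for this example; only the update rule and the stopping condition come from the slide:

```python
import numpy as np

# Toy MDP (hypothetical): state 2 is absorbing and rewarding.
R = np.array([0.0, 0.0, 1.0])             # R(s)
P = {                                      # P[a][s, t] = Pr(s, a, t)
    "left":  np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.0],
                       [0.0, 0.9, 0.1]]),
    "right": np.array([[0.1, 0.9, 0.0],
                       [0.0, 0.1, 0.9],
                       [0.0, 0.0, 1.0]]),
}
beta, eps = 0.9, 1e-4

V = R.copy()                               # V0(s) = R(s)
while True:
    # Vn+1(s) = R(s) + max_a beta * sum_t Pr(s,a,t) * Vn(t)
    V_new = R + beta * np.max([Pa @ V for Pa in P.values()], axis=0)
    # stop once Vn+1 is eps/2-close to V*
    if np.max(np.abs(V_new - V)) < eps * (1 - beta) / (2 * beta):
        V = V_new
        break
    V = V_new

# greedy policy with respect to the converged value function
policy = {s: max(P, key=lambda a: P[a][s] @ V) for s in range(len(R))}
```

The explicit |S|×|S| matrices here are exactly what becomes infeasible for large state spaces, which motivates the decision-diagram representation in the following slides.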
Boolean Decision Diagrams
- Graph representation of Boolean functions.
- BDD = decision tree minus redundancies:
  - remove duplicate nodes
  - remove any node whose two child pointers point to the same child
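The two reduction rules can be sketched as follows, using a hypothetical encoding: terminals are the ints 0 and 1, internal nodes are tuples (var, low, high), and a unique table deduplicates nodes. The BDD is built by Shannon expansion:

```python
def mk(var, low, high, unique):
    """Create a reduced BDD node; terminals are the ints 0 and 1."""
    if low == high:                       # rule: drop a node whose two children coincide
        return low
    # rule: reuse an existing node instead of creating a duplicate
    return unique.setdefault((var, low, high), (var, low, high))

def build(f, nvars, unique, env=(), i=0):
    """Build the reduced BDD of f over variables 0..nvars-1 by Shannon expansion."""
    if i == nvars:
        return int(bool(f(*env)))
    low = build(f, nvars, unique, env + (0,), i + 1)
    high = build(f, nvars, unique, env + (1,), i + 1)
    return mk(i, low, high, unique)
```

For example, the parity function x1 ⊕ x2 ⊕ x3 reduces to 5 internal nodes, while x1 ∧ x3 reduces to 2: the test on x2 vanishes because both of its children are identical.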
BDDs…
[Figure: the truth table of a Boolean function f(x1, x2, x3), its full decision tree, and the reduced BDD.]
BDD operations
[Figure: applying 'or' to two BDDs yields the BDD of the disjunction.]
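The 'or' operation can be sketched as a recursive apply over two diagrams (same assumed encoding as above: terminals are the ints 0/1, internal nodes are tuples (var, low, high); no memoization cache, sketch only):

```python
def mk(var, low, high, unique):
    """Create a reduced node; collapse redundant tests, reuse duplicates."""
    if low == high:
        return low
    return unique.setdefault((var, low, high), (var, low, high))

def bdd_or(f, g, unique):
    """Apply 'or' to two reduced BDDs by splitting on the top variable."""
    if f == 1 or g == 1:                     # terminal cases
        return 1
    if f == 0:
        return g
    if g == 0:
        return f
    v = min(f[0], g[0])                      # topmost variable of either diagram
    f0, f1 = (f[1], f[2]) if f[0] == v else (f, f)
    g0, g1 = (g[1], g[2]) if g[0] == v else (g, g)
    return mk(v, bdd_or(f0, g0, unique), bdd_or(f1, g1, unique), unique)
```

A production implementation would memoize on the pair (f, g) so the apply runs in time proportional to the product of the two diagram sizes.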
Variable ordering
(a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3)
With the interleaved ordering a1, b1, a2, b2, a3, b3 the BDD grows linearly; with the ordering a1, a2, a3, b1, b2, b3 it grows exponentially.
BDDs are no magic
There are 2^(2^n) Boolean functions of n variables, but only exponentially many distinct diagrams with polynomially many nodes, so most Boolean functions have no polynomial-size BDD.
ADDs
- BDDs with real-valued leaves.
- Useful for representing probabilities.
MDPs, State Spaces, and ADDs
- Factored MDP: S is characterized by Boolean variables {X1, X2, …, Xn}.
- An action a maps s to s′, i.e. maps {X1, X2, …, Xn} to {X1′, X2′, …, Xn′}.
- Pr(s, a, s′) factors as Π_i Pr_a(Xi′ | X1, X2, …, Xn).
- Each conditional probability can be represented as an ADD.
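An ADD reuses the BDD node shape but with real-valued leaves. A sketch with a hypothetical encoding (internal nodes are tuples (var, low, high), leaves are floats) and a made-up CPT: Pr_a(X1′=true | X1, X2) = 0.9 if X1 else 0.2. Since X2 is irrelevant, the X2 test disappears from the diagram:

```python
# ADD for the made-up CPT above: a single test on X1 (variable index 0).
add_x1_prime = (0, 0.2, 0.9)

def evaluate(add, assignment):
    """Walk an ADD down to its leaf; assignment maps variable index -> 0/1."""
    while isinstance(add, tuple):
        var, low, high = add
        add = high if assignment[var] else low
    return add
```

This is the payoff of the factored representation: a CPT over n Boolean parents has 2^n table entries, but its ADD is only as large as the structure of the function requires.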
Value Iteration Using ADDs
Vn+1(s) = R(s) + β max_a { Σ_t P(s,a,t) Vn(t) }
- R(s) is represented as an ADD.
- P(s,a,t) = P_a(X1′=x1′, …, Xn′=xn′ | X1=x1, …, Xn=xn) = Π_i P_a(Xi′=xi′ | X1=x1, …, Xn=xn)
- Vn+1(X1,…,Xn) = R(X1,…,Xn) + β max_a { Σ_{X1′,…,Xn′} Π_i P_a(Xi′ | X1,…,Xn) Vn(X1′,…,Xn′) }
- The second term on the right-hand side is obtained by quantifying X1′ first: instantiate X1′ as true and as false, multiply its ADD with Vn in each case, and sum the two products to eliminate X1′:
Σ_{X2′,…,Xn′} { Π_{i=2…n} P_a(Xi′ | X1,…,Xn) · ( P_a(X1′=true | X1,…,Xn) Vn(X1′=true, X2′,…,Xn′) + P_a(X1′=false | X1,…,Xn) Vn(X1′=false, X2′,…,Xn′) ) }
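The one-variable-at-a-time elimination above can be mimicked with plain arrays (a made-up two-variable example; P1 and P2 play the role of P_a(X1′ | X1, X2) and P_a(X2′ | X1, X2), stored here as dense tables rather than ADDs):

```python
import numpy as np

# Hypothetical CPTs for one action over two Boolean state variables.
# P1[x1, x2, x1'] = P_a(X1' | X1, X2);  P2[x1, x2, x2'] = P_a(X2' | X1, X2)
P1 = np.array([[[0.7, 0.3], [0.6, 0.4]],
               [[0.2, 0.8], [0.1, 0.9]]])
P2 = np.array([[[0.5, 0.5], [0.9, 0.1]],
               [[0.3, 0.7], [0.4, 0.6]]])
Vn = np.array([[0.0, 1.0],                 # Vn[x1', x2']
               [2.0, 3.0]])

# Eliminate X1' first: T[x1, x2, x2'] = sum_{x1'} P1[x1,x2,x1'] * Vn[x1',x2']
T = np.einsum("abi,ij->abj", P1, Vn)
# Then eliminate X2': Q[x1, x2] = sum_{x2'} P2[x1,x2,x2'] * T[x1,x2,x2']
Q = np.einsum("abj,abj->ab", P2, T)
```

Q equals the expectation term Σ_{X1′,X2′} Π_i P_a(Xi′ | X1, X2) Vn(X1′, X2′); eliminating one primed variable at a time keeps the intermediate tables small, which is exactly the point of the ADD-based elimination order.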
Value Iteration Using ADDs (other possibilities)
- Which variables are necessary? Only those appearing in the value function.
- Order of variables during elimination? The inverse of the variable ordering.
- Problem? Repeated computation of Pr(s, a, t).
- Solution? Precompute P_a(X1′, …, Xn′ | X1, …, Xn) by multiplying the dual action diagrams.
Value Iteration… Space vs. Time
- Full precomputation: huge space required.
- No precomputation: time wasted.
- Solution (something intermediate): divide the variables into sets (a restriction?) and precompute for each set.
- Problem with precomputation: sets may contain variables that do not appear in the value function.
- Alternative: dynamic precomputation.
Experiments
- Goals? SPUDD vs. normal value iteration.
- What is SPI? How is the comparison done?
- Worst case of SPUDD; missing links?
- SPUDD vs. others.
- Space vs. time experiments.
Future Work
- Variable reordering
- Policy iteration
- Approximate ADDs
- A formal model for structure exploitation in BDDs, e.g. symmetry detection
- First-order ADDs
Approximate ADDs
[Figure: an ADD over x1, x2, x3 with leaves 0.9, 1.1, 0.1, 6.7 is approximated by merging 0.9 and 1.1 into the interval leaf [0.9, 1.1]; the leaves 0.1 and 6.7 are kept.]
Approximate ADDs
- Each leaf holds a range [min, max]. What value and error do you associate with that leaf?
- How, and for how long, to merge leaves? max_size vs. max_error:
  - max_size mode: merge the closest pairs of leaves until size < max_size
  - max_error mode: merge pairs as long as the error < max_error
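A max_size-mode merge can be sketched as follows (hypothetical helper; leaves are (lo, hi) intervals, with point leaves having lo == hi):

```python
def merge_leaves(leaves, max_size):
    """Greedily merge the closest interval leaves until at most max_size remain."""
    leaves = sorted(leaves)                      # sort by lower bound
    while len(leaves) > max_size:
        # pick the adjacent pair whose merged span is smallest
        i = min(range(len(leaves) - 1),
                key=lambda k: leaves[k + 1][1] - leaves[k][0])
        lo = leaves[i][0]
        hi = max(leaves[i][1], leaves[i + 1][1])
        leaves[i:i + 2] = [(lo, hi)]
    return leaves
```

Applied to the leaves from the previous slide's figure (0.9, 1.1, 0.1, 6.7) with max_size = 3, this merges 0.9 and 1.1 into [0.9, 1.1] and leaves 0.1 and 6.7 alone; the leaf's reported value could then be the interval's midpoint and its error half the width.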
Approximate Value Iteration
- Vn+1 from Vn? At each leaf, compute with both min and max, e.g. [min1, max1] * [min2, max2] = [min1*min2, max1*max2].
- What about the max_a step? Reduce again.
- When to stop? When the ranges for every state in two consecutive value functions overlap or lie within some tolerance ε.
- How to get the policy? Find the actions that maximize the value function, with each range replaced by its midpoint.
- Convergence?
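The per-leaf interval computations can be sketched as small helpers (the multiplication rule assumes nonnegative endpoints, as the slide's [min1*min2, max1*max2] rule does):

```python
def iadd(a, b):
    """Interval addition: [a0, a1] + [b0, b1] = [a0 + b0, a1 + b1]."""
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    """Interval multiplication, valid because probabilities and values are >= 0."""
    return (a[0] * b[0], a[1] * b[1])

def overlaps(a, b, eps=0.0):
    """Stopping test for one state: do two consecutive ranges overlap within eps?"""
    return a[0] <= b[1] + eps and b[0] <= a[1] + eps
```

Value iteration then stops once overlaps holds for every state's range in two consecutive value functions, and the policy is read off after replacing each range by its midpoint.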
Variable Reordering
- Intuitive ordering: correlated variables should be placed together.
- Random: pick pairs of variables and swap them.
- Rudell's sifting: pick a variable and find a better position for it.
- Experiments: sifting did very well.