
1 Stochastic Planning using Decision Diagrams
Sumit Sanghai

2 Stochastic Planning
MDP model?
Finite set of states S
Set of actions A, each with a transition probability matrix
Fully observable
A reward function R associated with each state

3 Stochastic Planning…
Goal? A policy that maximizes the expected total discounted reward in an infinite-horizon model
A policy is a mapping from states to actions
Problem? The total reward can be infinite
Solution? Associate a discount factor β < 1

4 Expected Reward Model
Vπ(s) = R(s) + β Σt Pr(s, π(s), t) Vπ(t)
Optimality? Policy π is optimal if Vπ(s) ≥ Vπ'(s) for all s and all π'
Thm: There exists an optimal policy; its value function is denoted V*

5 Value Iteration
Vn+1(s) = R(s) + β maxa Σt Pr(s, a, t) Vn(t), with V0(s) = R(s)
Stopping condition: if maxs |Vn+1(s) – Vn(s)| < ε(1-β)/2β then Vn+1 is ε/2-close to V*
Thm: Value iteration converges to an optimal policy
Problem? Can be slow if the state space is too large
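A minimal tabular sketch of this update in Python, assuming an explicit (unfactored) state space where P[s][a] gives the successor distribution; the stopping test is the ε(1-β)/2β bound from the slide. Names are illustrative, not from the paper.

```python
# Tabular value iteration: P[s][a] = {t: Pr(s,a,t)}, R[s] = reward, beta = discount.
def value_iteration(P, R, beta, eps):
    V = dict(R)                                    # V0(s) = R(s)
    while True:
        V_new = {}
        for s in P:
            V_new[s] = R[s] + beta * max(
                sum(p * V[t] for t, p in P[s][a].items()) for a in P[s])
        # If max_s |Vn+1(s) - Vn(s)| < eps*(1-beta)/(2*beta), Vn+1 is eps/2-close to V*.
        if max(abs(V_new[s] - V[s]) for s in P) < eps * (1 - beta) / (2 * beta):
            return V_new
        V = V_new
```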

6 Boolean Decision Diagrams
Graph representation of Boolean functions
BDD = decision tree – redundancies:
Remove duplicate nodes
Remove any node whose two child pointers point to the same child
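A sketch of these two reduction rules as a hash-consing node constructor in Python; the names (make_node, unique) are illustrative, not from the paper.

```python
# The two reduction rules, applied eagerly while building nodes.
unique = {}                      # (var, low, high) -> node id ("duplicate" test)
nodes = {0: 'FALSE', 1: 'TRUE'}  # terminals

def make_node(var, low, high):
    if low == high:              # both children identical: drop the redundant test
        return low
    key = (var, low, high)
    if key not in unique:        # create only if no duplicate exists yet
        unique[key] = len(nodes)
        nodes[unique[key]] = key
    return unique[key]           # duplicates are shared via the table
```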

7 BDDs…
[Figure: truth table for a function f(x1, x2, x3) and the corresponding decision tree reduced to a BDD]

8 BDD operations
[Figure: applying "or" to two BDDs to produce the resulting BDD]
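A hedged sketch of an Apply-style "or" on two ordered BDDs, assuming variables are integers ordered by index; node and apply_or are hypothetical helper names, not SPUDD's API.

```python
# Terminals are the Python booleans; an internal node is a (var, low, high) tuple,
# shared through a table so the reduction rules from the previous slide still hold.
table, memo = {}, {}

def node(var, low, high):
    if low == high:                                   # redundant test
        return low
    return table.setdefault((var, low, high), (var, low, high))

def apply_or(u, v):
    """Bryant-style Apply for 'or': expand on the smaller top variable and recurse."""
    if u is True or v is True:
        return True
    if u is False:
        return v
    if v is False:
        return u
    if (u, v) in memo:
        return memo[(u, v)]
    var = min(u[0], v[0])                             # variables ordered by integer index
    u0, u1 = (u[1], u[2]) if u[0] == var else (u, u)
    v0, v1 = (v[1], v[2]) if v[0] == var else (v, v)
    memo[(u, v)] = node(var, apply_or(u0, v0), apply_or(u1, v1))
    return memo[(u, v)]

x1, x2 = node(1, False, True), node(2, False, True)
print(apply_or(x1, x2))    # (1, (2, False, True), True): the BDD for x1 or x2
```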

9 Variable ordering
(a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3)
Order a1, b1, a2, b2, a3, b3: linear growth
Order a1, a2, a3, b1, b2, b3: exponential growth
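One way to see the effect of the ordering without a BDD package: for each prefix of the ordering, count how many distinct subfunctions remain after fixing the prefix, which roughly corresponds to the diagram's width. This is an illustrative calculation under that assumption, not the paper's experiment.

```python
from itertools import product

# Count the distinct subfunctions left after fixing the first i variables of the order.
def width(f, order):
    n, worst = len(order), 1
    for i in range(n + 1):
        fixed, free = order[:i], order[i:]
        subfns = set()
        for prefix in product((0, 1), repeat=i):
            env = dict(zip(fixed, prefix))
            table = tuple(f({**env, **dict(zip(free, rest))})
                          for rest in product((0, 1), repeat=n - i))
            subfns.add(table)
        worst = max(worst, len(subfns))
    return worst

f = lambda v: (v['a1'] & v['b1']) | (v['a2'] & v['b2']) | (v['a3'] & v['b3'])
print(width(f, ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']))  # interleaved order: stays small
print(width(f, ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']))  # blocked order: 2^3 distinct subfunctions
```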

10 BDDs are no magic
There are 2^{2^n} Boolean functions of n variables
Only exponentially many of them have BDDs with polynomially many nodes
So most functions still require exponentially large BDDs

11 ADDs
ADDs (Algebraic Decision Diagrams) = BDDs with real-valued leaves
Useful for representing probabilities

12 MDP, State Space and ADDs
Factored MDP: S characterized by boolean variables {X1, X2, …, Xn}
Action a from s to s' → a maps {X1, X2, …, Xn} to {X1', X2', …, Xn'}
Pr(s, a, s')? Factored into Pra(Xi' | X1, X2, …, Xn)
Each of these conditional distributions can be represented as an ADD
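A bare-bones sketch of such an ADD in Python, with made-up probabilities for a hypothetical action; it only illustrates the data structure, not SPUDD's actual implementation.

```python
# A minimal ADD: internal nodes test a boolean state variable, leaves hold reals.
class ADD:
    def __init__(self, var=None, low=None, high=None, value=None):
        self.var, self.low, self.high, self.value = var, low, high, value

    def eval(self, state):                       # state: {variable name: bool}
        if self.var is None:                     # leaf
            return self.value
        return (self.high if state[self.var] else self.low).eval(state)

leaf = lambda v: ADD(value=v)
# Pr_a(X1' = true | X1, X2) for some hypothetical action a (numbers are made up):
p_x1 = ADD('X1', low=ADD('X2', low=leaf(0.1), high=leaf(0.6)), high=leaf(0.9))
print(p_x1.eval({'X1': False, 'X2': True}))      # 0.6
```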

13 Value Iteration Using ADDs
Vn+1(s) = R(s) + β maxa { Σt P(s,a,t) Vn(t) }
R(s): an ADD
P(s,a,t) = Pa(X1'=x1', …, Xn'=xn' | X1=x1, …, Xn=xn) = Πi Pa(Xi'=xi' | X1=x1, …, Xn=xn)
⇒ Vn+1(X1, …, Xn) = R(X1, …, Xn) + β maxa { ΣX1',…,Xn' Πi Pa(Xi' | X1, …, Xn) Vn(X1', …, Xn') }
The 2nd term on the RHS is obtained by quantifying X1' first: multiply its ADD with Vn for X1' = true and X1' = false, and sum the two to eliminate X1':
ΣX2',…,Xn' { Πi=2..n Pa(Xi' | X1, …, Xn) · ( Pa(X1'=true | X1, …, Xn) Vn(X1'=true, X2', …, Xn') + Pa(X1'=false | X1, …, Xn) Vn(X1'=false, X2', …, Xn') ) }
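A plain-table sketch of this per-action backup, with Python dicts and functions standing in for the ADDs; it eliminates the primed variables one at a time exactly as described, and the max over actions would be taken outside. Names and structure are illustrative assumptions, not SPUDD's code.

```python
from itertools import product

def regress_action(Vn, R, cpts, beta):
    """One Bellman backup for a single action of a factored MDP.
    Vn, R : dicts from state tuples of booleans to floats.
    cpts  : cpts[i](state) = Pr_a(X_{i+1}' = true | state); plain functions stand
            in here for the ADDs of the paper.
    The primed variables are eliminated one at a time, as on the slide, instead
    of summing over all 2^n successor states in one go."""
    n = len(cpts)
    states = list(product((False, True), repeat=n))
    # G maps (state, values of the not-yet-eliminated primed variables) to a partial sum.
    G = {(x, xp): Vn[xp] for x in states for xp in states}
    for i in range(n):                             # eliminate X_{i+1}'
        new_G = {}
        for x in states:
            p = cpts[i](x)                         # Pr_a(X_{i+1}' = true | x)
            for tail in product((False, True), repeat=n - i - 1):
                new_G[(x, tail)] = (p * G[(x, (True,) + tail)]
                                    + (1 - p) * G[(x, (False,) + tail)])
        G = new_G
    return {x: R[x] + beta * G[(x, ())] for x in states}
```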

14 Value Iteration Using ADDs (other possibilities)
Which variables are necessary? Only those appearing in the value function
Order of variables during elimination? Inverse order
Problem? Repeated computation of Pr(s, a, t) at every iteration
Solution? Precompute Pa(X1', …, Xn' | X1, …, Xn) by multiplying the dual action diagrams

15 Value Iteration… Space vs Time?
Precomputation: huge space required
No precomputation: time wasted
Solution (something intermediate): divide the variables into sets (a restriction?) and precompute per set
Problem with precomputation: work is wasted on sets containing variables that do not appear in the value function
Dynamic precomputation

16 Experiments
Goals? SPUDD vs. normal value iteration
What is SPI? How is the comparison done?
Worst case of SPUDD
Missing links? SPUDD vs. other methods; space-vs-time experiments

17 Future Work
Variable reordering
Policy iteration
Approximate ADDs
Formal model for structure exploitation in BDDs, e.g. symmetry detection
First-order ADDs

18 Approximate ADDs
[Figure: an ADD over x1, x2, x3 with leaves 0.9, 1.1, 0.1, 6.7, approximated by merging the leaves 0.9 and 1.1 into the ranged leaf [0.9, 1.1]]

19 Approximate ADDs
At each leaf node? Store a range [min, max]
What value and error do you associate with that leaf?
How, and until when, do we merge the leaves? max_size vs. max_error
max_size mode: merge the closest pairs of leaves until size < max_size
max_error mode: merge pairs such that the error < max_error
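A greedy sketch of the max_size mode, under the assumption that "closest" means the pair of adjacent leaf ranges whose merged span is smallest; leaves are kept as [lo, hi] ranges so the introduced error stays visible. On the numbers of the previous slide it reproduces the [0.9, 1.1] merge.

```python
# Greedy max_size merging: leaves are kept as [lo, hi] ranges, initially zero-width.
def merge_leaves(values, max_size):
    leaves = sorted([v, v] for v in values)
    while len(leaves) > max_size:
        # merge the adjacent pair whose combined range has the smallest span
        i = min(range(len(leaves) - 1),
                key=lambda k: max(leaves[k][1], leaves[k + 1][1]) - leaves[k][0])
        leaves[i:i + 2] = [[leaves[i][0], max(leaves[i][1], leaves[i + 1][1])]]
    return leaves

print(merge_leaves([0.9, 1.1, 0.1, 6.7], 3))   # [[0.1, 0.1], [0.9, 1.1], [6.7, 6.7]]
```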

20 Approximate Value Iteration
Vn+1 from Vn? At each leaf do the calculation on both min and max, e.g. [min1, max1] * [min2, max2] = [min1*min2, max1*max2]
What about the maxa step? Reduce (re-merge the leaves) again
When to stop? When the ranges for every state in two consecutive value functions overlap or lie within some tolerance ε
How to get a policy? Pick the actions that maximize the value function when each range is replaced by its midpoint
Convergence?
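A small sketch of the interval arithmetic and the stopping test, assuming all quantities involved are non-negative (probabilities, non-negative rewards), which is what makes the slide's endpoint-wise product valid.

```python
# Interval arithmetic on the [min, max] leaves (valid as written only when all
# operands are non-negative).
def i_add(a, b): return (a[0] + b[0], a[1] + b[1])
def i_mul(a, b): return (a[0] * b[0], a[1] * b[1])

def converged(V_old, V_new, eps):
    """True when, for every state, the two consecutive ranges overlap or lie
    within tolerance eps of each other."""
    return all(new[0] <= old[1] + eps and old[0] <= new[1] + eps
               for old, new in zip(V_old, V_new))

print(i_mul((0.5, 1.0), (2.0, 3.0)))                      # (1.0, 3.0)
print(converged([(0.9, 1.1)], [(1.05, 1.2)], eps=0.0))    # True: the ranges overlap
```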

21 Variable reordering
Intuitive ordering: variables that are correlated should be placed together
Random: pick pairs of variables and swap them
Rudell’s sifting: pick a variable, find a better position for it
Experiments: sifting did very well

