Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000).


1 Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000).

2 The Problem. Standard MDP algorithms require explicit state-space enumeration (the curse of dimensionality). Need: a compact representation (intuition: STRIPS). Need: versions of the standard dynamic-programming algorithms that work on that representation.

3 A Glimpse of the Future: Policy Tree and Value Tree (figures).

4 A Glimpse of the Future: Some Experimental Results

5 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

6 MDPs: Reminder. An MDP is given by states, actions, transitions, and rewards. We use the discounted infinite-horizon criterion and stationary policies π (an action to take at each state s). Value functions: V_π^k is the k-stage-to-go value function for π.
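For reference, the k-stage-to-go value function satisfies the usual recurrence (a standard statement, with γ as the discount factor; notation may differ slightly from the slides):

V_\pi^0(s) = R(s), \qquad V_\pi^k(s) = R(s) + \gamma \sum_{t \in S} \Pr(s, \pi(s), t)\, V_\pi^{k-1}(t)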

7 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

8 Representing MDPs as Bayesian Networks: Coffee World. Variables: O (robot is in office), W (robot is wet), U (robot has umbrella), R (it is raining), HCR (robot has coffee), HCO (owner has coffee). Actions: Go (switch location), BuyC (buy coffee), DelC (deliver coffee), GetU (get umbrella). The effects of the actions may be noisy, so we need to provide a distribution for each effect.

9 Representing Actions: DelC. (Figure: decision-tree CPTs for the post-action variables; the leaf probabilities shown include 0 and 0.3.)
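A minimal sketch of how such a tree-structured CPT might be encoded, using a (variable, {value: subtree}) layout; the branching variables and probabilities below are illustrative assumptions, not the slide's exact numbers.

def make_delc_hco_cpt():
    # Hypothetical Pr(HCO' = true) under DelC: if the owner already has coffee it stays
    # true; otherwise delivery can succeed only if the robot is in the office with coffee.
    return ('HCO',
            {True: 1.0,
             False: ('O',
                     {True: ('HCR', {True: 0.8, False: 0.0}),
                      False: 0.0})})

def prob_true(tree, state):
    # Walk the tree using the variable values in `state` until reaching a leaf probability.
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

print(prob_true(make_delc_hco_cpt(), {'HCO': False, 'O': True, 'HCR': True}))  # -> 0.8 (made-up number)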

10 Representing Actions: Interesting Points. No need to provide a marginal distribution over the pre-action variables. Markov property: we need only the previous state. For now, no synchronic arcs. The frame problem? A single network vs. a network for each action. Why decision trees?

11 Representing Reward. Generally determined by a subset of the features.

12 Policies and Value Functions: Policy Tree and Value Tree (figures). The optimal choice may depend only on certain variables (given some others). Internal nodes test features (e.g. HCR=T vs. HCR=F); leaves hold values in the value tree and actions in the policy tree.
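Policy trees and value trees can share the same encoding and the same reader as the CPT sketch above; a tiny made-up example (not the slide's exact trees):

def tree_lookup(tree, state):
    # Same walk as prob_true above, but leaves may hold actions or values.
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

policy_tree = ('HCR', {True: 'DelC', False: 'BuyC'})   # leaves are actions (hypothetical tree)
value_tree  = ('HCR', {True: 10.0, False: 5.0})        # leaves are values (hypothetical tree)
s = {'HCR': True}
print(tree_lookup(policy_tree, s), tree_lookup(value_tree, s))  # -> DelC 10.0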

13 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

14 Bellman Backup (Value Iteration Reminder). Q-function: the value of performing a in s, given value function V.
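In flat form this is the standard backup (γ is the discount factor):

Q_a^V(s) = R(s) + \gamma \sum_{t \in S} \Pr(s, a, t)\, V(t), \qquad V'(s) = \max_a Q_a^V(s)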

15 Structured Value Iteration: Overview. Input: Tree(R). Output: Tree(V*). 1. Set Tree(V^0) = Tree(R). 2. Repeat: (a) compute Tree(Q_a^{k+1}) = Regress(Tree(V^k), a) for each action a; (b) merge (via maximization) the trees Tree(Q_a^{k+1}) to obtain Tree(V^{k+1}); until the termination criterion holds. Return the final value tree.
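The outer loop can be sketched in a few lines of Python; regress, merge_max, and trees_close are assumed helpers (a regression step, a pointwise-max merge, and a convergence test), not the paper's exact interface.

def structured_value_iteration(reward_tree, actions, regress, merge_max, trees_close):
    # Skeleton of structured value iteration over decision trees.
    v_tree = reward_tree                                   # Tree(V^0) = Tree(R)
    while True:
        q_trees = [regress(v_tree, a) for a in actions]    # step (a): Tree(Q_a) for each action
        new_v_tree = merge_max(q_trees)                    # step (b): merge via maximization
        if trees_close(new_v_tree, v_tree):                # termination criterion
            return new_v_tree
        v_tree = new_v_tree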

16 Example World

17 Step 2a: Calculating Q-Functions. 1. Expected future value; 2. discounting the future value; 3. adding the immediate reward. How do we use the structure of the trees? Tree(Q_a^V) should distinguish only the conditions under which a makes a branch of Tree(V) true with different odds.

18 Calculating Tree(Q_a^1). Tree(V_0) and PTree(Q_a^1): finding the conditions under which a will have distinct expected value with respect to V_0. FVTree(Q_a^1): the undiscounted expected future value of performing action a with one stage to go (e.g. the leaf where Z becomes true with probability 1 gets 1*10 + 0*0 = 10). Tree(Q_a^1): discounting FVTree (by 0.9) and adding the immediate reward function.

19 An Alternative View:

20 (A more complicated example.) Figures: Tree(V_1), partial PTree(Q_a), unsimplified PTree(Q_a), FVTree(Q_a), and Tree(Q_a).

21 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified).

22 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified). 2. Construct FVTree(Q_a^V): for each branch b of the PTree, with leaf node l(b): (a) Pr_b = the product of the individual distributions at l(b); (b) v_b = the expected value of Tree(V) under Pr_b; (c) re-label leaf l(b) with v_b.

23 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified). 2. Construct FVTree(Q_a^V): for each branch b of the PTree, with leaf node l(b): (a) Pr_b = the product of the individual distributions at l(b); (b) v_b = the expected value of Tree(V) under Pr_b; (c) re-label leaf l(b) with v_b. 3. Discount FVTree(Q_a^V) by the discount factor and append Tree(R). 4. Return FVTree(Q_a^V).
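A rough Python rendering of these steps, under the simplifying assumption that PTree leaves carry a dict {variable: Pr(var' = true)} (the factors of the product distribution Pr_b), and that pregress and add_trees (leaf-wise addition of two trees) are supplied; a sketch of the idea, not the paper's implementation.

def map_leaves(tree, f):
    # Apply f to every leaf of a decision tree, keeping the branching structure.
    if not isinstance(tree, tuple):
        return f(tree)
    var, children = tree
    return (var, {v: map_leaves(sub, f) for v, sub in children.items()})

def expected_value(value_tree, var_probs):
    # Expected value of value_tree when each tested variable is true independently
    # with the probability recorded in var_probs (branches with probability 0 are skipped).
    if not isinstance(value_tree, tuple):
        return value_tree
    var, children = value_tree
    p = var_probs.get(var, 0.0)
    ev_t = expected_value(children[True], var_probs) if p > 0 else 0.0
    ev_f = expected_value(children[False], var_probs) if p < 1 else 0.0
    return p * ev_t + (1 - p) * ev_f

def regress(value_tree, action, reward_tree, pregress, add_trees, gamma=0.9):
    ptree = pregress(value_tree, action)                                          # step 1
    fvtree = map_leaves(ptree, lambda probs: expected_value(value_tree, probs))   # step 2
    fvtree = map_leaves(fvtree, lambda v: gamma * v)                              # step 3: discount...
    return add_trees(fvtree, reward_tree)                                         #         ...and add Tree(R)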

24 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X).

25 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X). 3. Tree(V)_{X=t}, Tree(V)_{X=f} = the subtrees of Tree(V) for X=t and X=f. 4. PTree_{X=t}, PTree_{X=f} = the results of calling PRegress on those subtrees.

26 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X). 3. Tree(V)_{X=t}, Tree(V)_{X=f} = the subtrees of Tree(V) for X=t and X=f. 4. PTree_{X=t}, PTree_{X=f} = the results of calling PRegress on those subtrees. 5. For each leaf l in Tree_X^a, add PTree_{X=t}, PTree_{X=f}, or both (according to the distribution at l; use union to combine the labels). 6. Return the resulting PTree.
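Continuing the sketch above (and reusing map_leaves from it), one possible rendering of PRegress; cpt_tree_for(X, a) is an assumed accessor for the action's tree-structured CPT of X, and no simplification of redundant tests is attempted.

def append_tree(tree, suffix, combine):
    # Graft `suffix` below every leaf of `tree`, combining leaf labels with `combine`.
    if isinstance(tree, tuple):
        var, children = tree
        return (var, {v: append_tree(sub, suffix, combine) for v, sub in children.items()})
    if isinstance(suffix, tuple):
        var, children = suffix
        return (var, {v: append_tree(tree, sub, combine) for v, sub in children.items()})
    return combine(tree, suffix)

def pregress(value_tree, action, cpt_tree_for):
    # PTree leaves are dicts {variable: Pr(var' = true)}, the factors of Pr_b.
    if not isinstance(value_tree, tuple):
        return {}                                          # step 1: single node, no tests needed
    x, v_children = value_tree
    x_cpt = cpt_tree_for(x, action)                        # step 2: Tree_X^a, leaves give Pr(X')
    pt_true = pregress(v_children[True], action, cpt_tree_for)    # steps 3-4: recurse on subtrees
    pt_false = pregress(v_children[False], action, cpt_tree_for)

    def union(d1, d2):
        merged = dict(d1)
        merged.update(d2)
        return merged

    def attach(p_x_true):
        # Step 5: record Pr(X') at this CPT leaf and pull in whichever sub-PTrees can be
        # reached (X = true if p > 0, X = false if p < 1), unioning their leaf labels.
        out = {x: p_x_true}
        if p_x_true > 0:
            out = append_tree(out, pt_true, union)
        if p_x_true < 1:
            out = append_tree(out, pt_false, union)
        return out

    return map_leaves(x_cpt, attach)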

27 Step 2b: Maximization. Merge the per-action Q-trees via maximization to obtain the new value tree; this completes structured value iteration.
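The merge itself can be sketched as a pairwise graft-and-max over the per-action Q-trees, reusing append_tree from the PRegress sketch; a real implementation would also prune redundant tests.

from functools import reduce

def merge_max(q_trees):
    # Pointwise maximization: graft each Q-tree onto the running result and keep
    # the larger value at every combined leaf.
    return reduce(lambda acc, tree: append_tree(acc, tree, max), q_trees)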

28 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

29 Experimental Results (charts): worst case and best case.

30 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

31 Extensions: synchronic edges, POMDPs, rewards, approximation.

32 Questions?

33 Backup slides Here be dragons.

34 Regression through a Policy

35 Improving Policies: Example

36 Maximization Step, Improved Policy


Download ppt "Stochastic Dynamic Programming with Factored Representations Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)"

Similar presentations


Ads by Google