Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000).


1 Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000).

2 The Problem. Standard MDP algorithms require explicit state-space enumeration (the curse of dimensionality). Need: a compact representation (intuition: STRIPS). Need: versions of the standard dynamic-programming algorithms that work on that representation.

3 A Glimpse of the Future: Policy Tree and Value Tree (figures).

4 A Glimpse of the Future: Some Experimental Results

5 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

6 MDPs: Reminder. An MDP is given by states, actions, transitions, and rewards. We use the discounted infinite-horizon criterion and stationary policies π (an action to take at each state s). Value functions: V_π^k is the k-stage-to-go value function for π.
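For reference, the k-stage-to-go value function satisfies the usual recurrence (a standard statement, with γ as the discount factor; notation may differ slightly from the slides):

V_\pi^0(s) = R(s), \qquad V_\pi^k(s) = R(s) + \gamma \sum_{t \in S} \Pr(s, \pi(s), t)\, V_\pi^{k-1}(t)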

7 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

8 Representing MDPs as Bayesian Networks: Coffee World. Variables: O (robot is in office), W (robot is wet), U (robot has umbrella), R (it is raining), HCR (robot has coffee), HCO (owner has coffee). Actions: Go (switch location), BuyC (buy coffee), DelC (deliver coffee), GetU (get umbrella). The effects of the actions may be noisy, so we need to provide a distribution for each effect.

9 Representing Actions: DelC. (Figure: decision-tree CPTs for the post-action variables; the leaf probabilities shown include 0 and 0.3.)
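A minimal sketch of how such a tree-structured CPT might be encoded, using a (variable, {value: subtree}) layout; the branching variables and probabilities below are illustrative assumptions, not the slide's exact numbers.

def make_delc_hco_cpt():
    # Hypothetical Pr(HCO' = true) under DelC: if the owner already has coffee it stays
    # true; otherwise delivery can succeed only if the robot is in the office with coffee.
    return ('HCO',
            {True: 1.0,
             False: ('O',
                     {True: ('HCR', {True: 0.8, False: 0.0}),
                      False: 0.0})})

def prob_true(tree, state):
    # Walk the tree using the variable values in `state` until reaching a leaf probability.
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

print(prob_true(make_delc_hco_cpt(), {'HCO': False, 'O': True, 'HCR': True}))  # -> 0.8 (made-up number)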

10 Representing Actions: Interesting Points. No need to provide a marginal distribution over the pre-action variables. Markov property: we need only the previous state. For now, no synchronic arcs. The frame problem? A single network vs. a network for each action. Why decision trees?

11 Representing Reward. Generally determined by a subset of the features.

12 Policies and Value Functions: Policy Tree and Value Tree (figures). The optimal choice may depend only on certain variables (given some others). Internal nodes test features (e.g. HCR=T vs. HCR=F); leaves hold values in the value tree and actions in the policy tree.
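Policy trees and value trees can share the same encoding and the same reader as the CPT sketch above; a tiny made-up example (not the slide's exact trees):

def tree_lookup(tree, state):
    # Same walk as prob_true above, but leaves may hold actions or values.
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

policy_tree = ('HCR', {True: 'DelC', False: 'BuyC'})   # leaves are actions (hypothetical tree)
value_tree  = ('HCR', {True: 10.0, False: 5.0})        # leaves are values (hypothetical tree)
s = {'HCR': True}
print(tree_lookup(policy_tree, s), tree_lookup(value_tree, s))  # -> DelC 10.0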

13 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

14 Bellman Backup (Value Iteration Reminder). Q-function: the value of performing a in s, given value function V.
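In flat form this is the standard backup (γ is the discount factor):

Q_a^V(s) = R(s) + \gamma \sum_{t \in S} \Pr(s, a, t)\, V(t), \qquad V'(s) = \max_a Q_a^V(s)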

15 Structured Value Iteration: Overview. Input: Tree(R). Output: Tree(V*). 1. Set Tree(V^0) = Tree(R). 2. Repeat: (a) compute Tree(Q_a^{k+1}) = Regress(Tree(V^k), a) for each action a; (b) merge (via maximization) the trees Tree(Q_a^{k+1}) to obtain Tree(V^{k+1}); until the termination criterion holds. Return the final value tree.
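The outer loop can be sketched in a few lines of Python; regress, merge_max, and trees_close are assumed helpers (a regression step, a pointwise-max merge, and a convergence test), not the paper's exact interface.

def structured_value_iteration(reward_tree, actions, regress, merge_max, trees_close):
    # Skeleton of structured value iteration over decision trees.
    v_tree = reward_tree                                   # Tree(V^0) = Tree(R)
    while True:
        q_trees = [regress(v_tree, a) for a in actions]    # step (a): Tree(Q_a) for each action
        new_v_tree = merge_max(q_trees)                    # step (b): merge via maximization
        if trees_close(new_v_tree, v_tree):                # termination criterion
            return new_v_tree
        v_tree = new_v_tree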

16 Example World

17 Step 2a: Calculating Q-Functions. 1. Expected future value; 2. discounting the future value; 3. adding the immediate reward. How do we use the structure of the trees? Tree(Q_a^V) should distinguish only the conditions under which a makes a branch of Tree(V) true with different odds.

18 Calculating Tree(Q_a^1). Tree(V_0) and PTree(Q_a^1): finding the conditions under which a will have distinct expected value with respect to V_0. FVTree(Q_a^1): the undiscounted expected future value of performing action a with one stage to go (e.g. the leaf where Z becomes true with probability 1 gets 1*10 + 0*0 = 10). Tree(Q_a^1): discounting FVTree (by 0.9) and adding the immediate reward function.

19 An Alternative View:

20 (A more complicated example.) Figures: Tree(V_1), partial PTree(Q_a), unsimplified PTree(Q_a), FVTree(Q_a), and Tree(Q_a).

21 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified).

22 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified). 2. Construct FVTree(Q_a^V): for each branch b of the PTree, with leaf node l(b): (a) Pr_b = the product of the individual distributions at l(b); (b) v_b = the expected value of Tree(V) under Pr_b; (c) re-label leaf l(b) with v_b.

23 The Algorithm: Regress. Input: Tree(V), action a. Output: Tree(Q_a^V). 1. PTree(Q_a^V) = PRegress(Tree(V), a) (simplified). 2. Construct FVTree(Q_a^V): for each branch b of the PTree, with leaf node l(b): (a) Pr_b = the product of the individual distributions at l(b); (b) v_b = the expected value of Tree(V) under Pr_b; (c) re-label leaf l(b) with v_b. 3. Discount FVTree(Q_a^V) by the discount factor and append Tree(R). 4. Return FVTree(Q_a^V).
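A rough Python rendering of these steps, under the simplifying assumption that PTree leaves carry a dict {variable: Pr(var' = true)} (the factors of the product distribution Pr_b), and that pregress and add_trees (leaf-wise addition of two trees) are supplied; a sketch of the idea, not the paper's implementation.

def map_leaves(tree, f):
    # Apply f to every leaf of a decision tree, keeping the branching structure.
    if not isinstance(tree, tuple):
        return f(tree)
    var, children = tree
    return (var, {v: map_leaves(sub, f) for v, sub in children.items()})

def expected_value(value_tree, var_probs):
    # Expected value of value_tree when each tested variable is true independently
    # with the probability recorded in var_probs (branches with probability 0 are skipped).
    if not isinstance(value_tree, tuple):
        return value_tree
    var, children = value_tree
    p = var_probs.get(var, 0.0)
    ev_t = expected_value(children[True], var_probs) if p > 0 else 0.0
    ev_f = expected_value(children[False], var_probs) if p < 1 else 0.0
    return p * ev_t + (1 - p) * ev_f

def regress(value_tree, action, reward_tree, pregress, add_trees, gamma=0.9):
    ptree = pregress(value_tree, action)                                          # step 1
    fvtree = map_leaves(ptree, lambda probs: expected_value(value_tree, probs))   # step 2
    fvtree = map_leaves(fvtree, lambda v: gamma * v)                              # step 3: discount...
    return add_trees(fvtree, reward_tree)                                         #         ...and add Tree(R)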

24 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X).

25 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X). 3. Tree(V)_{X=t}, Tree(V)_{X=f} = the subtrees of Tree(V) for X=t and X=f. 4. PTree_{X=t}, PTree_{X=f} = the results of calling PRegress on those subtrees.

26 The Algorithm: PRegress. Input: Tree(V), action a. Output: PTree(Q_a^V). 1. If Tree(V) is a single node, return the empty tree. 2. X = the variable at the root of Tree(V); Tree_X^a = the tree for CPT_a(X) (label the leaves with X). 3. Tree(V)_{X=t}, Tree(V)_{X=f} = the subtrees of Tree(V) for X=t and X=f. 4. PTree_{X=t}, PTree_{X=f} = the results of calling PRegress on those subtrees. 5. For each leaf l in Tree_X^a, add PTree_{X=t}, PTree_{X=f}, or both (according to the distribution at l; use union to combine the labels). 6. Return the resulting PTree.
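Continuing the sketch above (and reusing map_leaves from it), one possible rendering of PRegress; cpt_tree_for(X, a) is an assumed accessor for the action's tree-structured CPT of X, and no simplification of redundant tests is attempted.

def append_tree(tree, suffix, combine):
    # Graft `suffix` below every leaf of `tree`, combining leaf labels with `combine`.
    if isinstance(tree, tuple):
        var, children = tree
        return (var, {v: append_tree(sub, suffix, combine) for v, sub in children.items()})
    if isinstance(suffix, tuple):
        var, children = suffix
        return (var, {v: append_tree(tree, sub, combine) for v, sub in children.items()})
    return combine(tree, suffix)

def pregress(value_tree, action, cpt_tree_for):
    # PTree leaves are dicts {variable: Pr(var' = true)}, the factors of Pr_b.
    if not isinstance(value_tree, tuple):
        return {}                                          # step 1: single node, no tests needed
    x, v_children = value_tree
    x_cpt = cpt_tree_for(x, action)                        # step 2: Tree_X^a, leaves give Pr(X')
    pt_true = pregress(v_children[True], action, cpt_tree_for)    # steps 3-4: recurse on subtrees
    pt_false = pregress(v_children[False], action, cpt_tree_for)

    def union(d1, d2):
        merged = dict(d1)
        merged.update(d2)
        return merged

    def attach(p_x_true):
        # Step 5: record Pr(X') at this CPT leaf and pull in whichever sub-PTrees can be
        # reached (X = true if p > 0, X = false if p < 1), unioning their leaf labels.
        out = {x: p_x_true}
        if p_x_true > 0:
            out = append_tree(out, pt_true, union)
        if p_x_true < 1:
            out = append_tree(out, pt_false, union)
        return out

    return map_leaves(x_cpt, attach)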

27 Step 2b: Maximization. Merge the per-action Q-trees via maximization to obtain the new value tree; this completes structured value iteration.
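The merge itself can be sketched as a pairwise graft-and-max over the per-action Q-trees, reusing append_tree from the PRegress sketch; a real implementation would also prune redundant tests.

from functools import reduce

def merge_max(q_trees):
    # Pointwise maximization: graft each Q-tree onto the running result and keep
    # the larger value at every combined leaf.
    return reduce(lambda acc, tree: append_tree(acc, tree, max), q_trees)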

28 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

29 Experimental Results (charts): worst case and best case.

30 Roadmap: MDPs (reminder); Structured Representation for MDPs: Bayesian Nets, Decision Trees; Algorithms for the Structured Representation; Experimental Results; Extensions.

31 Extensions: synchronic edges, POMDPs, rewards, approximation.

32 Questions?

33 Backup slides Here be dragons.

34 Regression through a Policy

35 Improving Policies: Example

36 Maximization Step, Improved Policy


Download ppt "Stochastic Dynamic Programming with Factored Representations Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)"

Similar presentations


Ads by Google