Stochastic Dynamic Programming with Factored Representations
(Boutilier, Dearden, Goldszmidt 2000)
Presentation by Dafna Shahaf
The Problem
- Standard MDP algorithms require explicit state-space enumeration
- Curse of dimensionality
- Need: a compact representation (intuition: STRIPS)
- Need: versions of the standard dynamic programming algorithms for it
A Glimpse of the Future
(Figure: an example policy tree and value tree)
A Glimpse of the Future: Some Experimental Results
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
MDPs: Reminder
- An MDP is a tuple ⟨S, A, Pr, R⟩ (states, actions, transitions, rewards)
- Discounted infinite-horizon criterion
- Stationary policies π: S → A (an action to take at state s)
- Value functions: V_π^k is the k-stage-to-go value function for π
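For reference, the k-stage-to-go value function can be stated recursively; a minimal sketch, using γ as the discount factor and Pr(s, a, t) as the transition probability (notation assumed here, not fixed by the slide):

```latex
V_{\pi}^{0}(s) = R(s), \qquad
V_{\pi}^{k}(s) = R(s) + \gamma \sum_{t \in S} \Pr\big(s, \pi(s), t\big)\, V_{\pi}^{k-1}(t)
```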
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Representing MDPs as Bayesian Networks: Coffee World
Variables:
- O: robot is in office
- W: robot is wet
- U: robot has umbrella
- R: it is raining
- HCR: robot has coffee
- HCO: owner has coffee
Actions:
- Go: switch location
- BuyC: buy coffee
- DelC: deliver coffee
- GetU: get umbrella
The effects of the actions may be noisy, so we need to provide a probability distribution for each effect.
Representing Actions: DelC
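A minimal sketch of how the CPT for one post-action variable can be stored as a decision tree instead of a full table; the nested-tuple encoding, the helper prob_true, and the probabilities are illustrative assumptions rather than the paper's exact figures:

```python
# Decision-tree CPT for HCO' under action DelC (illustrative numbers).
# Internal node: (variable, subtree_if_true, subtree_if_false); leaf: Pr(HCO' = True).
CPT_DELC_HCO = ("HCO", 1.0,                # owner already has coffee: it stays that way
                ("O", ("HCR", 0.8, 0.0),   # in office and holding coffee: delivery usually succeeds
                      0.0))                # not in office: no delivery

def prob_true(tree, state):
    """Walk the tree using the pre-action state (a dict of variable -> bool)."""
    while not isinstance(tree, float):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

print(prob_true(CPT_DELC_HCO, {"HCO": False, "O": True, "HCR": True}))  # 0.8
```

The tree only tests the variables that actually influence HCO', which is what keeps the representation compact.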
Representing Actions: Interesting Points
- No need to provide a marginal distribution over pre-action variables
- Markov property: we need only the previous state
- For now, no synchronic arcs
- Frame problem?
- A single network vs. a network for each action
- Why decision trees?
Representing Reward
Generally determined by a subset of the features.
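For instance, a sketch with made-up numbers: the reward can be stored in the same tree form, testing only the features it depends on:

```python
# Illustrative reward tree: reward depends only on HCO and W.
# Internal node: (variable, subtree_if_true, subtree_if_false); leaf: reward value.
REWARD_TREE = ("HCO", ("W", 9.0, 10.0),   # owner has coffee (slightly less if the robot got wet)
                      ("W", -1.0, 0.0))   # owner has no coffee
```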
Policies and Value Functions
The optimal choice may depend only on certain variables (given the values of some others).
(Figure: a policy tree and a value tree; internal nodes test features such as HCR = T / HCR = F, and the leaves hold actions and values respectively.)
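A minimal sketch of the tree encoding assumed throughout these notes (internal node = (variable, true-branch, false-branch); leaf = an action or a value); the concrete trees and numbers are illustrative:

```python
# Illustrative policy and value trees over the coffee-world features.
POLICY_TREE = ("HCR", ("O", "DelC", "Go"),     # holding coffee: deliver if in office, else head there
                      ("O", "Go", "BuyC"))     # no coffee: go to the shop, or buy if already there
VALUE_TREE = ("HCO", 10.0, ("HCR", 6.3, 5.2))  # made-up values

def lookup(tree, state):
    """Follow the feature tests until a leaf (an action or a value) is reached."""
    while isinstance(tree, tuple):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

s = {"HCR": True, "O": False, "HCO": False}
print(lookup(POLICY_TREE, s), lookup(VALUE_TREE, s))  # Go 6.3
```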
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Bellman Backup
Q-function: the value of performing a in s, given value function V:
  Q_V^a(s) = R(s) + γ Σ_t Pr(s, a, t) V(t)
Value iteration (reminder): V^{k+1}(s) = max_a Q_{V^k}^a(s), with V^0 = R.
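For contrast with the structured version introduced next, a minimal flat value-iteration sketch over an explicitly enumerated state space (a toy two-state MDP, not the coffee world):

```python
# Flat value iteration: V(s) <- max_a [ R(s) + gamma * sum_t Pr(s, a, t) * V(t) ].
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]
R = {0: 0.0, 1: 1.0}
P = {  # P[(s, a)] = list of (next_state, probability)
    (0, "stay"): [(0, 1.0)], (0, "move"): [(1, 0.8), (0, 0.2)],
    (1, "stay"): [(1, 1.0)], (1, "move"): [(0, 0.8), (1, 0.2)],
}

def value_iteration(n_iters=100):
    v = {s: 0.0 for s in STATES}
    for _ in range(n_iters):
        q = {(s, a): R[s] + GAMMA * sum(p * v[t] for t, p in P[(s, a)])  # Bellman backup
             for s in STATES for a in ACTIONS}
        v = {s: max(q[(s, a)] for a in ACTIONS) for s in STATES}         # maximize over actions
    return v

print(value_iteration())
```

Every state is touched on every iteration, which is exactly what the structured algorithm below avoids.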
Structured Value Iteration: Overview
Input: Tree(R). Output: Tree(V*).
1. Set Tree(V^0) = Tree(R)
2. Repeat
   (a) Compute Tree(Q_a^{V^k}) = Regress(Tree(V^k), a) for each action a
   (b) Merge (via maximization) the trees Tree(Q_a^{V^k}) to obtain Tree(V^{k+1})
   until the termination criterion holds
3. Return Tree(V*)
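A sketch of just this outer loop, with the Regress and maximization steps passed in as functions (hypothetical signatures; concrete sketches of both appear after the later slides):

```python
def structured_value_iteration(reward_tree, actions, regress, merge_max, n_iters=50):
    """Outer loop only; regress(tree, a) and merge_max(trees) are supplied by the caller."""
    v_tree = reward_tree                                  # Tree(V^0) = Tree(R)
    for _ in range(n_iters):                              # termination criterion simplified to a fixed count
        q_trees = [regress(v_tree, a) for a in actions]   # step (a): Tree(Q_a^{V^k}) per action
        v_tree = merge_max(q_trees)                       # step (b): merge via maximization
    return v_tree
```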
Example World
Step 2a: Calculating Q-Functions
1. Expected future value
2. Discounting the future value
3. Adding the immediate reward
How do we use the structure of the trees? Tree(Q_a^V) should distinguish only those conditions under which a makes the branches of Tree(V) true with different probabilities.
Calculating Tree(Q_a^{V^0}):
- Tree(V^0) and PTree(Q_a^{V^0}): finding the conditions under which a will have distinct expected value with respect to V^0
- FVTree(Q_a^{V^0}): the undiscounted expected future value of performing action a with one stage to go
- Tree(Q_a^{V^0}): discounting FVTree (by 0.9) and adding the immediate reward function
(Example leaf Z: expected future value = 1*10 + 0*0 = 10)
An Alternative View:
A more complicated example (figure): Tree(V^1), partial PTree(Q_a^{V^1}), unsimplified PTree(Q_a^{V^1}), FVTree(Q_a^{V^1}), Tree(Q_a^{V^1})
The Algorithm: Regress
Input: Tree(V), action a. Output: Tree(Q_a^V).
1. PTree(Q_a^V) = PRegress(Tree(V), a)   (simplified)
2. Construct FVTree(Q_a^V): for each branch b of PTree, with leaf node l(b):
   (a) Pr_b = the product of the individual distributions at l(b)
   (b) v_b = Σ over branches b' of Tree(V) of Pr_b(b') * V(b')
   (c) relabel leaf l(b) with v_b
3. Discount FVTree(Q_a^V) by γ and append Tree(R)
4. Return FVTree(Q_a^V)
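A minimal sketch of step 2 at a single PTree leaf, assuming the leaf label is a dict of independent marginals Pr(variable becomes true) for the variables tested in Tree(V) (which is what the product of individual distributions amounts to):

```python
# Step 2 of Regress (sketch): expected future value at one PTree leaf.
def expected_future_value(value_tree, marginals):
    """Sum over the branches of Tree(V), weighting each leaf value by the product of marginals."""
    if not isinstance(value_tree, tuple):        # leaf of Tree(V): a value
        return value_tree
    var, if_true, if_false = value_tree
    p = marginals[var]
    return (p * expected_future_value(if_true, marginals)
            + (1.0 - p) * expected_future_value(if_false, marginals))

# Mirrors the earlier "1*10 + 0*0" style computation (illustrative tree and marginal):
print(expected_future_value(("HCO", 10.0, 0.0), {"HCO": 1.0}))  # 10.0
```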
The Algorithm: PRegress
Input: Tree(V), action a. Output: PTree(Q_a^V).
1. If Tree(V) is a single node, return the empty tree
2. X = the variable at the root of Tree(V); T_X = the tree for CPT_a(X) (label its leaves with the distribution over X)
3. T_{X=t}, T_{X=f} = the subtrees of Tree(V) for X = t, X = f
4. P_{X=t}, P_{X=f} = the result of calling PRegress on T_{X=t}, T_{X=f}
5. For each leaf l in T_X, append P_{X=t}, P_{X=f}, or both (according to the distribution at l; use union to combine the leaf labels)
6. Return the resulting tree
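A sketch of PRegress over the nested-tuple trees used above, with CPTs supplied as a dict from variable name to its decision-tree CPT for action a; leaf labels come out as dicts {post-action variable: Pr(true)}. The simplification step (pruning redundant or contradictory tests) is omitted, so this corresponds to the unsimplified PTree:

```python
def pregress(value_tree, cpts):
    """Return a tree testing pre-action variables, with leaves labelled by marginals
    over the post-action variables that Tree(V) mentions."""
    if not isinstance(value_tree, tuple):           # step 1: a single node -> empty tree
        return {}
    x, v_true, v_false = value_tree                 # step 2: X = root variable of Tree(V)
    p_true = pregress(v_true, cpts)                 # step 4: recurse on the X = t / X = f subtrees
    p_false = pregress(v_false, cpts)

    def attach(cpt_node):                           # step 5: walk the CPT tree for X ...
        if isinstance(cpt_node, tuple):
            var, if_true, if_false = cpt_node
            return (var, attach(if_true), attach(if_false))
        p = cpt_node                                # CPT leaf: Pr(X' = True)
        subtrees = []
        if p > 0.0:
            subtrees.append(p_true)                 # X can become true: the X = t subtree matters
        if p < 1.0:
            subtrees.append(p_false)                # X can become false: the X = f subtree matters
        return merge_all(subtrees, {x: p})          # ... appending the needed subtrees at each leaf

    return attach(cpts[x])

def merge_all(trees, label):
    """Graft each tree beneath the previous one, starting from a bare leaf label."""
    result = label
    for tree in trees:
        result = graft(result, tree)
    return result

def graft(tree, other):
    """Append `other` beneath every leaf of `tree`, unioning the leaf dicts."""
    if isinstance(tree, tuple):
        var, if_true, if_false = tree
        return (var, graft(if_true, other), graft(if_false, other))
    if isinstance(other, tuple):
        var, if_true, if_false = other
        return (var, graft(tree, if_true), graft(tree, if_false))
    return {**tree, **other}

# Tiny usage example (illustrative CPT; Tree(V) tests only HCO):
cpts = {"HCO": ("HCO", 1.0, ("O", ("HCR", 0.8, 0.0), 0.0))}
print(pregress(("HCO", 10.0, 0.0), cpts))
# ('HCO', {'HCO': 1.0}, ('O', ('HCR', {'HCO': 0.8}, {'HCO': 0.0}), {'HCO': 0.0}))
```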
Step 2b: Maximization
Merge the trees Tree(Q_a^{V^k}) by taking, at each leaf, the maximum value over the actions, yielding Tree(V^{k+1}).
Value iteration is complete.
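A sketch of the maximization merge over the same nested-tuple trees; like the unsimplified trees earlier, the result can contain redundant tests, and a fuller version would also record the maximizing action to recover the policy tree:

```python
def merge_max(q_trees):
    """Combine the Tree(Q_a) trees into Tree(V'), taking the larger value wherever they overlap."""
    result = q_trees[0]
    for tree in q_trees[1:]:
        result = maximize(result, tree)
    return result

def maximize(a, b):
    if isinstance(a, tuple):
        var, if_true, if_false = a
        return (var, maximize(if_true, b), maximize(if_false, b))
    if isinstance(b, tuple):
        var, if_true, if_false = b
        return (var, maximize(a, if_true), maximize(a, if_false))
    return max(a, b)                       # both are leaf values: keep the larger one

# Usage with two illustrative Q-trees:
print(merge_max([("HCO", 10.0, 5.0), ("HCR", 8.0, 3.0)]))
# ('HCO', ('HCR', 10.0, 10.0), ('HCR', 8.0, 5.0))
```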
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Experimental Results (figures)
- Worst case
- Best case
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Extensions
- Synchronic edges
- POMDPs
- Rewards
- Approximation
Questions?
Backup slides Here be dragons.
Regression through a Policy
Improving Policies: Example
Maximization Step, Improved Policy