Stochastic Dynamic Programming with Factored Representations. Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)


The Problem
Standard MDP algorithms require explicit enumeration of the state space and thus suffer from the curse of dimensionality.
Need: a compact representation (intuition: STRIPS).
Need: versions of the standard dynamic programming algorithms that operate on that representation.

A Glimpse of the Future: a policy tree and a value tree (shown on the slide).

A Glimpse of the Future: Some Experimental Results

Roadmap: MDPs - Reminder; Structured Representation for MDPs (Bayesian Nets, Decision Trees); Algorithms for Structured Representation; Experimental Results; Extensions.

MDPs - Reminder
An MDP is given by states, actions, transitions, and rewards.
We use the discounted infinite-horizon criterion and stationary policies (a policy assigns an action to take at each state s).
Value functions: V^k_π is the k-stage-to-go value function of policy π.
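For concreteness, the recursion behind that notation can be written as follows (a standard formulation consistent with the slides; γ is the discount factor and Pr the transition model):

```latex
V^{0}_{\pi}(s) = R(s), \qquad
V^{k}_{\pi}(s) = R(s) + \gamma \sum_{t \in S} \Pr\bigl(s, \pi(s), t\bigr)\, V^{k-1}_{\pi}(t), \qquad
V_{\pi}(s) = \lim_{k \to \infty} V^{k}_{\pi}(s)
```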

Roadmap: MDPs - Reminder; Structured Representation for MDPs (Bayesian Nets, Decision Trees); Algorithms for Structured Representation; Experimental Results; Extensions.

Representing MDPs as Bayesian Networks: Coffee World
Features: O (robot is in office), W (robot is wet), U (robot has umbrella), R (it is raining), HCR (robot has coffee), HCO (owner has coffee).
Actions: Go (switch location), BuyC (buy coffee), DelC (deliver coffee), GetU (get umbrella).
The effects of the actions may be noisy, so we need to provide a distribution for each effect.

Representing Actions: DelC
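The figure for this slide (a two-slice DBN whose CPTs are decision trees) is not reproduced in the transcript. As a rough illustration of the idea, the sketch below encodes a tree-structured CPT for the post-action value of HCO under DelC; the tree shape follows the slide's spirit, but the probabilities are illustrative placeholders rather than the paper's numbers.

```python
def p_hco_true_after_delc(state):
    """Tree-structured CPT: Pr(HCO' = true | state, DelC).
    `state` maps feature names (O, HCR, HCO, ...) to booleans.
    The probabilities below are illustrative placeholders."""
    if state["HCO"]:                     # owner already has coffee: it stays that way
        return 1.0
    if state["O"] and state["HCR"]:      # robot is in the office, holding coffee
        return 0.8                       # delivery usually succeeds (noisy effect)
    return 0.0                           # otherwise DelC cannot give the owner coffee

example_state = {"O": True, "HCR": True, "HCO": False, "W": False, "U": False, "R": False}
print(p_hco_true_after_delc(example_state))   # 0.8
```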

Representing Actions: Interesting Points
- No need to provide a marginal distribution over the pre-action variables.
- Markov property: we need only the previous state.
- For now, no synchronic arcs.
- What about the frame problem?
- A single network vs. one network per action.
- Why decision trees?

Representing Reward: generally determined by a subset of the features.
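Such a reward can itself be stored as a small tree over just those features. A minimal sketch (the feature names follow the coffee world; the numeric values are illustrative placeholders, not the paper's):

```python
def reward(state):
    """Tree-structured reward: depends only on HCO and W (illustrative values)."""
    if state["HCO"]:
        return 0.9 if state["W"] else 1.0    # coffee delivered, small penalty if wet
    return -0.1 if state["W"] else 0.0       # no coffee delivered

print(reward({"HCO": True, "W": False}))     # 1.0
```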

Policies and Value Functions
A policy can be written as a policy tree (interior nodes test features such as HCR = T / HCR = F, leaves are actions) and a value function as a value tree (leaves are values); the slide shows one of each.
The optimal choice may depend only on certain variables, given the values of some others.
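One minimal data structure for these trees, as an illustrative sketch rather than the paper's implementation: interior nodes test a boolean feature, and leaves hold either an action (policy tree) or a value (value tree).

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: object                  # an action name (policy tree) or a value (value tree)

@dataclass
class Node:
    feature: str                   # boolean feature tested at this node, e.g. "HCR"
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]

def evaluate(tree, state):
    """Walk the tree using the boolean feature assignment in `state`."""
    while isinstance(tree, Node):
        tree = tree.if_true if state[tree.feature] else tree.if_false
    return tree.label

# Illustrative value tree: value 10 if the owner has coffee, else 0.
value_tree = Node("HCO", Leaf(10.0), Leaf(0.0))
print(evaluate(value_tree, {"HCO": False}))   # 0.0
```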

Roadmap: MDPs - Reminder; Structured Representation for MDPs (Bayesian Nets, Decision Trees); Algorithms for Structured Representation; Experimental Results; Extensions.

Bellman Backup (Value Iteration - Reminder)
Q-function: the value of performing action a in state s, given value function v.
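In standard flat-MDP notation, consistent with the reminder earlier in the deck, the backup and the value-iteration update are:

```latex
Q^{V}_{a}(s) = R(s) + \gamma \sum_{t \in S} \Pr(s, a, t)\, V(t), \qquad
V^{k+1}(s) = \max_{a \in A} Q^{V^{k}}_{a}(s)
```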

Structured Value Iteration - Overview
Input: Tree(R). Output: Tree(V*).
1. Set Tree(V^0) = Tree(R).
2. Repeat
   (a) Compute Tree(Q_a^{k+1}) = Regress(Tree(V^k), a) for each action a.
   (b) Merge (via maximization) the trees Tree(Q_a^{k+1}) to obtain Tree(V^{k+1}).
   Until the termination criterion holds.
Return Tree(V^{k+1}).
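As a control-flow sketch only (not the paper's code): `regress` and `max_merge` below stand for the Regress step and the maximization step described on the following slides and are passed in as assumed helpers; termination is reduced to a naive "value tree unchanged" test.

```python
def structured_value_iteration(reward_tree, actions, regress, max_merge,
                               max_iters=1000):
    """Skeleton of structured value iteration over tree-structured values.

    regress(value_tree, action) -> Tree(Q_a), the regressed Q-tree
    max_merge(q_trees)          -> value tree maximizing over the actions
    """
    v_tree = reward_tree                                  # Tree(V^0) = Tree(R)
    for _ in range(max_iters):
        q_trees = [regress(v_tree, a) for a in actions]   # Tree(Q_a^{k+1})
        new_v_tree = max_merge(q_trees)                   # Tree(V^{k+1})
        if new_v_tree == v_tree:                          # naive termination test
            break
        v_tree = new_v_tree
    return v_tree
```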

Example World

Step 2a: Calculating Q-Functions
1. Expected future value. 2. Discounting the future value. 3. Adding the immediate reward.
How do we use the structure of the trees? Tree(Q_a) should distinguish only conditions under which a makes a branch of Tree(V) true with different odds.

Calculating Tree(Q_a^1)
- Tree(V^0): the value tree we regress through.
- PTree(Q_a^1): finds the conditions under which a has distinct expected value with respect to V^0.
- FVTree(Q_a^1): the undiscounted expected future value of performing action a with one stage to go.
- Tree(Q_a^1): obtained by discounting FVTree (by 0.9) and adding the immediate reward function.
The leaf annotated Z works out to 1*10 + 0*0.
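Written out, that leaf computation is a one-line expectation (the example world's reward tree pays 10 when Z holds):

```latex
v_b = \Pr(Z = \mathrm{t}) \cdot 10 + \Pr(Z = \mathrm{f}) \cdot 0 = 1 \cdot 10 + 0 \cdot 0 = 10
```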

An Alternative View:

(A more complicated example.) The slide steps through Tree(V^1), the partial PTree(Q_a^2), the unsimplified PTree(Q_a^2), FVTree(Q_a^2), and Tree(Q_a^2).

The Algorithm: Regress
Input: Tree(V), action a. Output: Tree(Q_a).
1. PTree(Q_a) = PRegress(Tree(V), a) (simplified).
2. Construct FVTree(Q_a): for each branch b of PTree, with leaf node l(b):
   (a) Pr_b = the product of the individual distributions at l(b);
   (b) v_b = the expectation of Tree(V) under Pr_b;
   (c) re-label leaf l(b) with v_b.
3. Discount FVTree(Q_a) by γ, and append (add) Tree(R).
4. Return FVTree(Q_a).
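Step 2 is the numerical core of Regress. A small self-contained sketch of that expectation (not the paper's code): the value tree is flattened here into (partial assignment, value) branch pairs, and the marginals at the PTree leaf are treated as independent, as in the no-synchronic-arcs case.

```python
from itertools import product

def expected_future_value(leaf_marginals, value_branches):
    """Step 2 of Regress, sketched: a leaf of PTree(Q_a) carries independent
    marginals Pr(X = true) for the variables mentioned in Tree(V); its value
    v_b is the expectation of Tree(V) under the product of those marginals."""
    variables = list(leaf_marginals)
    v_b = 0.0
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        p = 1.0                                   # Pr_b of this joint assignment
        for var, val in assignment.items():
            p *= leaf_marginals[var] if val else 1.0 - leaf_marginals[var]
        for branch, value in value_branches:      # branch of Tree(V) consistent with it
            if all(assignment[x] == b for x, b in branch.items()):
                v_b += p * value
                break
    return v_b

# The slide's leaf: Pr(Z = true) = 1 after the action; Tree(V0) pays 10 when Z holds.
v0_branches = [({"Z": True}, 10.0), ({"Z": False}, 0.0)]
print(expected_future_value({"Z": 1.0}, v0_branches))   # 1*10 + 0*0 = 10.0
```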

The Algorithm: PRegress
Input: Tree(V), action a. Output: PTree(Q_a).
1. If Tree(V) is a single node, return the empty tree.
2. Let X be the variable at the root of Tree(V), and let T_X be the tree for CPT_a(X) (label its leaves with the distribution over X).
3. Let T_t, T_f be the subtrees of Tree(V) for X = t and X = f.
4. Call PRegress on T_t and T_f to obtain PT_t and PT_f.
5. For each leaf l of T_X, append PT_t, PT_f, or both, according to the distribution at l (use union to combine the leaf labels).
6. Return T_X.

Step 2b: Maximization. Merging the Tree(Q_a) trees via maximization yields Tree(V^{k+1}); with this, structured value iteration is complete.
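A sketch of the merge under a simplifying assumption: each Tree(Q_a) has already been flattened into a mapping from abstract states (partial assignments) to Q-values over a common set of abstract states. The real algorithm merges the trees directly, but the bookkeeping is the same.

```python
def max_merge(q_tables):
    """Maximization step, sketched on flattened trees.

    q_tables: dict mapping each action name to a dict {abstract_state: q_value},
    where abstract_state is a frozenset of (feature, value) pairs. Assumes all
    actions were flattened over the same abstract states."""
    value, policy = {}, {}
    for action, table in q_tables.items():
        for abstract_state, q in table.items():
            if abstract_state not in value or q > value[abstract_state]:
                value[abstract_state] = q
                policy[abstract_state] = action
    return value, policy

# Illustrative use: two actions, abstract states defined by Z only.
z_true, z_false = frozenset({("Z", True)}), frozenset({("Z", False)})
q = {"a1": {z_true: 19.0, z_false: 9.0}, "a2": {z_true: 10.0, z_false: 0.0}}
print(max_merge(q))   # a1 is preferred everywhere in this toy example
```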

Roadmap: MDPs - Reminder; Structured Representation for MDPs (Bayesian Nets, Decision Trees); Algorithms for Structured Representation; Experimental Results; Extensions.

Experimental Results: a worst-case example and a best-case example (plots shown on the slides).

Roadmap: MDPs - Reminder; Structured Representation for MDPs (Bayesian Nets, Decision Trees); Algorithms for Structured Representation; Experimental Results; Extensions.

Extensions: synchronic edges, POMDPs, rewards, approximation.

Questions?

Backup slides. Here be dragons.

Regression through a Policy

Improving Policies: Example

Maximization Step, Improved Policy