Symbolic Dynamic Programming

Presentation transcript:

Symbolic Dynamic Programming
Alan Fern*
* Based in part on slides by Craig Boutilier

Planning in Large State Space MDPs
You have learned algorithms for computing optimal policies: value iteration and policy iteration. These algorithms explicitly enumerate the state space, which is often impractical. Simulation-based planning and RL allowed for approximate planning in large MDPs, but they did not use an explicit model of the MDP, only a strong or weak simulator. How can we get exact solutions to enormous MDPs?

Structured Representations
Policy iteration and value iteration treat states as atomic entities with no internal structure. In most cases, states actually do have internal structure, e.g. they are described by a set of state variables, or by objects with properties and relationships. Humans exploit this structure to plan effectively. What if we had a compact, structured representation for a large MDP and could plan with it efficiently? That would allow exact solutions to very large MDPs.

A Planning Problem

Logical or Feature-based Problems
For most AI problems, states are not viewed as atomic entities; they contain structure. For example, they are described by a set of boolean propositions/variables, so |S| is exponential in the number of propositions. Basic policy iteration and value iteration do nothing to exploit the structure of the MDP when it is available.

Solution?
Require structured representations in terms of propositions: compactly represent the transition function, the reward function, and the value functions and policies. Require structured computation: perform the steps of PI or VI directly on these structured representations, avoiding the need to enumerate the state space. We start by representing the transition structure as dynamic Bayesian networks.

Propositional Representations
States are decomposable into state variables (we will assume boolean variables). Structured representations are the norm in AI: decision diagrams, Bayesian networks, etc. They describe how actions affect and depend on features; they are natural, concise, and can be exploited computationally. The same ideas can be used for MDPs.

Robot Domain as Propositional MDP
Propositional variables for the single-user version:
- Loc (robot's location): Office, Entrance
- T (lab is tidy): boolean
- CR (coffee request outstanding): boolean
- RHC (robot holding coffee): boolean
- RHM (robot holding mail): boolean
- M (mail waiting for pickup): boolean
Actions/events: move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab; mail arrival, coffee request issued, lab gets messy.
Rewards: rewarded for a tidy lab, for satisfying a coffee request, and for delivering mail (or penalized for their negation).
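As a concrete illustration (not from the original slides), here is a minimal Python sketch of how this propositional state space might be encoded; the class and field names are assumptions chosen to mirror the variables above:

```python
from itertools import product
from typing import NamedTuple

# Hypothetical encoding of the robot domain's propositional state.
# Loc takes one of two locations; the other five variables are boolean.
class State(NamedTuple):
    loc: str   # "Office" or "Entrance"
    t: bool    # lab is tidy
    cr: bool   # coffee request outstanding
    rhc: bool  # robot holding coffee
    rhm: bool  # robot holding mail
    m: bool    # mail waiting for pickup

# Enumerating the joint space shows why flat methods break down:
# 2 locations x 2^5 boolean assignments = 64 states for this tiny domain,
# and the count doubles with every additional boolean variable.
ALL_STATES = [State(loc, *bits)
              for loc in ("Office", "Entrance")
              for bits in product([False, True], repeat=5)]
assert len(ALL_STATES) == 64
```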

State Space
A state of the MDP is an assignment to these six variables, so the number of states grows exponentially with the number of variables (here 2 x 2^5 = 64 states). Explicit transition matrices require 4032 parameters per matrix (64 rows x 63 free entries, since each row sums to one), with one matrix per action (6, 7, or more actions). The explicit reward function needs 64 reward values. Factored state and action descriptions will (generally) break this exponential dependence.

Dynamic Bayesian Networks (DBNs)
Bayesian networks (BNs) are a common representation for probability distributions: a directed acyclic graph (DAG) represents conditional independence, and conditional probability tables (CPTs) quantify the local probability distributions. In the dynamic Bayes net action representation, there is one Bayes net for each action a, representing the set of conditional distributions Pr(St+1 | At, St); each state variable occurs at time t and t+1, and the dependence of t+1 variables on t variables is depicted by directed arcs.

DBN Representation: deliver coffee

Pr(RHMt+1 | RHMt):
  RHMt = T:  RHMt+1 = T with 1.0, F with 0.0
  RHMt = F:  RHMt+1 = T with 0.0, F with 1.0

Pr(Tt+1 | Tt):
  Tt = T:  Tt+1 = T with 0.91, F with 0.09
  Tt = F:  Tt+1 = T with 0.0,  F with 1.0

Pr(CRt+1 | Lt, CRt, RHCt)   (columns: CRt+1 = T, CRt+1 = F):
  L=O, CR=T, RHC=T:  0.2  0.8
  L=E, CR=T, RHC=T:  1.0  0.0
  L=O, CR=F, RHC=T:  0.1  0.9
  L=E, CR=F, RHC=T:  0.1  0.9
  L=O, CR=T, RHC=F:  1.0  0.0
  L=E, CR=T, RHC=F:  1.0  0.0
  L=O, CR=F, RHC=F:  0.1  0.9
  L=E, CR=F, RHC=F:  0.1  0.9

Pr(St+1 | St, deliver coffee) is the product of the six per-variable tables (the remaining CPTs, for M, L, and RHC, are analogous).
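To make the factored transition concrete, here is a small Python sketch (an illustration, not code from the paper) encoding three of the deliver-coffee CPTs above as functions of the time-t parents; the remaining CPTs would be encoded the same way, and the full transition probability is the product of the per-variable terms:

```python
# Pr(CR_{t+1} = true | L_t, CR_t, RHC_t), keyed by (loc, cr, rhc);
# values copied from the table above.
CR_CPT = {
    ("O", True,  True):  0.2,  ("E", True,  True):  1.0,
    ("O", False, True):  0.1,  ("E", False, True):  0.1,
    ("O", True,  False): 1.0,  ("E", True,  False): 1.0,
    ("O", False, False): 0.1,  ("E", False, False): 0.1,
}

def pr_rhm(rhm_t):   # Pr(RHM_{t+1} = true | RHM_t): holding mail persists
    return 1.0 if rhm_t else 0.0

def pr_tidy(t_t):    # Pr(T_{t+1} = true | T_t): a tidy lab may become messy
    return 0.91 if t_t else 0.0

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

# Product of the three factors shown; the real Pr(S_{t+1} | S_t, deliver coffee)
# multiplies all six per-variable CPTs in the same way.
def pr_partial(next_rhm, next_t, next_cr, cur):
    return (bern(pr_rhm(cur["RHM"]), next_rhm)
            * bern(pr_tidy(cur["T"]), next_t)
            * bern(CR_CPT[(cur["L"], cur["CR"], cur["RHC"])], next_cr))

s = {"L": "O", "CR": True, "RHC": True, "RHM": False, "T": True}
print(pr_partial(False, True, False, s))   # 1.0 * 0.91 * 0.8, approximately 0.728
```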

Benefits of DBN Representation
Pr(St+1 | St)
  = Pr(RHMt+1, Mt+1, Tt+1, Lt+1, CRt+1, RHCt+1 | RHMt, Mt, Tt, Lt, CRt, RHCt)
  = Pr(RHMt+1 | RHMt) * Pr(Mt+1 | Mt) * Pr(Tt+1 | Tt) * Pr(Lt+1 | Lt) * Pr(CRt+1 | CRt, RHCt, Lt) * Pr(RHCt+1 | RHCt, Lt)
Only 20 parameters are needed, vs. 4032 for the full matrix; the global exponential dependence is removed.
[Figure: the two-slice DBN graph over RHM, M, T, L, CR, RHC and, for contrast, the 64 x 64 full transition matrix.]
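The parameter counts on this slide can be checked directly; the following short sketch (my own illustration, assuming boolean variables plus the two-valued Loc) reproduces the 20 vs. 4032 comparison:

```python
# Each DBN CPT needs one free number per assignment of its time-t parents
# (the child is effectively binary, so one probability determines the row).
# The flat 64 x 64 matrix needs 63 free entries per row, since rows sum to 1.
parent_counts = {"RHM": 1, "M": 1, "T": 1, "L": 1, "CR": 3, "RHC": 2}
dbn_params = sum(2 ** k for k in parent_counts.values())   # 2+2+2+2+8+4 = 20
flat_params = 64 * 63                                       # = 4032
print(dbn_params, flat_params)
```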

Structure in CPTs
So far we have represented each CPT as a table whose size is exponential in the number of parents. Notice that there is regularity in the CPTs; e.g., Pr(CRt+1 | Lt, CRt, RHCt) has many similar entries. Compact function representations for CPTs can be used to great effect: decision trees and algebraic decision diagrams (ADDs/BDDs). Here we show examples using decision trees (DTs).

Action Representation – DBN/DT
Decision tree (DT) for Pr(CRt+1 = true | Lt, CRt, RHCt):
- CR(t) = false: 0.1
- CR(t) = true, RHC(t) = false: 1.0
- CR(t) = true, RHC(t) = true, L(t) = E: 1.0
- CR(t) = true, RHC(t) = true, L(t) = O: 0.2
The leaves of the DT give Pr(CRt+1 = true | Lt, CRt, RHCt); e.g., if CR(t) = true and RHC(t) = false then CR(t+1) = true with probability 1. DTs can often represent conditional probabilities much more compactly than a full conditional probability table.
[Figure: the two-slice DBN for deliver coffee with the CR CPT drawn as this decision tree.]
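Here is a minimal Python sketch of that decision tree (the nested-tuple encoding is my assumption, not the paper's data structure): an internal node is (variable, {value: subtree}) and a leaf is a probability. The same form is reused in the later tree-operation sketches.

```python
# Decision-tree CPT for Pr(CR_{t+1} = true | L_t, CR_t, RHC_t).
CR_TREE = ("CR", {                 # test CR(t)
    False: 0.1,                    # no outstanding request: small chance one arrives
    True: ("RHC", {                # request outstanding: test RHC(t)
        False: 1.0,                # not holding coffee: request surely persists
        True: ("L", {              # holding coffee: test location
            "E": 1.0,              # at entrance: cannot deliver, request persists
            "O": 0.2,              # at office: delivery usually succeeds
        }),
    }),
})

def tree_lookup(tree, state):
    """Walk the tree using the variable assignment in `state` (a dict)."""
    while isinstance(tree, tuple):
        var, branches = tree
        tree = branches[state[var]]
    return tree

# Four leaves cover all eight parent assignments of the tabular CPT.
assert tree_lookup(CR_TREE, {"CR": True, "RHC": False, "L": "O"}) == 1.0
```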

Reward Representation
Rewards are represented with DTs in a similar fashion; an explicit representation would require a vector of size 2^n. Reward decision tree:
- CR = true: -100 (high cost for an unsatisfied coffee request)
- CR = false, M = true: -10 (high, but lower, cost for undelivered mail)
- CR = false, M = false, T = false: -1 (cost for the lab being untidy)
- CR = false, M = false, T = true: 1 (small reward for satisfying all of these conditions)
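The reward tree can be written in the same nested-tuple form; this is a sketch under the same encoding assumption, with leaves holding immediate rewards rather than probabilities:

```python
# Reward decision tree from the slide: leaves are immediate rewards.
REWARD_TREE = ("CR", {
    True: -100.0,                  # unsatisfied coffee request
    False: ("M", {
        True: -10.0,               # undelivered mail
        False: ("T", {
            False: -1.0,           # untidy lab
            True: 1.0,             # all conditions satisfied
        }),
    }),
})
# tree_lookup from the previous sketch works unchanged on this tree.
```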

Structured Computation
Given our compact decision-tree (DBN) representation, can we solve the MDP without explicit state-space enumeration? Can we avoid O(|S|) computations by exploiting the regularities made explicit by the representation? We will study a general approach for doing this called structured dynamic programming.

Structured Dynamic Programming
We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure. VI and PI are based on a few basic operations; here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions. The approach is very general and, once the main idea is understood, can be applied to other representations (e.g. algebraic decision diagrams, the situation calculus) and other problems. We will focus on VI here, but the paper also describes a version of modified policy iteration.

Recall Tree-Based Representations
Note: we are leaving off time subscripts for readability and using X(t), Y(t), ..., instead. Recall that each action of the MDP has its own DBN. For example, the DBN below encodes facts such as: if X(t) = true then Y(t+1) = true with probability 0.9, and if X(t) = false and Y(t) = true then Y(t+1) = true with probability 1.
[Figure: DBN for action A over variables X, Y, Z with decision-tree CPTs, alongside a decision tree for the reward function R.]

Structured Dynamic Programming
Value functions and policies can also have tree representations, which are often much more compact than tables. Our goal: compute the tree representations of the policy and value function given the tree representations of the transitions and rewards.

Recall Value Iteration
$V_0(s) = R(s)$  (alternatively, initialize to 0)
$Q^a_{k+1}(s) = R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_k(s')$
$V_{k+1}(s) = \max_a Q^a_{k+1}(s)$  (Bellman backup)
Suppose that the initial $V_k$ is compactly represented as a tree. We show how to compute compact trees for $Q^{a_1}_{k+1}, \dots, Q^{a_n}_{k+1}$, and then use a max operation on the Q-trees (which returns a single tree).
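For contrast with the symbolic version developed next, here is a flat (tabular) value-iteration sketch of exactly this recursion; R, P, and gamma are assumed inputs, not anything defined on the slides:

```python
import numpy as np

def value_iteration(R, P, gamma=0.95, iters=100):
    """R[s]: reward for state s; P[a][s, s']: transition matrix for action a."""
    R = np.asarray(R, dtype=float)
    V = R.copy()                                  # V_0(s) = R(s)
    for _ in range(iters):
        # Q_{k+1}^a(s) = R(s) + gamma * sum_{s'} Pr(s'|s,a) V_k(s')
        Q = np.stack([R + gamma * (P_a @ V) for P_a in P])
        V = Q.max(axis=0)                         # V_{k+1}(s) = max_a Q_{k+1}^a(s)
    return V
```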

Symbolic Value Iteration
[Figure: for each action a, b, ..., z, the DBN Pr(S' | S, A=a) and the current value tree are combined to produce a Q-tree Q^{A=a}(X); a symbolic MAX over all the Q-trees yields the new value tree V(X).]

The MAX Tree Operation
A tree partitions the state space, assigning a value to each region. The state-space max of two such trees is again a piecewise-constant function. In general, how can we compute the tree representing the max?
[Figure: two small value trees over X and Y with leaf values 0.0, 0.9, and 1.0, and the region-wise maximum they induce.]

The MAX Tree Operation
We can simply append one tree to the leaves of the other: the result makes all the distinctions that either tree makes, and the max operation is then taken at the leaves of the result.
[Figure: appending the second tree to each leaf of the first; each leaf of the combined tree holds a pair of values whose max is taken.]

The MAX Tree Operation
The resulting tree may have unreachable leaves (paths that test the same variable twice with contradictory outcomes). We can simplify the tree by removing such paths; a combined sketch of the append-and-max step with this simplification appears below.
[Figure: the appended tree with an unreachable branch highlighted, and the simplified result.]
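Here is a minimal sketch of the append-and-max operation with the simplification built in, again assuming the nested-tuple tree form from earlier; tracking the tests already made on the current path is what drops the unreachable branches:

```python
def tree_max(t1, t2, path=None):
    """Symbolic MAX of two value trees; `path` records variable tests so far."""
    path = dict(path or {})
    if isinstance(t1, tuple):                    # descend the first tree
        var, branches = t1
        if var in path:                          # test already decided on this path
            return tree_max(branches[path[var]], t2, path)
        return (var, {v: tree_max(sub, t2, {**path, var: v})
                      for v, sub in branches.items()})
    if isinstance(t2, tuple):                    # t1 is a leaf: append t2 below it
        var, branches = t2
        if var in path:
            return tree_max(t1, branches[path[var]], path)
        return (var, {v: tree_max(t1, sub, {**path, var: v})
                      for v, sub in branches.items()})
    return max(t1, t2)                           # both leaves: take the max

# Example with two small trees over boolean X and Y (values chosen for illustration).
T1 = ("X", {True: 1.0, False: ("Y", {True: 1.0, False: 0.0})})
T2 = ("X", {True: 0.9, False: 0.0})
print(tree_max(T1, T2))
# ('X', {True: 1.0, False: ('Y', {True: 1.0, False: 0.0})})
```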

Binary Operation: Addition (other binary operations are handled similarly to MAX)

Marginalization ($\sum_A$)
Compute the diagram representing $G(B) = \sum_A F(A, B)$. There are libraries for doing this.
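As an illustration of this marginalization on the same nested-tuple trees (a sketch of mine, not one of the ADD libraries the slide refers to), a variable can be summed out by restricting it to each value and adding the restricted trees:

```python
def restrict(tree, var, value):
    """Fix `var` to `value` throughout the tree."""
    if not isinstance(tree, tuple):
        return tree
    v, branches = tree
    if v == var:
        return restrict(branches[value], var, value)
    return (v, {x: restrict(sub, var, value) for x, sub in branches.items()})

def combine(t1, t2, op, path=None):
    """Append-at-leaves combination of two trees with leaf operation `op`."""
    path = dict(path or {})
    for first, second in ((t1, t2), (t2, t1)):
        if isinstance(first, tuple):
            var, branches = first
            if var in path:
                return combine(branches[path[var]], second, op, path)
            return (var, {v: combine(sub, second, op, {**path, var: v})
                          for v, sub in branches.items()})
    return op(t1, t2)

def sum_out(tree, var, values=(False, True)):
    """G = sum over `var` of F: restrict to each value, then add the results."""
    parts = [restrict(tree, var, v) for v in values]
    result = parts[0]
    for p in parts[1:]:
        result = combine(result, p, lambda a, b: a + b)
    return result

# F(A, B) as a small example tree; summing out A leaves a tree over B only.
F = ("A", {True: ("B", {True: 0.2, False: 0.8}),
           False: ("B", {True: 0.5, False: 0.5})})
print(sum_out(F, "A"))   # ('B', {True: 0.7, False: 1.3})
```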

Recall Value Iteration
$Q^a_{k+1}(s) = R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_k(s')$
$V_{k+1}(s) = \max_a Q^a_{k+1}(s)$
Factoring Q, with $s = (x_1, \dots, x_l)$ and $s' = (x'_1, \dots, x'_l)$:
$Q^a_{k+1}(s) = R(s,a) + \gamma \sum_{x'_1, \dots, x'_l} \Pr(x'_1, \dots, x'_l \mid s, a)\, V_k(s')$
$= R(s,a) + \gamma \sum_{x'_1} \Pr(x'_1 \mid s, a) \sum_{x'_2} \Pr(x'_2 \mid s, a) \cdots \sum_{x'_l} \Pr(x'_l \mid s, a)\, V_k(s')$
We can do all of these operations symbolically given trees for V, Pr, and R.
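The nested-sum form is the key point; the following small numeric sketch (my own illustration, with assumed inputs R, cpts, and V) evaluates the backup for one state by summing out the next-state variables one at a time, which is the same order of operations the symbolic version performs on trees:

```python
def factored_q(s, R, cpts, V, gamma=0.95):
    """Q^a(s) = R(s) + gamma * E[V(x'_1, ..., x'_l)], with the expectation
    evaluated by summing out one next-state variable at a time.
    cpts[i](s) returns Pr(x'_i = true | s, a) for boolean next-state variables."""
    f = V                                     # f(x'_1, ..., x'_l)
    for i in reversed(range(len(cpts))):
        cpt_i, g = cpts[i], f
        # Sum out x'_i: the new f depends only on x'_1, ..., x'_{i-1}.
        def f(*xs, cpt_i=cpt_i, g=g):
            p = cpt_i(s)
            return p * g(*xs, True) + (1.0 - p) * g(*xs, False)
    return R(s) + gamma * f()

# Example: two boolean next-state variables, V(x1, x2) = x1 + 2*x2.
cpts = [lambda s: 0.8, lambda s: 0.5]
V = lambda x1, x2: (1.0 if x1 else 0.0) + (2.0 if x2 else 0.0)
print(factored_q(None, R=lambda s: 0.0, cpts=cpts, V=V, gamma=1.0))
# approximately 1.8 (= 0.8*1 + 0.5*2)
```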

Symbolic Bellman Backup
Let $X = (X_1, \dots, X_l)$ and $X' = (X'_1, \dots, X'_l)$; $V'_n$ denotes $V_n$ with each variable $X_i$ replaced by $X'_i$.
[Figure: the backup for each action a, carried out entirely on tree representations of the reward, the CPTs, and $V'_n$.]

Planning in Factored Action Spaces using Symbolic Dynamic Programming

Symbolic Computation of the $Q^a_{n+1}$ Tree
[Figure sequence: the Q-tree for action a is assembled step by step from the tree representations of $V_n$, the DBN CPT trees, and the reward tree.]

Symbolic Bellman Backup
[Figure: the complete symbolic Bellman backup, with the computation for each action a performed entirely on trees.]

SDP: Relative Merits
SDP is an adaptive, nonuniform, exact abstraction method: it provides an exact solution to the MDP and is much more efficient on certain problems (in time and space), solving 400-million-state problems in a couple of hours. A similar procedure can be formulated for modified policy iteration. Some drawbacks: it produces a piecewise-constant value function, and some problems admit no compact solution representation, so the trees blow up after enough iterations; approximation may then be desirable or necessary.

Approximate SDP
It is easy to approximate the solution using SDP, via simple pruning of the value function: simply "merge" leaves that have similar values. One can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]. This gives regions of approximately the same value.
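A minimal sketch of this leaf-merging idea on the nested-tuple trees used earlier (an illustration of mine, not the ADD-based pruning of the cited papers): if all leaves below a node lie within a tolerance, the whole subtree collapses to a single representative value.

```python
def leaf_bounds(tree):
    """Return (min, max) over all leaf values of the tree."""
    if not isinstance(tree, tuple):
        return tree, tree
    lo, hi = float("inf"), float("-inf")
    for sub in tree[1].values():
        slo, shi = leaf_bounds(sub)
        lo, hi = min(lo, slo), max(hi, shi)
    return lo, hi

def prune(tree, tol):
    """Merge subtrees whose leaves all lie within `tol` of each other."""
    lo, hi = leaf_bounds(tree)
    if hi - lo <= tol:
        return 0.5 * (lo + hi)   # one representative value; keeping the (lo, hi)
                                 # bounds instead gives intervals as on the next slide
    var, branches = tree
    return (var, {v: prune(sub, tol) for v, sub in branches.items()})

# Example: merging two nearby leaves under X=True (values chosen for illustration).
T = ("X", {True: ("Y", {True: 9.0, False: 10.0}), False: 5.19})
print(prune(T, tol=1.0))   # ('X', {True: 9.5, False: 5.19})
```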

A Pruned Value ADD
[Figure: an exact value ADD over the variables HCU, HCR, W, R, U, and Loc with numeric leaves (e.g. 10.00, 9.00, 8.45, 7.45, 6.64, 5.19), next to the pruned ADD whose merged leaves carry value intervals such as [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], and [5.19, 6.19].]

Approximate SDP: Relative Merits
Relative merits of ASDP: fewer regions implies faster computation (30-40-billion-state problems in a couple of hours), and it allows fine-grained control of time vs. solution quality with dynamic error bounds. Technical challenges include variable ordering, convergence, and fixed vs. adaptive tolerance. Some drawbacks: it (still) produces a piecewise-constant value function and does not exploit the additive structure of the value function at all. Bottom line: when a problem matches the structural assumptions of SDP we can gain much, but many problems do not match those assumptions.

Ongoing Work: Factored Action Spaces
Sometimes the action space is large but has structure, for example in cooperative multi-agent systems. Recent work (at OSU) has studied SDP for factored action spaces, which includes action variables in the DBNs alongside the state variables.