Stochastic Planning using Decision Diagrams

Stochastic Planning using Decision Diagrams
Sumit Sanghai

Stochastic Planning
MDP model?
- Finite set of states S
- Set of actions A, each having a transition probability matrix
- Fully observable
- A reward function R associated with each state

Stochastic Planning…
Goal? A policy that maximizes the expected total discounted reward in an infinite-horizon model. A policy is a mapping from states to actions.
Problem? The total reward can be infinite.
Solution? Associate a discount factor β (0 < β < 1).

Expected Reward Model
V^π(s) = R(s) + β Σ_t Pr(s, π(s), t) V^π(t)
Optimality? Policy π is optimal if V^π(s) ≥ V^π'(s) for all s and all π'.
Theorem: an optimal policy exists; its value function is denoted V*.

Value Iteration
V_{n+1}(s) = R(s) + β max_a Σ_t Pr(s, a, t) V_n(t), with V_0(s) = R(s)
Stopping condition: if max_s |V_{n+1}(s) - V_n(s)| < ε(1-β) / 2β, then V_{n+1} is ε/2-close to V*.
Theorem: value iteration converges to an optimal policy.
Problem? Can be slow if the state space is too large.
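To make the update concrete, here is a minimal tabular value-iteration sketch in Python on a tiny made-up MDP; the state names, transition probabilities, and rewards are illustrative, not from the paper.

```python
# Tabular value iteration on a tiny, made-up MDP.
# P[s][a] is a list of (next_state, probability); R[s] is the state reward.
beta = 0.9          # discount factor
eps = 1e-4          # desired accuracy

states = ["s0", "s1"]
actions = ["stay", "go"]
R = {"s0": 0.0, "s1": 1.0}
P = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}

V = {s: R[s] for s in states}                      # V_0(s) = R(s)
while True:
    V_new = {
        s: R[s] + beta * max(
            sum(p * V[t] for t, p in P[s][a]) for a in actions
        )
        for s in states
    }
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    # Stopping condition from the slide: V is now within eps/2 of V*.
    if delta < eps * (1 - beta) / (2 * beta):
        break

policy = {
    s: max(actions, key=lambda a: sum(p * V[t] for t, p in P[s][a]))
    for s in states
}
print(V, policy)
```

The "flat" table V has one entry per state, which is exactly what becomes infeasible when the state space is large; the rest of the talk replaces this table with decision diagrams.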

Boolean Decision Diagrams (BDDs)
Graph representation of Boolean functions.
BDD = decision tree minus redundancies:
- Remove duplicate nodes
- Remove any node with both child pointers pointing to the same child
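A minimal sketch of how the two reduction rules can be enforced while building a diagram bottom-up with a "unique table"; the node representation and function name are illustrative, not any particular BDD package's API.

```python
# Sketch: enforcing the two BDD reduction rules during construction.
# A node is the tuple (var, low_child, high_child); leaves are 0 and 1.
unique = {}   # unique table: (var, low, high) -> shared node

def mk(var, low, high):
    if low == high:                 # Rule 1: both children identical -> drop the test
        return low
    key = (var, low, high)          # Rule 2: structurally equal nodes are merged
    return unique.setdefault(key, key)

# Example: f(x1, x2) = x1 OR x2 with ordering x1 < x2.
n_x2 = mk("x2", 0, 1)               # x2 only matters when x1 = 0
f = mk("x1", n_x2, 1)               # if x1 = 1, f is 1 regardless of x2
print(f)                            # ('x1', ('x2', 0, 1), 1)
```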

BDDs…
[Figure: truth table of an example function f(x1, x2, x3) alongside its decision tree and the reduced BDD.]

BDD operations
[Figure: applying OR to two BDDs to produce the result BDD.]

Variable ordering
(a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3)
- Ordering a1, b1, a2, b2, a3, b3: linear growth in the number of nodes
- Ordering a1, a2, a3, b1, b2, b3: exponential growth

BDDs are no magic
There are 2^(2^n) Boolean functions of n variables, but only exponentially many can be represented with a polynomial number of nodes, so most functions still require exponentially large BDDs.

ADDs (Algebraic Decision Diagrams)
BDDs with real-valued leaves instead of {0, 1}.
Useful for representing probabilities (and rewards/values).

MDP, State Space and ADDs
Factored MDP: the state is characterized by Boolean variables {X1, X2, …, Xn}.
Action a maps s to s', i.e., from {X1, X2, …, Xn} to {X1', X2', …, Xn'}.
Pr(s, a, s')? Factored into Pr_a(Xi' | X1, X2, …, Xn), one term per post-action variable.
Each such conditional distribution can be represented as an ADD.
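As a sketch of the factored representation, the following uses plain Python dicts as stand-ins for ADDs; the action, variable names, and probabilities are made up for illustration.

```python
# Factored transition model for a made-up action "a":
# Pr_a(Xi' = True | X1, X2), one table per primed variable, keyed by the
# current assignment (X1, X2). In SPUDD these tables are ADDs, which share
# structure whenever a probability does not depend on all parents.
P_a = {
    "X1'": {  # X1' happens to depend only on X1
        (True, True): 0.9, (True, False): 0.9,
        (False, True): 0.1, (False, False): 0.1,
    },
    "X2'": {  # X2' depends on both X1 and X2
        (True, True): 1.0, (True, False): 0.5,
        (False, True): 0.7, (False, False): 0.0,
    },
}

def pr_next(s, s_next):
    """Pr(s, a, s') as the product of the per-variable factors."""
    prob = 1.0
    for i, name in enumerate(["X1'", "X2'"]):
        p_true = P_a[name][s]
        prob *= p_true if s_next[i] else (1.0 - p_true)
    return prob

print(pr_next((True, False), (True, True)))   # 0.9 * 0.5 = 0.45
```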

Value Iteration Using ADDs
V_{n+1}(s) = R(s) + β max_a { Σ_t P(s, a, t) V_n(t) }
R(s) is an ADD.
P(s, a, t) = Pr_a(X1'=x1', …, Xn'=xn' | X1=x1, …, Xn=xn) = Π_i Pr_a(Xi'=xi' | X1=x1, …, Xn=xn)
Hence:
V_{n+1}(X1, …, Xn) = R(X1, …, Xn) + β max_a { Σ_{X1', …, Xn'} Π_i Pr_a(Xi' | X1, …, Xn) V_n(X1', …, Xn') }
The second term on the RHS is computed by eliminating one primed variable at a time: quantify X1' as true and as false, multiply its ADD with V_n, and sum the two cases, leaving
Σ_{X2', …, Xn'} { Π_{i=2..n} Pr_a(Xi' | X1, …, Xn) ( Pr_a(X1'=true | X1, …, Xn) V_n(X1'=true, X2', …, Xn') + Pr_a(X1'=false | X1, …, Xn) V_n(X1'=false, X2', …, Xn') ) }
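A brute-force sketch of this backup, eliminating the primed variables one at a time; dicts over full assignments stand in for ADDs (a real ADD package would exploit shared structure instead of enumerating every current state), and all names are illustrative.

```python
from itertools import product

def backup(V, R, P, actions, beta, n):
    """One Bellman backup over n Boolean state variables.
    V, R: dict mapping an assignment tuple (x1, ..., xn) to a float.
    P[a][i]: dict mapping the current assignment x to Pr_a(Xi' = True | x)."""
    V_new = {}
    for x in product([False, True], repeat=n):
        best = float("-inf")
        for a in actions:
            # Eliminate the primed variables one at a time, as on the slide:
            # start from V_n over (X1', ..., Xn') and sum out X1' first.
            table = dict(V)
            for i in range(n):
                p_true = P[a][i][x]
                table = {
                    y: (1 - p_true) * table[(False,) + y]
                       + p_true * table[(True,) + y]
                    for y in product([False, True], repeat=n - i - 1)
                }
            best = max(best, table[()])   # = sum_{x'} prod_i Pr_a(Xi'|x) V_n(x')
        V_new[x] = R[x] + beta * best
    return V_new
```

Each pass of the inner loop corresponds to multiplying one dual action diagram into the value diagram and summing out that primed variable; with ADDs the intermediate results stay compact instead of being tabulated per state.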

Value Iteration Using ADDs (other possibilities)
Which variables are necessary? Only those appearing in the value function.
In what order are variables eliminated? The inverse of the diagram's variable ordering.
Problem? Repeated computation of Pr(s, a, t).
Solution? Precompute Pr_a(X1', …, Xn' | X1, …, Xn) by multiplying the dual action diagrams.

Value Iteration…
Space vs. time?
- Full precomputation: huge space required.
- No precomputation: time wasted on repeated work.
Solution (do something intermediate): divide the variables into sets (a restriction?) and precompute the combined diagram for each set.
Problems with precomputation:
- Precomputing for sets containing variables that do not appear in the value function.
- Dynamic precomputation.

Experiments
Goals?
- SPUDD vs. normal (flat) value iteration
- What is SPI? How is the comparison done?
- Worst case of SPUDD
Missing links?
- SPUDD vs. other methods
- Space vs. time experiments

Future Work
- Variable reordering
- Policy iteration
- Approximate ADDs
- Formal model for structure exploitation in BDDs, e.g. symmetry detection
- First-order ADDs

Approximate ADDs
[Figure: an ADD over x1, x2, x3 with leaves 0.9, 1.1, 0.1, 6.7, and its approximation in which the leaves 0.9 and 1.1 are merged into the interval [0.9, 1.1].]

Approximate ADDs
At each leaf node? Store a range [min, max].
What value and error do you associate with that leaf?
How, and until when, do you merge leaves? max_size vs. max_error:
- max_size mode: merge the closest pairs of leaves until size < max_size.
- max_error mode: merge pairs as long as the resulting error < max_error.
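A sketch of the max_size mode, assuming leaves are merged by repeatedly picking the adjacent pair of sorted values whose merged interval is narrowest; the function name and tie-breaking details are illustrative.

```python
# Sketch of max_size leaf merging for an approximate ADD.
# Leaves are intervals [lo, hi]; exact leaves start as [v, v].
def merge_leaves(values, max_size):
    leaves = sorted([v, v] for v in set(values))   # distinct exact leaves
    while len(leaves) > max_size:
        # Merge the adjacent pair whose combined interval is narrowest.
        i = min(range(len(leaves) - 1),
                key=lambda j: leaves[j + 1][1] - leaves[j][0])
        lo, hi = leaves[i][0], leaves[i + 1][1]
        leaves[i:i + 2] = [[lo, hi]]
    return leaves

# Example using the leaves from the previous slide's figure:
# 0.9 and 1.1 get merged, 0.1 and 6.7 stay exact.
print(merge_leaves([0.9, 1.1, 0.1, 6.7], max_size=3))
# [[0.1, 0.1], [0.9, 1.1], [6.7, 6.7]]
```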

Approximate Value Iteration
V_{n+1} from V_n? At each leaf do the calculation on both endpoints, e.g. [min1, max1] * [min2, max2] = [min1*min2, max1*max2] (valid for non-negative values).
What about the max_a step? Take the maximum and then reduce (merge leaves) again.
When to stop? When the ranges for every state in two consecutive value functions overlap or lie within some tolerance ε.
How to get a policy? Find the actions that maximize the value function, with each range replaced by its midpoint.
Convergence?
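A sketch of the interval bookkeeping this requires, assuming all values are non-negative so the endpoint rule above applies directly; the function names are illustrative.

```python
# Interval arithmetic over (min, max) pairs, assuming non-negative values.
def iv_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def iv_mul(a, b):            # endpoint rule; valid when all endpoints >= 0
    return (a[0] * b[0], a[1] * b[1])

def iv_max(a, b):            # the max_a step, taken endpoint-wise
    return (max(a[0], b[0]), max(a[1], b[1]))

def converged(v_old, v_new, eps):
    """Stop when corresponding ranges overlap or are within tolerance eps."""
    return all(new[0] <= old[1] + eps and old[0] <= new[1] + eps
               for old, new in zip(v_old, v_new))

# Example: one discounted, weighted contribution to a backup.
print(iv_add((0.0, 0.0), iv_mul((0.9, 1.1), (0.5, 0.5))))   # (0.45, 0.55)
```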

Variable Reordering
- Intuitive ordering: variables that are correlated should be placed together.
- Random: pick pairs of variables and swap them.
- Rudell's sifting: pick a variable and find a better position for it.
Experiments: sifting did very well.
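A sketch of the idea behind sifting one variable, assuming a caller-supplied size callback; a real implementation swaps adjacent levels in place and tracks the size incrementally rather than rebuilding the diagram for every candidate ordering.

```python
def sift_one(order, var, bdd_size):
    """Rudell-style sifting for a single variable: try it at every position
    in the ordering and keep the position giving the smallest diagram.
    bdd_size(order) is a caller-supplied callback returning the diagram size
    under that ordering."""
    rest = [v for v in order if v != var]
    candidates = [rest[:i] + [var] + rest[i:] for i in range(len(rest) + 1)]
    return min(candidates, key=bdd_size)

# Usage idea: repeatedly call sift_one for each variable (largest levels
# first) until no single move improves the size.
```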