1 GraphPlan, Satplan and Markov Decision Processes Sungwook Yoon* * Based in part on slides by Alan Fern
2 GraphPlan Many planning systems use ideas from Graphplan: IPP, STAN, SGP, Blackbox, Medic Can run much faster than POP-style planners History Before GraphPlan came out, most planning researchers were working on POP-style planners GraphPlan started them thinking about other more efficient algorithms Recent planning algorithms run much faster than GraphPlan However, most of them of them have been influenced by GraphPlan
3 Big Picture A big source of inefficiency in search algorithms is the large branching factor GraphPlan reduces the branching factor by searching in a special data structure Phase 1 – Create a Planning Graph built from initial state contains actions and propositions that are possibly reachable from initial state does not include unreachable actions or propositions Phase 2 - Solution Extraction Backward search for the Search for solution in the planning graph backward from goal
4 Planning Graph Sequence of levels that correspond to time-steps in the plan: Each level contains a set of literals and a set of actions Literals are those that could possibly be true at the time step Actions are those that their preconditions could be satisfied at the time step. Idea: construct superset of literals that could be possible achieved after an n-level layered plan Gives a compact (but approximate) representation of states that are reachable by n level plans A literal is just a positive or negative propositon
5 Planning Graph … … … … s0s0 snsn … … … … … … anan S n+1 propositions actions state-level 0: propositions true in s 0 state-level n: literals that may possibly be true after some n level plan action-level n: actions that may possibly be applicable after some n level plan
6 Planning Graph … … … … … … … … … … propositions actions maintenance action (persistence actions) represents what happens if no action affects the literal include action with precondition c and effect c, for each literal c
7 Graph expansion Initial proposition layer Just the initial conditions Action layer n If all of an action’s preconditions are in proposition layer n, then add action to layer n Proposition layer n+1 For each action at layer n (including persistence actions) Add all its effects (both positive and negative) at layer n+1 (Also allow propositions at layer n to persist to n+1 Propagate mutex information (we’ll talk about this in a moment)
8 Example holding(A) clear(B) holding(A) ~holding(A) clear(B) on(A,B) handempty ~clear(B) stack(a,b) stack(A,B) precondition: holding(A), clear(B) effect: ~holding(A), ~clear(B), on(A,B), clear(B), handempty s0a0s1
9 Example holding(A) clear(B) holding(A) ~holding(A) clear(B) on(A,B) handempty ~clear(B) stack(A,B) precondition: holding(A), clear(B) effect: ~holding(A), ~clear(B), on(A,B), clear(B), handempty s0a0s1 Notice that not all literals in s1 can be made true simultaneously after 1 level: e.g. holding(A), ~holding(A) and on(A,B), clear(B)
10 Mutual Exclusion (Mutex) Between pairs of actions no valid plan could contain both at layer n E.g., stack(a,b), unstack(a,b) Between pairs of literals no valid plan could produce both at layer n E.g., clear(a), ~clear(a) on(a,b), clear(b) GraphPlan checks pairs only mutex relationships can help rule out possibilities during search in phase 2 of Graphplan
11 Solution Extraction: Backward Search Repeat until goal set is empty If goals are present & non-mutex: 1) Choose set of non-mutex actions to achieve each goal 2) Add preconditions to next goal set
12 Searching for a solution plan Backward chain on the planning graph Achieve goals level by level At level k, pick a subset of non-mutex actions to achieve current goals. Their preconditions become the goals for k-1 level. Build goal subset by picking each goal and choosing an action to add. Use one already selected if possible (backtrack if can’t pick non-mutex action) If we reach the initial proposition level and the current goals are in that level (i.e. they are true in the initial state) then we have found a successful layered plan
13 GraphPlan algorithm Grow the planning graph (PG) to a level n such that all goals are reachable and not mutex necessary but insufficient condition for the existence of an n level plan that achieves the goals if PG levels off before non-mutex goals are achieved then fail Search the PG for a valid plan If none found, add a level to the PG and try again If the PG levels off and still no valid plan found, then return failure Correctness follows from PG properties
14 Important Ideas Plan graph construction is polynomial time Though construction can be expensive when there are many “objects” and hence many propositions The plan graph captures important properties of the planning problem Necessarily unreachable literals and actions Possibly reachable literals and actions Mutually exclusive literals and actions Significantly prunes search space compared to POP style planners The plan graph provides a sound termination procedure Knows when no plan exists Plan graphs can also be used for deriving admissible (and good non-admissible) heuristics See your book (we may come back to this idea later)
15 Encoding Planning as Satisfiability: Basic Idea Bounded planning problem (P,n): P is a planning problem; n is a positive integer Find a solution for P of length n Create a propositional formula that represents: Initial state Goal Action Dynamics for n time steps We will define the formula for (P,n) such that: 1) any model (i.e. satisfying truth assignment) of the formula represent a solution to (P,n) 2) if (P,n) has a solution then the formula is satisfiable
16 Example of Complete Formula for (P,1) [ at(r1,l1,0) at(r1,l2,0) ] at(r1,l2,1) [ move(r1,l1,l2,0) at(r1,l1,0) ] [ move(r1,l1,l2,0) at(r1,l2,1) ] [ move(r1,l1,l2,0) at(r1,l1,1) ] [ move(r1,l2,l1,0) at(r1,l2,0) ] [ move(r1,l2,l1,0) at(r1,l1,1) ] [ move(r1,l2,l1,0) at(r1,l2,1) ] [ move(r1,l1,l2,0) move(r1,l2,l1,0) ] [ at(r1,l1,0) at(r1,l1,1) move(r1,l2,l1,0) ] [ at(r1,l2,0) at(r1,l2,1) move(r1,l1,l2,0) ] [ at(r1,l1,0) at(r1,l1,1) move(r1,l1,l2,0) ] [ at(r1,l2,0) at(r1,l2,1) move(r1,l2,l1,0) ] We’ll now discuss how to construct such a formula Formula has propositions for actions and states variables at each possible timestep
17 Overall Approach Do iterative deepening like we did with Graphplan: for n = 0, 1, 2, …, encode (P,n) as a satisfiability problem if is satisfiable, then From the set of truth values that satisfies , a solution plan can be constructed, so return it and exit With a complete satisfiability tester, this approach will produce optimal layered plans for solvable problems We can use a GraphPlan analysis to determine an upper bound on n, giving a way to detect unsolvability
18 Fluents (will be used as propositons) If plan a 0, a 1, …, a n–1 is a solution to (P,n), then it generates a sequence of states s 0, s 1, …, s n–1 A fluent is a proposition used to describe what’s true in each s i on(A,B,i) is a fluent that’s true iff at(r1,loc1) is in s i We’ll use e i to denote the fluent for a fact e in state s i e.g. if e = at(r1,loc1) then e i = at(r1,loc1,i) a i is a fluent saying that a is a action taken at step i e.g., if a = move(r1,loc2,loc1) then a i = move(r1,loc2,loc1,i) The set of all possible fluents for (P,n) form the set of primitive propositions used to construct our formula for (P,n)
19 Encoding Planning Problems We can encode (P,n) so that we consider either layered plans or totally ordered plans an advantage of considering layered plans is that fewer time steps are necessary (i.e. smaller n translates into smaller formulas) for simplicity we first consider totally-ordered plans Encode (P,n) as a formula such that a 0, a 1, …, a n–1 is a solution for (P,n) if and only if can be satisfied in a way that makes the fluents a 0, …, a n–1 true will be conjunction of many other formulas …
20 Formulas in Formula describing the initial state: (let E be the set of possible facts in the planning problem) /\ {e 0 | e s 0 } /\ { e 0 | e E – s 0 } Describes the complete initial state (both positive and negative fact) E.g. on(A,B,0) on(B,A,0) Formula describing the goal: (G is set of goal facts) /\ {e n | e G} says that the goal facts must be true in the final state at timestep n E.g. on(B,A,n) Is this enough? Of course not. The formulas say nothing about actions.
21 Formulas in For every action a and timestep i, formula describing what fluents must be true if a were the i’th step of the plan: a i /\ {e i | e Precond(a)}, a’s preconditions must be true a i /\ {e i+1 | e ADD(a)}, a’s ADD effects must be true in i+1 a i /\ { e i+1 | e DEL(a)}, a’s DEL effects must be false in i+1 Complete exclusion axiom: For all actions a and b and timesteps i, formulas saying a and b can’t occur at the same time a i b i this guarantees there can be only one action at a time Is this enough? The formulas say nothing about what happens to facts if they are not effected by an action This is known as the frame problem
22 Example Planning domain: one robot r1 two adjacent locations l1, l2 one operator (move the robot) Encode (P,n) where n = 1 Initial state:{at(r1,l1)} Encoding:at(r1,l1,0) at(r1,l2,0) Goal:{at(r1,l2)} Encoding:at(r1,l2,1) Action Schema: see next slide
23 Extracting a Plan Suppose we find an assignment of truth values that satisfies . This means P has a solution of length n For i=0,…,n-1, there will be exactly one action a such that a i = true This is the i’th action of the plan. Example (from the previous slides): can be satisfied with move(r1,l1,l2,0) = true Thus move(r1,l1,l2,0) is a solution for (P,0) It’s the only solution - no other way to satisfy
24 What SATPLAN Shows General propositional reasoning can compete with state of the art specialized planning systems New, highly tuned variations of DP surprising powerful Radically new stochastic approaches to SAT can provide very low exponential scaling Why does it work? More flexible than forward or backward chaining Randomized algorithms less likely to get trapped along bad paths
25 BlackBox (GraphPlan + SatPlan) The BlackBox procedure combines planning-graph expansion and satisfiability checking It is roughly as follows: for n = 0, 1, 2, … Graph expansion: create a “planning graph” that contains n “levels” Check whether the planning graph satisfies a necessary (but insufficient) condition for plan existence If it does, then Encode (P,n) as a satisfiability problem but include only the actions in the planning graph If is satisfiable then return the solution
26 Blackbox STRIPS Plan Graph Mutex computation CNF General Stochastic / Systematic SAT engines Solution Simplifier Translator CNF Can be thought of as an implementation of GraphPlan that uses an alternative plan extraction technique than the backward chaining of GraphPlan.
27 Percepts Actions ???? World perfect fully observable instantaneous deterministic Classical Planning Assumptions sole source of change
28 Percepts Actions ???? World perfect fully observable instantaneous stochastic Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model sole source of change
29 Types of Uncertainty Disjunctive (used by non-deterministic planning) Next state could be one of a set of states. Stochastic/Probabilistic Next state is drawn from a probability distribution over the set of states. How are these models related?
30 Markov Decision Processes An MDP has four components: S, A, R, T: (finite) state set S (|S| = n) (finite) action set A (|A| = m) (Markov) transition function T(s,a,s’) = Pr(s’ | s,a) Probability of going to state s’ after taking action a in state s How many parameters does it take to represent? bounded, real-valued reward function R(s) Immediate reward we get for being in state s For example in a goal-based domain R(s) may equal 1 for goal states and 0 for all others Can be generalized to include action costs: R(s,a) Can be generalized to be a stochastic function Can easily generalize to countable or continuous state and action spaces (but algorithms will be different)
31 Graphical View of MDP StSt RtRt S t+1 AtAt R t+1 S t+2 A t+1 R t+2
32 Assumptions First-Order Markovian dynamics (history independence) Pr(S t+1 |A t,S t,A t-1,S t-1,..., S 0 ) = Pr(S t+1 |A t,S t ) Next state only depends on current state and current action First-Order Markovian reward process Pr(R t |A t,S t,A t-1,S t-1,..., S 0 ) = Pr(R t |A t,S t ) Reward only depends on current state and action As described earlier we will assume reward is specified by a deterministic function R(s) i.e. Pr(R t =R(S t ) | A t,S t ) = 1 Stationary dynamics and reward Pr(S t+1 |A t,S t ) = Pr(S k+1 |A k,S k ) for all t, k The world dynamics do not depend on the absolute time Full observability Though we can’t predict exactly which state we will reach when we execute an action, once it is realized, we know what it is
33 Policies (“plans” for MDPs) Nonstationary policy π :S x T → A, where T is the non-negative integers π (s,t) is action to do at state s with t stages-to-go What if we want to keep acting indefinitely? Stationary policy π: S → A π (s) is action to do at state s (regardless of time) specifies a continuously reactive controller These assume or have these properties: full observability history-independence deterministic action choice Why not just consider sequences of actions? Why not just replan?
34 Value of a Policy How good is a policy π ? How do we measure “accumulated” reward? Value function V: S →ℝ associates value with each state (or each state and time for non-stationary π) V π (s) denotes value of policy at state s Depends on immediate reward, but also what you achieve subsequently by following π An optimal policy is one that is no worse than any other policy at any state The goal of MDP planning is to compute an optimal policy (method depends on how we define value)
35 Policy Evaluation Value equation for fixed policy How can we compute the value function for a policy? we are given R and Pr simple linear system with n variables (each variables is value of a state) and n constraints (one value equation for each state) Use linear algebra (e.g. matrix inverse)
36 Value Iteration vs. Policy Iteration Which is faster? VI or PI It depends on the problem VI takes more iterations than PI, but PI requires more time on each iteration PI must perform policy evaluation on each step which involves solving a linear system Complexity: There are at most exp(n) policies, so PI is no worse than exponential time in number of states Empirically O(n) iterations are required Still no polynomial bound on the number of PI iterations (open problem)!