11/22: Conditional Planning & Replanning
Current standings sent
Semester project report due 11/30
Homework 4 will be due before the last class
Next class: Review of MDPs (*please* read Chapter 16 and the class slides)
Sensing Actions
Sensing actions in essence "partition" a belief state: sensing a formula f splits a belief state B into B&f and B&~f. Both partitions now need to be taken to the goal state, so we get a tree plan and AO* search.
Heuristics will have to compare two generalized AND branches. In the figure, the lower branch has an expected cost of 11,000. The upper branch has a fixed sensing cost of 300 plus, depending on the outcome, a cost of 7 or 12,000.
If we consider worst-case cost, we assume the upper branch costs 12,300.
If we consider both outcomes equally likely, we assume an expected cost of about 6,303 units (300 + 0.5*7 + 0.5*12,000).
If we know the actual probabilities with which the sensing action returns one result as against the other, we can use them to get the expected cost.
[Figure: AND branches with outcome costs 7 and 12,000 vs. a conformant branch costing 11,000]
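As a concrete illustration of the branch comparison above, here is a small Python sketch (not from the slides) that aggregates the costs the two ways described; the sensing cost of 300 is inferred from the 12,300 worst-case figure, and the probabilities are made-up assumptions.

```python
# Comparing a sensing (AND) branch against a conformant branch under different
# aggregation policies. Numbers follow the figure; probabilities are assumptions.

def worst_case(sense_cost, outcome_costs):
    """Pessimistic aggregation: sensing cost plus the costliest outcome."""
    return sense_cost + max(outcome_costs)

def expected(sense_cost, outcome_costs, probs):
    """Expected-cost aggregation given outcome probabilities."""
    return sense_cost + sum(p * c for p, c in zip(probs, outcome_costs))

lower_branch = 11_000                          # conformant branch, no sensing
upper = dict(sense_cost=300, outcome_costs=[7, 12_000])

print(worst_case(**upper))                      # 12300  -> prefer the lower branch
print(expected(**upper, probs=[0.5, 0.5]))      # 6303.5 -> prefer the upper branch
print(expected(**upper, probs=[0.1, 0.9]))      # 11100.7 -> much closer call
```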
Sensing: General observations
Sensing can be thought of in terms of specific state variables whose values can be found, OR sensing actions that evaluate the truth of some boolean formula over the state variables, e.g. Sense(p); Sense(p V (q&r)).
A general action may have both causative effects and sensing effects.
A sensing effect changes the agent's knowledge, and not the world.
A causative effect changes the world (and may give certain knowledge to the agent).
A pure sensing action only has sensing effects; a pure causative action only has causative effects.
Progression/Regression with Sensing
When applied to a belief state AT RUN TIME, the sensing effects of an action wind up reducing the cardinality of that belief state, basically by removing all states that are not consistent with the sensed effects.
AT PLAN TIME, sensing actions PARTITION belief states: if you apply Sense-f? to a belief state B, you get a partition of B into B1: B&f and B2: B&~f. You will have to make a plan that takes both partitions to the goal state, which introduces branches in the plan.
If you regress the two belief states B&f and B&~f over a sensing action Sense-f?, you get the belief state B.
Let B be the set of state variables that can be sensed: if a state variable p is in B, then there is some action A_p that can sense whether p is true or false.
If B = P (the set of all state variables), the problem is fully observable.
If B is empty, the problem is non-observable.
If B is a proper subset of P, it is partially observable.
Note: full vs. partial observability is independent of sensing individual fluents vs. sensing formulas (assuming single-literal sensing).
Full observability: the state space is partitioned into singleton observation classes.
Non-observability: the entire state space is a single observation class.
Partial observability: between 1 and |S| observation classes.
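A small illustrative sketch (not from the slides) of how a choice of observable variables partitions the state space into observation classes, for a hypothetical three-variable domain:

```python
from itertools import product
from collections import defaultdict

def observation_classes(variables, observable):
    """Group all complete states by the values of the observable variables.
    Two states in the same group cannot be distinguished by sensing."""
    classes = defaultdict(list)
    for values in product([True, False], repeat=len(variables)):
        state = dict(zip(variables, values))
        signature = tuple(state[v] for v in observable)
        classes[signature].append(state)
    return classes

P = ['p', 'q', 'r']
print(len(observation_classes(P, observable=P)))     # 8 singleton classes: fully observable
print(len(observation_classes(P, observable=[])))    # 1 class: non-observable
print(len(observation_classes(P, observable=['p']))) # 2 classes: partially observable
```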
Hardness classes for planning with sensing
Planning with sensing is hard or easy depending on (easier case listed first in each):
Whether the sensory actions give us full or partial observability
Whether the sensory actions sense individual fluents or formulas over fluents
Whether the sensing actions are always applicable or have preconditions that need to be achieved before the action can be done
A Simple Progression Algorithm in the presence of pure sensing actions
Call the procedure Plan(B_I, G, nil), where
Procedure Plan(B, G, P):
  If G is satisfied in all states of B, then return P.
  Non-deterministically choose:
  I. Non-deterministically choose a causative action a that is applicable in B.
     Return Plan(a(B), G, P+a)
  II. Non-deterministically choose a sensing action s that senses a formula f (could be a single state variable).
     Let p' = Plan(B&f, G, nil); p'' = Plan(B&~f, G, nil)   /* B&f is the set of states of B in which f is true */
     Return P + (s? : p' ; p'')
If we always pick I and never do II, then we will produce conformant plans (if we succeed).
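Below is a minimal deterministic rendering of the nondeterministic procedure above, as a backtracking search over choices I (causative) and II (sensing). The belief-state and action encodings are my own illustrative assumptions, not the slide's.

```python
# Belief states are frozensets of world states; each world state is a frozenset
# of true fluents. `causative` maps action names to state->state functions;
# `sensing` maps action names to state->bool tests for the sensed formula.

def plan(B, goal, causative, sensing, depth=4):
    if all(goal(s) for s in B):                       # G satisfied in every state of B
        return []
    if depth == 0:
        return None
    for name, apply_a in causative.items():           # choice I: causative action
        sub = plan(frozenset(apply_a(s) for s in B), goal, causative, sensing, depth - 1)
        if sub is not None:
            return [name] + sub
    for name, holds in sensing.items():               # choice II: sensing action
        B_f  = frozenset(s for s in B if holds(s))
        B_nf = frozenset(s for s in B if not holds(s))
        if not B_f or not B_nf:
            continue                                  # sensing would not split B
        p1 = plan(B_f,  goal, causative, sensing, depth - 1)
        p2 = plan(B_nf, goal, causative, sensing, depth - 1)
        if p1 is not None and p2 is not None:
            return [(name, p1, p2)]                   # branch: (sense, then-plan, else-plan)
    return None
```

Skipping the sensing loop reproduces the conformant-only behavior noted above.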
Remarks on progression with sensing actions
Progression is implicitly finding an AND subtree of an AND/OR graph. If we look for AND subgraphs, we can represent DAGs.
The amount of sensing done in the eventual solution plan is controlled by how often we pick step I vs. step II (if we always pick I, we get conformant solutions).
Progression is as clueless about whether to do sensing, and which sensing to do, as it is about which causative action to apply. We need heuristic support.
Heuristics for sensing
We need to compare the cumulative distance of B1 and B2 to the goal with that of B3 to the goal (B1, B2, B3 are the belief states shown in the figure).
Notice that planning cost is related to plan size, while plan execution cost is related to the length of the deepest branch (or the expected length of a branch).
If we use the conformant belief-state distance (as discussed last class), then we will be overestimating the distance (since sensing may allow us to take a shorter branch).
Bryce [ICAPS 05—submitted] starts with the conformant relaxed plan and introduces sensory actions into the plan to estimate the cost more accurately.
Sensing: More things under the mat (which we won't lift for now)
Sensing extends the notion of goals (and action preconditions).
Findout goals: "Check if Rao is awake" vs. "Wake up Rao". This presents some tricky issues in terms of goal satisfaction! You cannot use "causative" effects to support "findout" goals. But what if the causative effects are supporting another needed goal and wind up affecting the findout goal as a side effect? (e.g., Have-gong-go-off & find-out-if-Rao-is-awake)
Quantification is no longer syntactic sugar in effects and preconditions in the presence of sensing actions. Rm* can satisfy the effect "forall files, remove(file)" without KNOWING what the files in the directory are! This is an alternative to finding each file's name and doing rm.
Sensing actions can have preconditions (as well as other causative effects); they can have cost.
The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also Sphexishness) [XII/Puccini project]. Handling over-sensing using local closed-world assumptions: listing a file doesn't destroy your knowledge about the size of a file, but compressing it does. If you don't recognize this, you will always be checking the size of the file after each and every action.
Very simple example
A1: p => r, ~p
A2: ~p => r, p
A3: r => g
O5: observe(p)
Problem: Init: don't know p. Goal: g.
Plan: O5:p? [A1; A3] [A2; A3]
Notice that in this case we also have a conformant plan: A1; A2; A3.
Whether or not the conformant plan is cheaper depends on how costly the sensing action O5 is compared to A1 and A2.
Very simple example (continued): the plan as a tree.
[Figure: O5:p? branches on the answer; Y branch: A1; A3. N branch: A2; A3.]
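For concreteness, here is a quick sanity check (my own encoding of the example, not from the slides) that both the contingent plan and the conformant plan A1;A2;A3 achieve g from either initial world:

```python
# States are sets of true fluents; each action applies its conditional effect.

def A1(s): return (s | {'r'}) - {'p'} if 'p' in s else s   # p => r, ~p
def A2(s): return s | {'r', 'p'} if 'p' not in s else s    # ~p => r, p
def A3(s): return s | {'g'} if 'r' in s else s             # r => g

def run(plan, s):
    for step in plan:
        s = step(s)
    return s

for init in ({'p'}, set()):                      # the two possible initial worlds
    branch = [A1, A3] if 'p' in init else [A2, A3]   # contingent plan: O5 observes p, then branches
    assert 'g' in run(branch, init)
    assert 'g' in run([A1, A2, A3], init)            # conformant plan works without sensing
print("both plans achieve g in both worlds")
```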
A more interesting example: Medication
The patient is not Dead and may be Ill. The test paper is not Blue. We want to make the patient not Dead and not Ill.
We have three actions:
Medicate, which makes the patient not Ill if he is Ill (presumably it also harms a patient who is not Ill; otherwise medicating unconditionally would already be a conformant plan);
Stain, which makes the test paper Blue if the patient is Ill;
Sense-paper, which can tell us whether the paper is Blue or not.
No conformant plan is possible here. Also notice that I cannot be sensed directly but only through B.
This domain is partially observable because the states (~D, I, ~B) and (~D, ~I, ~B) cannot be distinguished.
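A sketch of the domain as I read it, under the hedged assumption that medicating a patient who is not Ill makes him Dead (the slide does not state Medicate's effect on a healthy patient, but without some such harm Medicate alone would be a conformant plan):

```python
# States are sets of true fluents drawn from {D, I, B}.

def medicate(s):
    return (s - {'I'}) if 'I' in s else (s | {'D'})   # assumed harm to a patient who is not Ill

def stain(s):
    return s | {'B'} if 'I' in s else s

goal = lambda s: 'D' not in s and 'I' not in s
belief = [{'I'}, set()]                               # ~D, ~B; I unknown

# Contingent plan: Stain; Sense-paper; if Blue then Medicate.
results = []
for s in belief:
    s = stain(s)
    if 'B' in s:                                      # Sense-paper tells us which branch we are in
        s = medicate(s)
    results.append(s)
assert all(goal(s) for s in results)

# The conformant candidate "just Medicate" fails under the assumption above:
assert not all(goal(medicate(s)) for s in belief)
print("contingent plan works; unconditional Medicate does not")
```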
“Goal directed” conditional planning
Recall that regression of the two belief states B&f and B&~f over a sensing action Sense-f will result in the belief state B. Search with this definition leads to two challenges:
1. We have to combine search states into single ones (a sort of reverse AO* operation).
2. We may need to explicitly condition a goal formula in the partially observable case (especially when certain fluents can only be indirectly sensed). An example is the Medicate domain, where I has to be found through B.
If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich.) Of course, we need to pick f such that f/~f can be sensed (i.e., f and ~f define an observational class feature). This step seems to go against the grain of "goal-directedness"—we may not know what to sense based on what our goal is, after all!
Regression for the PO case is still not well understood.
Regression
Handling the “combination” during regression
We have to combine search states into single ones (a sort of reverse AO* operation). Two ideas:
1. In addition to the normal regression children, also generate children from any pair of regressed states on the search fringe (has a breadth-first feel; can be expensive!) [Tuan Le does this]
2. Do a contingent regression. Specifically, go ahead and generate B from B&f using Sense-f, but now you have to go "forward" from the "not-f" branch of Sense-f to the goal too. [CNLP does this; see the example]
Need for explicit conditioning during regression (not needed for the fully observable case)
If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich.)
Of course, we need to pick f such that f/~f can be sensed (i.e., f and ~f define an observational class feature). This step seems to go against the grain of "goal-directedness"—we may not know what to sense based on what our goal is, after all!
Consider the Medicate problem: coming from the goal of ~D&~I, we will never see the connection to sensing Blue!
Notice the analogy to conditioning when evaluating a probabilistic query.
Similar processing can be done for regression (partial-order planning is nothing but least-committed regression planning). We now have yet another way of handling unsafe links: conditioning, to put the threatening step in a different world!
Sensing: More things under the mat
Sensing extends the notion of goals too: "Check if Rao is awake" vs. "Wake up Rao". This presents some tricky issues in terms of goal satisfaction!
Handling quantified effects and preconditions in the presence of sensing actions: Rm* can satisfy the effect "forall files, remove(file)" without KNOWING what the files in the directory are!
Sensing actions can have preconditions (as well as other causative effects).
The problem of OVER-SENSING (sort of like the beginning driver; also Sphexishness) [XII/Puccini project]. Handling over-sensing using local closed-world assumptions: listing a file doesn't destroy your knowledge about the size of a file, but compressing it does. If you don't recognize this, you will always be checking the size of the file after each and every action.
A general action may have both causative effects and sensing effects. A sensing effect changes the agent's knowledge, and not the world; a causative effect changes the world (and may give certain knowledge to the agent). A pure sensing action only has sensing effects; a pure causative action only has causative effects.
The recent work on conditional planning has considered mostly simplistic sensing actions that have no preconditions and only have pure sensing effects. Sensing has cost!
11/24: Replanning; MDPs [HW4 updated; see the paper task; only MDP material to be added]
Review: Sensing, more things under the mat (slide repeated from 11/22 above).
Sensing: Limited contingency planning
In many real-world scenarios, having a plan that works in all contingencies is too hard. An idea is to make a plan for some of the contingencies, and monitor/replan as necessary.
Qn: Which contingencies should we plan for? The ones that are most likely to occur (we need likelihoods).
Qn: What do we do if an unexpected contingency arises? Monitor (the observable parts of) the world; when it goes outside the expected states, replan starting from the current state.
Things are more complicated if the world is partially observable: we need to insert sensing actions to sense fluents that can only be indirectly sensed.
“Triangle Tables”
This involves disjunctive goals!
Replanning—Respecting Commitments
In the real world, where you make commitments based on your plan, you cannot just throw away the plan at the first sign of failure.
One heuristic is to reuse as much of the old plan as possible while replanning. A more systematic approach is to:
1. Capture the commitments made by the agent based on the current plan
2. Give these commitments as additional soft constraints to the planner
Replanning as a universal antidote…
If the domain is observable and lenient to failures, and we are willing to do replanning, then we can always handle non-deterministic as well as stochastic actions with classical planning:
1. Solve the "deterministic" relaxation of the problem
2. Start executing it, while monitoring the world state
3. When an unexpected state is encountered, replan
A planner that did exactly this, FF-Replan, won the first probabilistic track of the International Planning Competition.
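A rough sketch of the determinize-and-replan loop described above (in the spirit of FF-Replan, not its actual code); `determinized_model`, `classical_plan`, and `execute` are hypothetical stand-ins for a determinizer, a classical planner, and the real stochastic environment:

```python
def replan_loop(init_state, goal, determinized_model, classical_plan, execute):
    """determinized_model(s, a) -> predicted next state; execute(a) -> observed next state."""
    state = init_state
    while not goal(state):
        plan = classical_plan(state, goal, determinized_model)  # 1. solve the deterministic relaxation
        for action in plan:
            predicted = determinized_model(state, action)
            state = execute(action)                             # 2. execute, monitoring the world state
            if state != predicted:                              # 3. unexpected state encountered
                break                                           #    -> replan from here
    return state
```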
30 years of research into programming languages... and C++ is the result?
20 years of research into decision-theoretic planning... and FF-Replan is the result?
Models of Planning

                        Uncertainty
  Observation    Deterministic   Disjunctive   Probabilistic
  Complete       Classical       Contingent    (FO)MDP
  Partial        ???             Contingent    POMDP
  None           ???             Conformant    (NO)MDP
MDPs as Utility-based problem solving agents
[can generalize to have action costs C(a,s)]
If the M_ij matrix is not known a priori, then we have a reinforcement learning scenario.
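For reference, the update that the slide's figure presumably shows is the standard value-iteration (Bellman) backup in the M^a_ij transition-matrix notation; the discount factor and the action-cost generalization below are the textbook forms, not copied verbatim from the slide:

```latex
% Standard value-iteration backup (assumed textbook form):
U_{t+1}(i) \;\leftarrow\; R(i) + \gamma \max_{a} \sum_{j} M^{a}_{ij}\, U_{t}(j)
% With action costs C(a,i), the per-action term moves inside the max:
U_{t+1}(i) \;\leftarrow\; \max_{a} \Big[\, R(i) - C(a,i) + \gamma \sum_{j} M^{a}_{ij}\, U_{t}(j) \Big]
```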
(Value) How about the deterministic case? U(s_i) is (the cost of) the shortest path to the goal.
Think of these as h*() values: the optimal value function U* is closely related to h*.
Policies change with rewards.
What does a solution to an MDP look like? The solution should tell us the optimal action to do in each state (called a "policy").
–A policy is a function from states to actions (*see the finite-horizon case below*)
–Not a sequence of actions anymore; this is needed because of the non-deterministic actions
–If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
How do we get the best policy?
–Pick the policy that gives the maximal expected reward
–For each policy: simulate the policy (take the actions it suggests) to get behavior traces, evaluate the behavior traces, and take the average value of the traces.
How long should behavior traces be?
–Each trace is no longer than k (finite-horizon case). The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is). E.g., financial portfolio advice for yuppies vs. retirees.
–No limit on the size of the trace (infinite-horizon case). The policy is not horizon-dependent.
Qn: Is there a simpler way than having to evaluate |A|^|S| policies? Yes…
We will concentrate on infinite-horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).
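A small sketch (not from the slides) of the "simulate the policy and average the traces" evaluation described above, for the finite-horizon case; `sample_next` and `reward` are hypothetical stand-ins for the MDP's transition and reward model:

```python
def evaluate_policy(policy, start, sample_next, reward, horizon=20, trials=1000):
    """Monte-Carlo estimate of a policy's value by averaging behavior traces."""
    total = 0.0
    for _ in range(trials):
        s, trace_value = start, 0.0
        for _ in range(horizon):                 # each trace is no longer than the horizon k
            trace_value += reward(s)
            s = sample_next(s, policy(s))        # take the action the policy suggests
        total += trace_value
    return total / trials                        # average value of the behavior traces
```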
[Figure: example MDP with transition probabilities 0.8 and 0.1]
Why are values coming down first? Why are some states reaching their optimal value faster?
Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.
Terminating Value Iteration
The basic idea is to terminate the value iteration when the values have "converged" (i.e., are not changing much from iteration to iteration).
–Set a threshold ε and stop when the change across two consecutive iterations is less than ε.
–There is a minor problem since the value is a vector: we bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε. The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||U_i − U_{i+1}|| < ε.
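A compact value-iteration sketch with the max-norm termination test described above; the `P[s][a]` and `R[s]` encodings are my own assumptions, not from the slides:

```python
def value_iteration(states, actions, P, R, gamma=0.95, eps=1e-4):
    """P[s][a] is a list of (probability, next_state) pairs; R[s] is the reward of s."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {}
        for s in states:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in actions)
        delta = max(abs(U_new[s] - U[s]) for s in states)   # max norm ||U_i - U_{i+1}||
        U = U_new
        if delta < eps:                                     # converged
            return U
```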
Policies converge earlier than values
There are a finite number of policies but an infinite number of value functions, so entire regions of the value-vector space are mapped to a specific policy; thus policies may converge faster than values. This suggests searching in the space of policies.
Given a utility vector U_i we can compute the greedy policy π_i. The policy loss of π_i is ||U^{π_i} − U*|| (the max-norm difference of two vectors is the maximum amount by which they differ in any dimension).
[Figure: value space V(S_1) vs. V(S_2) for an MDP with 2 states and 2 actions, partitioned into policy regions P1..P4 around U*]
We can either solve the n linear equations with n unknowns exactly, or solve them approximately by running value iteration a few times (the update won't have the "max" operation).
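A sketch of that policy-evaluation step: with the action fixed by the policy the equations are linear, so we can solve them exactly or run a few max-free sweeps. The data layout (`P[s][a]` as (probability, next-state) pairs, `states` as an ordered list) is an assumption:

```python
import numpy as np

def evaluate_exact(states, policy, P, R, gamma=0.95):
    """Solve U = R + gamma * T U, i.e. (I - gamma T) U = R, exactly."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R[s] for s in states], dtype=float)
    for s in states:
        for p, s2 in P[s][policy[s]]:
            A[idx[s], idx[s2]] -= gamma * p
    return dict(zip(states, np.linalg.solve(A, b)))

def evaluate_approx(states, policy, P, R, U, gamma=0.95, sweeps=5):
    """A few simplified value-iteration sweeps: no max over actions."""
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][policy[s]])
             for s in states}
    return U
```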
Other ways of solving MDPs
Value and policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees, and both tend to be very inefficient for large (several-thousand-state) MDPs.
Many ideas are used to improve efficiency while giving up optimality guarantees:
–E.g., consider the policy only for the more likely states (envelope extension method)
–Interleave "search" and "execution" (Real-Time Dynamic Programming, RTDP): do a limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that sends you to the best value). The values of the leaf nodes are set to their immediate rewards. If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value; otherwise, it is an approximation.
What if you see this as a game?
The expected-value computation is fine if you are maximizing "expected" return.
What if you are risk-averse (and think "nature" is out to get you)? Then V_2 = min(V_3, V_4).
If you are a perpetual optimist, then V_2 = max(V_3, V_4).
Incomplete observability (the dreaded POMDPs)
To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not).
–A policy maps belief states to actions
In practice, this causes (humongous) problems:
–The space of belief states is "continuous" (even if the underlying world is discrete and finite). {GET IT? GET IT??}
–Even approximate policies are hard to find (PSPACE-hard). Problems with a few dozen world states are hard to solve currently.
–"Depth-limited" exploration (such as that done in adversarial games) is the only option…
Example belief state: {s_1: 0.3, s_2: 0.4, s_4: 0.3}.
[Figure: belief states change as we take actions, e.g. 5 LEFTs, 5 UPs]
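To see why belief states live in a continuous space, here is a sketch of the standard belief update (predict through the action, then condition on the observation and renormalize); `T` and `O` are hypothetical model names, not from the slides:

```python
def update_belief(belief, action, observation, T, O):
    """belief: {state: prob}; T[s][a]: {next_state: prob}; O[s][o]: prob of observing o in s."""
    predicted = {}
    for s, p in belief.items():                    # predict: push the belief through the action
        for s2, p2 in T[s][action].items():
            predicted[s2] = predicted.get(s2, 0.0) + p * p2
    filtered = {s: p * O[s][observation] for s, p in predicted.items()}   # condition on what we saw
    z = sum(filtered.values())
    return {s: p / z for s, p in filtered.items()} # renormalize
```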
MDPs and Deterministic Search
Problem-solving-agent search corresponds to what special case of MDP?
–Actions are deterministic; goal states are all equally valued, and are all sink states.
Is it worth solving the problem using MDPs?
–Constructing the optimal policy is overkill: the policy, in effect, gives us the optimal path from every state to the goal state(s).
–The value function, or its approximations, on the other hand, is useful. How? As heuristics for the problem-solving agent's search.
This shows an interesting connection between dynamic programming and "state search" paradigms:
–DP solves many related problems on the way to solving the one problem we want
–State search tries to solve just the problem we want
–We can use DP to find heuristics to run state search.
Modeling Softgoal problems as deterministic MDPs
SSPP—Stochastic Shortest Path Problem: an MDP with init and goal states
Plain MDPs don't have a notion of an "initial" and "goal" state (a process orientation instead of a "task" orientation).
–Goals are sort of modeled by reward functions, which allows pretty expressive goals (in theory)
–Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire state space anyway). One could consider "envelope extension" methods: compute a "deterministic" plan (which gives the policy for some of the states) and extend the policy to other states that are likely to be encountered during execution; RTDP methods do something similar.
SSPPs are a special case of MDPs where
–(a) the initial state is given
–(b) there are absorbing goal states
–(c) actions have costs, and goal states have zero cost.
A proper policy for an SSPP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states.
For an SSPP, it is worth finding a partial policy that only covers the "relevant" states (states that are reachable from the initial and goal states on any optimal policy).
–Value/policy iteration don't consider the notion of relevance
–Consider "heuristic state search" algorithms, where the heuristic can be seen as an "estimate" of the value of a state.
AO* search for solving SSPP problems
Main issues:
–The cost of a node is the expected cost of its children
–The AND tree can have LOOPS, which makes cost backup complicated
Intermediate nodes are given admissible heuristic estimates (these can be just the shortest paths, or estimates of them).
LAO*--turning bottom-up labeling into a full DP
RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it (or simulate it). Loop back.
Leaf nodes are evaluated using their "cached" values: if a node has been evaluated by RTDP in the past, use its remembered value; otherwise use a heuristic estimate, e.g. (a) immediate reward values or (b) reachability heuristics.
This is sort of like depth-limited game playing (expectimax). Who is the game against?
Can also do "reinforcement learning" this way; the M_ij are not known correctly in RL.
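A sketch of one RTDP trial in the spirit of the slide: follow the greedy action under the current value estimates, backing up values along the way. `P`, `R`, and `heuristic` are hypothetical names with the same layout assumed in the earlier sketches:

```python
import random

def rtdp_trial(start, actions, P, R, U, heuristic, gamma=0.95,
               terminal=frozenset(), max_steps=100):
    """P[s][a]: list of (prob, next_state); R[s]: reward; U: dict of cached values (updated in place)."""
    s = start
    for _ in range(max_steps):
        if s in terminal:
            break
        def q(a):                                   # one-step lookahead using cached/heuristic values
            return R[s] + gamma * sum(p * U.get(s2, heuristic(s2)) for p, s2 in P[s][a])
        best = max(actions, key=q)
        U[s] = q(best)                              # back up the value of s
        probs, nexts = zip(*P[s][best])
        s = random.choices(nexts, weights=probs)[0] # simulate the greedy action
    return U
```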
Greedy “On-Policy” RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back, until the values stabilize.
Envelope Extension Methods
For each action, take the most likely outcome and discard the rest. Find a plan (a deterministic path) from the init to the goal state. This is a (very partial) policy, covering just the states that fall on the maximum-probability state sequence.
Then consider the states that are most likely to be encountered while traveling this path, and find a policy for those states too.
The tricky part is to show that we can converge to the optimal policy.