11/22: Conditional Planning & Replanning Current Standings sent Semester project report due 11/30 Homework 4 will be due before the last class Next class: Review of MDPs (*please* read Chapter 16 and the class slides)

Sensing Actions  Sensing actions in essence “partition” a belief state  Sensing a formula f splits a belief state B into B&f and B&~f  Both partitions need to be taken to the goal state now  Tree plan  AO* search  Heuristics will have to compare two generalized AND branches  In the figure, the lower branch has an expected cost of 11,000  The upper branch has a fixed sensing cost plus, based on the outcome, a cost of 7 or 12,000  If we consider the worst-case cost, we assume the cost is 12,300  If we consider both outcomes to be equally likely, we take the average cost  If we know the actual probabilities with which the sensing action returns one result as against the other, we can use them to get the expected cost…
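To make this branch-cost bookkeeping concrete, here is a minimal sketch; the numbers are hypothetical stand-ins (the sensing-action cost of 300 is only an assumption, chosen to be consistent with the 12,300 worst case quoted above):

```python
# Sketch: scoring a sensing (AND) branch against a non-sensing branch.
def worst_case_cost(sense_cost, branch_costs):
    return sense_cost + max(branch_costs)

def expected_cost(sense_cost, branch_costs, probs=None):
    if probs is None:
        probs = [1.0 / len(branch_costs)] * len(branch_costs)   # equally likely outcomes
    return sense_cost + sum(p * c for p, c in zip(probs, branch_costs))

sensing_branch = [7, 12_000]       # cost after each sensing outcome
conformant_branch = 11_000         # expected cost of the lower (non-sensing) branch

print(worst_case_cost(300, sensing_branch))   # 12300  (worst-case view)
print(expected_cost(300, sensing_branch))     # 6303.5 (equal-likelihood view)
print(conformant_branch)                      # compare against the non-sensing branch
```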

Sensing: General observations  Sensing can be thought of in terms of  specific state variables whose values can be found  OR sensing actions that evaluate the truth of some Boolean formula over the state variables.  Sense(p); Sense(p V (q&r))  A general action may have both causative effects and sensing effects  A sensing effect changes the agent’s knowledge, and not the world  A causative effect changes the world (and may give certain knowledge to the agent)  A pure sensing action only has sensing effects; a pure causative action only has causative effects.

Progression/Regression with Sensing  When applied to a belief state, AT RUN TIME the sensing effects of an action wind up reducing the cardinality of that belief state  basically by removing all states that are not consistent with the sensed effects  AT PLAN TIME, sensing actions PARTITION belief states  If you apply Sense-f? to a belief state B, you get a partition of B into B_1: B&f and B_2: B&~f  You will have to make a plan that takes both partitions to the goal state  This introduces branches in the plan  If you regress two belief states B&f and B&~f over a sensing action Sense-f?, you get the belief state B
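A minimal sketch of the two views, using my own encoding (belief states as sets of world states, each world state a frozenset of true propositions); this is not a specific planner's API:

```python
def sense_split(belief, f):
    """Plan time: sensing formula f partitions belief B into B&f and B&~f."""
    return ({s for s in belief if f(s)}, {s for s in belief if not f(s)})

def observe(belief, f, value):
    """Run time: keep only the states consistent with the sensed value of f."""
    return {s for s in belief if f(s) == value}

def regress_sense(b_f, b_not_f):
    """Regressing B&f and B&~f over Sense-f gives back the belief state B."""
    return b_f | b_not_f

B = {frozenset({'p', 'q'}), frozenset({'q'}), frozenset({'p'})}
holds_p = lambda s: 'p' in s
B1, B2 = sense_split(B, holds_p)        # plan-time partition
assert regress_sense(B1, B2) == B
print(observe(B, holds_p, True))        # run-time pruning after sensing p = true
```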

If a state variable p is in B, then there is some action A_p that can sense whether p is true or false (here P is the set of all state variables and B the set of sensable ones). If P = B, the problem is fully observable. If B is empty, the problem is non-observable. If B is a proper, nonempty subset of P, it is partially observable. Note: Full vs. partial observability is independent of sensing individual fluents vs. sensing formulas (assuming single-literal sensing).

Full Observability: the state space is partitioned into singleton observation classes. Non-observability: the entire state space is a single observation class. Partial Observability: between 1 and |S| observation classes.

Hardness classes for planning with sensing  Planning with sensing is hard or easy depending on: (easy case listed first)  Whether the sensory actions give us full or partial observability  Whether the sensory actions sense individual fluents or formulas on fluents  Whether the sensing actions are always applicable or have preconditions that need to be achieved before the action can be done

A Simple Progression Algorithm in the presence of pure sensing actions  Call the procedure Plan(B_I, G, nil), where  Procedure Plan(B,G,P)  If G is satisfied in all states of B, then return P  Non-deterministically choose:  I. Non-deterministically choose a causative action a that is applicable in B.  Return Plan(a(B), G, P+a)  II. Non-deterministically choose a sensing action s that senses a formula f (could be a single state variable)  Let p’ = Plan(B&f, G, nil); p’’ = Plan(B&~f, G, nil)  /* B&f is the set of states of B in which f is true */  Return P+(s?:p’;p’’) If we always pick I and never do II, then we will produce conformant plans (if we succeed). (A backtracking rendering of this procedure is sketched below.)
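Below is a depth-bounded, backtracking rendering of this non-deterministic procedure; the action and sensor encodings are mine, chosen only for illustration (causative actions are tried before sensing, so an all-choice-I run yields a conformant plan):

```python
# Belief states are frozensets of world states; a causative action is a triple
# (name, applicable, progress); a sensing action is a pair (name, formula).
def plan(belief, goal, causative_actions, sensing_actions, depth=4):
    if all(goal(s) for s in belief):
        return []                                   # empty plan: goal already holds
    if depth == 0:
        return None
    # Choice I: apply a causative action to the whole belief state.
    for name, applicable, progress in causative_actions:
        if all(applicable(s) for s in belief):
            sub = plan(frozenset(progress(s) for s in belief), goal,
                       causative_actions, sensing_actions, depth - 1)
            if sub is not None:
                return [name] + sub
    # Choice II: apply a sensing action and plan for both partitions.
    for name, f in sensing_actions:
        b_true  = frozenset(s for s in belief if f(s))
        b_false = frozenset(s for s in belief if not f(s))
        if not b_true or not b_false:
            continue                                # sensing tells us nothing new
        p_t = plan(b_true,  goal, causative_actions, sensing_actions, depth - 1)
        p_f = plan(b_false, goal, causative_actions, sensing_actions, depth - 1)
        if p_t is not None and p_f is not None:
            return [(name, p_t, p_f)]               # branching (tree) plan
    return None
```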

Remarks on progression with sensing actions  Progression is implicitly finding an AND subtree of an AND/OR graph  If we look for AND subgraphs, we can represent DAGs.  The amount of sensing done in the eventual solution plan is controlled by how often we pick step I vs. step II (if we always pick I, we get conformant solutions).  Progression is as clueless about whether to do sensing, and which sensing to do, as it is about which causative action to apply  Need heuristic support

Heuristics for sensing  We need to compare the cumulative distance of B1 and B2 to the goal with that of B3 to the goal  Notice that planning cost is related to plan size, while plan execution cost is related to the length of the deepest branch (or the expected length of a branch)  If we use the conformant belief-state distance (as discussed last class), then we will be overestimating the distance (since sensing may allow us to use a shorter branch)  Bryce [ICAPS 05—submitted] starts with the conformant relaxed plan and introduces sensory actions into the plan to estimate the cost more accurately

Sensing: More things under the mat (which we won’t lift for now)  Sensing extends the notion of goals (and action preconditions).  Findout goals: Check if Rao is awake vs. Wake up Rao  Presents some tricky issues in terms of goal satisfaction…!  You cannot use “causative” effects to support “findout” goals  But what if the causative effects are supporting another needed goal and wind up affecting the findout goal as a side-effect? (e.g. Have-gong-go-off & find-out-if-rao-is-awake)  Quantification is no longer syntactic sugaring in effects and preconditions in the presence of sensing actions  Rm* can satisfy the effect forall files remove(file) without KNOWING what the files in the directory are!  This is an alternative to finding each file’s name and doing rm  Sensing actions can have preconditions (as well as other causative effects); they can have cost  The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also Sphexishness) [XII/Puccini project]  Handling over-sensing using local closed-world assumptions  Listing a file doesn’t destroy your knowledge about the size of a file, but compressing it does. If you don’t recognize this, you will always be checking the size of the file after each and every action

Very simple Example: A1: p => r, ~p; A2: ~p => r, p; A3: r => g; O5: observe(p). Problem: Init: don’t know p. Goal: g. Plan: O5:p? [A1; A3][A2; A3]. Notice that in this case we also have a conformant plan: A1; A2; A3 --Whether or not the conformant plan is cheaper depends on how costly the sensing action O5 is compared to A1 and A2

Very simple Example: A1: p => r, ~p; A2: ~p => r, p; A3: r => g; O5: observe(p). Problem: Init: don’t know p. Goal: g. Plan: O5:p? [A1; A3][A2; A3]. [Plan tree: O5:p? branches on Y to A1; A3 and on N to A2; A3]
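A quick sanity check of this example in code (the conditional-effect encoding is mine): whichever value p starts with, both the branching plan and the conformant plan A1;A2;A3 reach g.

```python
def A1(s):                       # p => r, ~p (no effect if p is false)
    return {**s, 'r': True, 'p': False} if s['p'] else s

def A2(s):                       # ~p => r, p (no effect if p is true)
    return {**s, 'r': True, 'p': True} if not s['p'] else s

def A3(s):                       # r => g
    return {**s, 'g': True} if s['r'] else s

def conditional_plan(s):         # O5:p? [A1; A3][A2; A3]
    return A3(A1(s)) if s['p'] else A3(A2(s))

def conformant_plan(s):          # A1; A2; A3 -- no sensing needed
    return A3(A2(A1(s)))

for p0 in (True, False):
    s0 = {'p': p0, 'r': False, 'g': False}
    assert conditional_plan(s0)['g'] and conformant_plan(s0)['g']
print("both plans achieve g from either initial value of p")
```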

A more interesting example: Medication. The patient is not Dead and may be Ill. The test paper is not Blue. We want to make the patient not Dead and not Ill. We have three actions: Medicate, which makes the patient not Ill if he is Ill; Stain, which makes the test paper Blue if the patient is Ill; Sense-paper, which can tell us whether the paper is Blue or not. No conformant plan is possible here. Also, notice that I cannot be sensed directly but only through B. This domain is partially observable because the states (~D,I,~B) and (~D,~I,~B) cannot be distinguished

“Goal directed” conditional planning  Recall that regression of two belief states B&f and B&~f over a sensing action Sense-f will result in the belief state B  Searching with this definition leads to two challenges: 1. We have to combine search states into single ones (a sort of reverse AO* operation) 2. We may need to explicitly condition a goal formula in the partially observable case (especially when certain fluents can only be indirectly sensed)  An example is the Medicate domain, where I has to be found through B  If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich)  Of course, we need to pick the f such that f/~f can be sensed (i.e., f and ~f define an observational class feature)  This step seems to go against the grain of “goal-directedness”—we may not know what to sense based on what our goal is, after all!  Regression for the PO case is still not well understood

Regression

Handling the “combination” during regression  We have to combine search states into single ones (a sort of reverse AO* operation)  Two ideas: 1. In addition to the normal regression children, also generate children from any pair of regressed states on the search fringe (has a breadth-first feel. Can be expensive!) [Tuan Le does this] 2. Do a contingent regression. Specifically, go ahead and generate B from B&f using Sense-f; but now you have to go “forward” from the “not-f” branch of Sense-f to the goal too. [CNLP does this; see the example]

Need for explicit conditioning during regression (not needed for the Fully Observable case)  If you have a goal state B, you can always write it as B&f and B&~f for any arbitrary f! (The goal Happy is achieved by achieving the twin goals Happy&rich as well as Happy&~rich)  Of course, we need to pick the f such that f/~f can be sensed (i.e., f and ~f define an observational class feature)  This step seems to go against the grain of “goal-directedness”—we may not know what to sense based on what our goal is, after all!  Consider the Medicate problem: coming from the goal of ~D&~I, we will never see the connection to sensing Blue! Notice the analogy to conditioning in evaluating a probabilistic query

Similar processing can be done for regression (PO planning is nothing but least-committed regression planning) We now have yet another way of handling unsafe links --Conditioning to put the threatening step in a different world!

Sensing: More things under the mat  Sensing extends the notion of goals too.  Check if Rao is awake vs. Wake up Rao  Presents some tricky issues in terms of goal satisfaction…!  Handling quantified effects and preconditions in the presence of sensing actions  Rm* can satisfy the effect forall files remove(file) without KNOWING what the files in the directory are!  Sensing actions can have preconditions (as well as other causative effects)  The problem of OVER-SENSING (sort of like the beginning-driver example; also Sphexishness) [XII/Puccini project]  Handling over-sensing using local closed-world assumptions  Listing a file doesn’t destroy your knowledge about the size of a file, but compressing it does. If you don’t recognize this, you will always be checking the size of the file after each and every action  A general action may have both causative effects and sensing effects  A sensing effect changes the agent’s knowledge, and not the world  A causative effect changes the world (and may give certain knowledge to the agent)  A pure sensing action only has sensing effects; a pure causative action only has causative effects.  The recent work on conditional planning has considered mostly simplistic sensing actions that have no preconditions and only have pure sensing effects.  Sensing has cost!

11/24 Replanning MDPs [HW4 updated; See the paper task; Only MDP stuff to be added]

Sensing: More things under the mat (which we won’t lift for now)  Sensing extends the notion of goals (and action preconditions).  Findout goals: Check if Rao is awake vs. Wake up Rao  Presents some tricky issues in terms of goal satisfaction…!  You cannot use “causative” effects to support “findout” goals  But what if the causative effects are supporting another needed goal and wind up affecting the findout goal as a side-effect? (e.g. Have-gong-go-off & find-out-if-rao-is-awake)  Quantification is no longer syntactic sugaring in effects and preconditions in the presence of sensing actions  Rm* can satisfy the effect forall files remove(file) without KNOWING what the files in the directory are!  This is an alternative to finding each file’s name and doing rm  Sensing actions can have preconditions (as well as other causative effects); they can have cost  The problem of OVER-SENSING (sort of like a beginning driver who looks in all directions every 3 millimeters of driving; also Sphexishness) [XII/Puccini project]  Handling over-sensing using local closed-world assumptions  Listing a file doesn’t destroy your knowledge about the size of a file, but compressing it does. If you don’t recognize this, you will always be checking the size of the file after each and every action Review

Sensing: Limited Contingency planning  In many real-world scenarios, having a plan that works in all contingencies is too hard  One idea is to make a plan for some of the contingencies, and monitor/replan as necessary.  Qn: What contingencies should we plan for?  The ones that are most likely to occur… (need likelihoods)  Qn: What do we do if an unexpected contingency arises?  Monitor (the observable parts of the world)  When the world goes outside the expected states, replan starting from the current state.

Things more complicated if the world is partially observable  Need to insert sensing actions to sense fluents that can only be indirectly sensed

“Triangle Tables”

This involves disjunctive goals!

Replanning—Respecting Commitments  In the real world, where you make commitments based on your plan, you cannot just throw away the plan at the first sign of failure  One heuristic is to reuse as much of the old plan as possible while replanning.  A more systematic approach is to 1. Capture the commitments made by the agent based on the current plan 2. Give these commitments as additional soft constraints to the planner

Replanning as a universal antidote…  If the domain is observable and lenient to failures, and we are willing to do replanning, then we can always handle non-deterministic as well as stochastic actions with classical planning! 1. Solve the “deterministic” relaxation of the problem 2. Start executing it, while monitoring the world state 3. When an unexpected state is encountered, replan (a sketch of this loop follows)  A planner that did this in the First Intl. Planning Competition—Probabilistic Track, called FF-Replan, won the competition.
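A minimal sketch of this determinize-plan-monitor-replan loop; classical_planner, determinize, and execute_and_observe are placeholders of my own, not FF-Replan's actual interfaces:

```python
def replan_loop(state, goal, actions, classical_planner, determinize, execute_and_observe):
    det_actions = determinize(actions)                # e.g. keep only the most likely outcome
    while not goal(state):
        plan = classical_planner(state, goal, det_actions)
        if plan is None:
            raise RuntimeError("no plan in the deterministic relaxation")
        for step in plan:
            expected = det_actions[step](state)       # what the relaxation predicts
            state = execute_and_observe(step, state)  # stochastic outcome in the real world
            if state != expected:                     # surprise: drop the rest of the plan
                break
    return state
```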

30 years of research into programming languages… and C++ is the result? 20 years of research into decision-theoretic planning… and FF-Replan is the result?

Models of Planning

Uncertainty      Observation: Complete    Partial       None
Deterministic    Classical                ???           ???
Disjunctive      Contingent               Contingent    Conformant
Probabilistic    (FO)MDP                  POMDP         (NO)MDP

MDPs as Utility-based problem solving agents Repeat

[can generalize to have action costs C(a,s)] If the M_ij matrix is not known a priori, then we have a reinforcement learning scenario. Repeat

(Value) How about the deterministic case? U(s_i) is the shortest path to the goal

Think of these as h*() values… called the value function U*. Think of these as related to h* values. Repeat


Policies change with rewards…

What does a solution to an MDP look like? The solution should tell the optimal action to do in each state (called a “Policy”)
– A policy is a function from states to actions (*see the finite horizon case below*)
– Not a sequence of actions anymore; this is needed because of the non-deterministic actions
– If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
How do we get the best policy?
– Pick the policy that gives the maximal expected reward
– For each policy: simulate the policy (take the actions suggested by the policy) to get behavior traces, evaluate the behavior traces, and take the average value of the behavior traces (see the sketch after this slide)
How long should behavior traces be?
– Each trace is no longer than k (Finite Horizon case): the policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is), e.g. financial portfolio advice for yuppies vs. retirees
– No limit on the size of the trace (Infinite Horizon case): the policy is not horizon-dependent
Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
– Yes… We will concentrate on infinite horizon problems (infinite horizon doesn’t necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state)
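Here is a small sketch of the simulation-based policy evaluation mentioned above (per-state rewards, discounting, and all names are my assumptions):

```python
import random

# transitions[(s, a)] maps next states to probabilities; rewards maps states to rewards.
def simulate(policy, transitions, rewards, start, horizon, gamma=0.95):
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        next_states, probs = zip(*transitions[(s, a)].items())
        s = random.choices(next_states, probs)[0]     # sample the stochastic outcome
        total += discount * rewards.get(s, 0.0)
        discount *= gamma
    return total

def evaluate_policy(policy, transitions, rewards, start, horizon=50, n_traces=1000):
    return sum(simulate(policy, transitions, rewards, start, horizon)
               for _ in range(n_traces)) / n_traces   # average value of the behavior traces
```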

[Figure: grid-world MDP example with transition probabilities 0.8 and 0.1]

Why are values coming down first? Why are some states reaching their optimal value faster? Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.
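A minimal sketch of the Bellman backup behind these updates, in the U(s) = R(s) + γ max_a Σ_s' P(s'|s,a) U(s') form, showing both a synchronous sweep and a single asynchronous update; the data-structure choices are mine:

```python
# T[(s, a)] maps next states to probabilities; R maps states to rewards;
# actions(s) returns the actions applicable in s.
def q_value(s, a, U, T, R, gamma):
    return R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())

def bellman_backup(s, U, T, R, actions, gamma):
    return max(q_value(s, a, U, T, R, gamma) for a in actions(s))

def synchronous_sweep(U, T, R, actions, gamma):
    return {s: bellman_backup(s, U, T, R, actions, gamma) for s in U}

def asynchronous_update(U, s, T, R, actions, gamma):
    U[s] = bellman_backup(s, U, T, R, actions, gamma)   # states can be updated in any order
```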

Terminating Value Iteration The basic idea is to terminate the value iteration when the values have “converged” (i.e., are not changing much from iteration to iteration) –Set a threshold ε and stop when the change across two consecutive iterations is less than ε –There is a minor problem since the value is a vector: we can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε. The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||U_i – U_{i+1}|| < ε (see the sketch below).
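The stopping test, as a sketch that reuses synchronous_sweep from the previous snippet:

```python
def value_iteration(U0, T, R, actions, gamma, epsilon=1e-4):
    U = dict(U0)
    while True:
        U_next = synchronous_sweep(U, T, R, actions, gamma)   # one Bellman sweep
        max_change = max(abs(U_next[s] - U[s]) for s in U)    # ||U_i - U_{i+1}|| (max norm)
        U = U_next
        if max_change < epsilon:
            return U
```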

Policies converge earlier than values. There are a finite number of policies but an infinite number of value functions, so entire regions of the value-vector space are mapped to a specific policy; this is why policies may converge faster than values. Search in the space of policies: given a utility vector U_i we can compute the greedy policy π_{U_i}. The policy loss of π_{U_i} is ||U^{π_{U_i}} – U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension). [Figure: the space of value vectors V(S_1) vs. V(S_2) for an MDP with 2 states and 2 actions, partitioned into regions P_1…P_4, with U* marked]
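A small sketch of extracting the greedy policy π_{U_i} from a utility vector and measuring policy loss, reusing the q_value helper sketched earlier (names are mine):

```python
def greedy_policy(U, T, R, actions, gamma):
    return {s: max(actions(s), key=lambda a: q_value(s, a, U, T, R, gamma)) for s in U}

def policy_loss(U_pi, U_star):
    return max(abs(U_pi[s] - U_star[s]) for s in U_star)   # max-norm difference
```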

We can either solve the linear equations exactly, or solve them approximately by running the value iteration a few times (the update won’t have the “max” operation): n linear equations with n unknowns (see the sketch below).
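A sketch of the exact alternative: for a fixed policy π the Bellman equations are linear, U = R + γ T_π U, so we can solve (I − γ T_π) U = R directly (the names and data layout are mine):

```python
import numpy as np

# `states` is a list; T[(s, a)] maps next states to probabilities; R maps states to rewards.
def evaluate_policy_exactly(pi, states, T, R, gamma):
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))
    for s in states:
        for s2, p in T[(s, pi[s])].items():
            T_pi[idx[s], idx[s2]] = p
    r = np.array([R[s] for s in states])
    u = np.linalg.solve(np.eye(n) - gamma * T_pi, r)   # n linear equations, n unknowns
    return dict(zip(states, u))
```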

Other ways of solving MDPs. Value and policy iteration are the bedrock methods for solving MDPs. Both give optimality guarantees. Both of them tend to be very inefficient for large (several-thousand-state) MDPs. Many ideas are used to improve the efficiency while giving up optimality guarantees –E.g. consider the part of the policy for more likely states (envelope extension method) –Interleave “search” and “execution” (Real Time Dynamic Programming): do limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing—which is the action that is sending you the best value). The values of the leaf nodes are set to be their immediate rewards. If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value; otherwise, it is an approximation… (RTDP)

What if you see this as a game? The expected value computation is fine if you are maximizing “expected” return. But what if you are risk-averse (and think “nature” is out to get you)? Then V_2 = min(V_3, V_4). If you are a perpetual optimist, then V_2 = max(V_3, V_4).
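The three backups side by side, as a tiny sketch (p3 and p4 are the outcome probabilities; the names are mine):

```python
def expectimax_backup(V3, V4, p3, p4):
    return p3 * V3 + p4 * V4          # maximize expected return

def pessimistic_backup(V3, V4):
    return min(V3, V4)                # risk-averse: treat nature as an adversary

def optimistic_backup(V3, V4):
    return max(V3, V4)                # perpetual optimist
```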

Incomplete observability (the dreaded POMDPs) To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not) –The policy maps belief states to actions. In practice, this causes (humongous) problems –The space of belief states is “continuous” (even if the underlying world is discrete and finite). {GET IT? GET IT??} –Even approximate policies are hard to find (PSPACE-hard). Problems with a few dozen world states are hard to solve currently –“Depth-limited” exploration (such as that done in adversarial games) is the only option… Belief state = { s_1: 0.3, s_2: 0.4, s_4: 0.3 } [Figure: the belief state changes as we take actions, e.g. after 5 LEFTs and 5 UPs]
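For concreteness, a minimal sketch of the belief-state update that makes a POMDP an MDP over belief states; T, O, and the indexing scheme are my assumptions, not a particular solver's API:

```python
# b maps states to probabilities; T[(s, a)] maps next states to probabilities;
# O[(s2, a)] maps observations to probabilities. New belief is proportional to
# O(o | s', a) * sum_s T(s' | s, a) * b(s).
def belief_update(b, a, o, T, O):
    b_new = {}
    for s2 in {s2 for s in b for s2 in T[(s, a)]}:
        pred = sum(p * T[(s, a)].get(s2, 0.0) for s, p in b.items())
        b_new[s2] = O[(s2, a)].get(o, 0.0) * pred
    z = sum(b_new.values())
    return {s2: p / z for s2, p in b_new.items()} if z > 0 else b_new
```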

MDPs and Deterministic Search Problem-solving agent search corresponds to what special case of MDP? –Actions are deterministic; goal states are all equally valued, and are all sink states. Is it worth solving the problem using MDPs? –The construction of the optimal policy is overkill (the policy, in effect, gives us the optimal path from every state to the goal state(s)) –The value function, or its approximations, on the other hand, are useful. How? As heuristics for the problem-solving agent’s search. This shows an interesting connection between dynamic programming and “state search” paradigms –DP solves many related problems on the way to solving the one problem we want –State search tries to solve just the problem we want –We can use DP to find heuristics to run state search.

Modeling Softgoal problems as deterministic MDPs

SSPP—Stochastic Shortest Path Problem: an MDP with Init and Goal states. MDPs don’t have a notion of an “initial” and “goal” state (process orientation instead of “task” orientation) –Goals are sort of modeled by reward functions; this allows pretty expressive goals (in theory) –Normal MDP algorithms don’t use initial state information (since the policy is supposed to cover the entire state space anyway). Could consider “envelope extension” methods –Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to happen during execution –RTDP methods. SSPPs are a special case of MDPs where –(a) the initial state is given –(b) there are absorbing goal states –(c) actions have costs, and goal states have zero cost. A proper policy for an SSPP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states. For SSPPs, it would be worth finding a partial policy that only covers the “relevant” states (states that are reachable from the initial and goal states on any optimal policy) –Value/policy iteration don’t consider the notion of relevance –Consider “heuristic state search” algorithms, where the heuristic can be seen as the “estimate” of the value of a state.

AO* search for solving SSP problems Main issues: -- the cost of a node is the expected cost of its children -- the AND tree can have LOOPS, so cost backup is complicated. Intermediate nodes are given admissible heuristic estimates --can be just the shortest paths (or their estimates)

LAO*--turning bottom-up labeling into a full DP

RTDP Approach: Interleave Planning & Execution (Simulation) Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly—going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it {or simulate it}. Loop back. Leaf nodes are evaluated by using their “cached” values  if a node has been evaluated by RTDP analysis in the past, use its remembered value, else use the heuristic value  if not, use heuristics to estimate a. immediate reward values b. reachability heuristics. Sort of like depth-limited game-playing (expectimax) --Who is the game against? Can also do “reinforcement learning” this way  the M_ij are not known correctly in RL

Greedy “On-Policy” RTDP without execution  Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize (see the sketch below).
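A minimal sketch of such a greedy trial, reusing the q_value helper from the value-iteration sketch; the helper names and data layout are mine:

```python
import random

def rtdp_trial(start, U, T, R, actions, gamma, terminal, max_steps=1000):
    s = start
    for _ in range(max_steps):
        if terminal(s):
            break
        a = max(actions(s), key=lambda a: q_value(s, a, U, T, R, gamma))  # greedy action
        U[s] = q_value(s, a, U, T, R, gamma)                              # update along the path
        next_states, probs = zip(*T[(s, a)].items())
        s = random.choices(next_states, probs)[0]                         # simulate the outcome
    return U

# Loop over trials until the values stabilize (e.g. small max-norm change between trials).
```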

Envelope Extension Methods For each action, take the most likely outcome and discard the rest. Find a plan (deterministic path) from the Init to the Goal state. This is a (very partial) policy for just the states that fall on the maximum-probability state sequence. Consider the states that are most likely to be encountered while traveling this path. Find a policy for those states too. The tricky part is to show that we can converge to the optimal policy