Models of Planning: Classical, Contingent, and Conformant planning alongside (FO)MDP, POMDP, and (NO)MDP, organized by Observation (Complete, Partial, None) and Uncertainty (Deterministic, Disjunctive, Probabilistic).

Presentation transcript:

Models of Planning

Observation \ Uncertainty    Deterministic    Disjunctive    Probabilistic
Complete                     Classical        Contingent     (FO)MDP
Partial                      ???              Contingent     POMDP
None                         ???              Conformant     (NO)MDP

MDPs as Utility-based problem solving agents

[Can generalize to have action costs C(a,s).] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.

(Value) How about the deterministic case? U(s_i) is the length of the shortest path from s_i to the goal.

Think of these as h*() values. They are called the value function U*.

Policies change with rewards.

What does a solution to an MDP look like?
The solution should tell the optimal action to do in each state (called a “Policy”)
–Policy is a function from states to actions (* see finite horizon case below *)
–Not a sequence of actions anymore; needed because of the non-deterministic actions
–If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies
How do we get the best policy?
–Pick the policy that gives the maximal expected reward
–For each policy π: simulate the policy (take the actions suggested by the policy) to get behavior traces; evaluate the behavior traces; take the average value of the behavior traces.
How long should behavior traces be?
–Each trace is no longer than k (finite horizon case). The policy will be horizon-dependent (the optimal action depends not just on what state you are in, but on how far away your horizon is)
 –E.g.: financial portfolio advice for yuppies vs. retirees.
–No limit on the size of the trace (infinite horizon case). The policy is not horizon-dependent.
Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
–Yes…
We will concentrate on infinite horizon problems (infinite horizon doesn’t necessarily mean that all behavior traces are infinite. They could be finite and end in a sink state)
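The brute-force recipe on this slide (enumerate all |A|^|S| policies, simulate each, average the traces) can be sketched as below. This is only a minimal illustration: the two-state MDP, its action names, the rewards, and the trace count are hypothetical choices made for the example, not taken from the slides.

```python
import itertools
import random

# Hypothetical toy MDP (an assumption for illustration): states 0 and 1, sink state 2.
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {"safe": [(1.0, 1, 1.0)], "risky": [(0.5, 1, 0.0), (0.5, 2, -5.0)]},
    1: {"safe": [(1.0, 2, 1.0)], "risky": [(0.5, 2, 0.0), (0.5, 0, -5.0)]},
}
SINK = 2
STATES = sorted(TRANSITIONS)
ACTIONS = ["safe", "risky"]


def simulate(policy, start, rng, max_steps=50):
    """Follow `policy` from `start`; return the total reward of one behavior trace."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == SINK:
            break
        outcomes = TRANSITIONS[state][policy[state]]
        weights = [p for p, _, _ in outcomes]
        _, state, reward = rng.choices(outcomes, weights=weights)[0]
        total += reward
    return total


def evaluate(policy, start=0, traces=200, seed=0):
    """Average the value of several simulated traces (Monte Carlo estimate)."""
    rng = random.Random(seed)
    return sum(simulate(policy, start, rng) for _ in range(traces)) / traces


# One policy per assignment of an action to every state: |A|^|S| in total.
all_policies = [
    dict(zip(STATES, choice))
    for choice in itertools.product(ACTIONS, repeat=len(STATES))
]
best_policy = max(all_policies, key=evaluate)
```

Even in this tiny case the enumeration grows as |A|^|S|, which is why the slides move on to value iteration instead.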

Bellman Eqn with Action Costs [figure labels: .8, .1]

Why are values coming down first? Why are some states reaching their optimal value faster? Updates can be done synchronously OR asynchronously --convergence is guaranteed as long as each state is updated infinitely often. [figure labels: .8, .1]

Terminating Value Iteration
The basic idea is to terminate the value iteration when the values have “converged” (i.e., are not changing much from iteration to iteration)
–Set a threshold ε and stop when the change across two consecutive iterations is less than ε
–There is a minor problem since the value is a vector
 We can bound the maximum change that is allowed in any of the dimensions between two successive iterations by ε
 The max norm ||.|| of a vector is the maximal absolute value among all its dimensions. We are basically terminating when ||U_i – U_{i+1}|| < ε
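The stopping rule above can be sketched as a small synchronous value iteration that halts once the max-norm change drops below ε. The toy MDP, the discount factor gamma = 0.9, and the reward-on-transition form of the update are assumptions made for this illustration; the slides' own formulation (with R(s) and M_ij) may differ in detail.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {"safe": [(1.0, 1, 1.0)], "risky": [(0.5, 1, 0.0), (0.5, 2, -5.0)]},
    1: {"safe": [(1.0, 2, 1.0)], "risky": [(0.5, 2, 0.0), (0.5, 0, -5.0)]},
}
SINK = 2  # absorbing state whose value stays 0


def value_iteration(transitions, sink, gamma=0.9, epsilon=1e-6):
    """Synchronous value iteration; stops when ||U_i - U_{i+1}|| (max norm) < epsilon."""
    values = {s: 0.0 for s in transitions}
    values[sink] = 0.0
    iterations = 0
    while True:
        new_values = dict(values)
        for state, actions in transitions.items():
            # Bellman update: best expected (reward + discounted successor value).
            new_values[state] = max(
                sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)
                for outcomes in actions.values()
            )
        # Max norm of the difference between two successive value vectors.
        change = max(abs(new_values[s] - values[s]) for s in values)
        values = new_values
        iterations += 1
        if change < epsilon:
            return values, iterations


values, iterations = value_iteration(TRANSITIONS, SINK)
```

On this toy problem the values settle after three sweeps; larger problems typically need many more, which is where the choice of ε matters.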

11/29: MDP Continued..

Stationarity, Markov property etc.
It is possible to convert a non-mark…
MDPs are Markov chains + agent control (in terms of actions)
–FOMDPs are normal Markov chains + actions
–POMDPs are hidden Markov models + actions
The evolution of an MDP will look like a Markov chain to an outside agent

Value Update using Action Costs

Policies converge earlier than values
There are a finite number of policies but an infinite number of value functions.
–So entire regions of the value vector space are mapped to a specific policy
–So policies may be converging faster than values. Search in the space of policies
Given a utility vector U_i we can compute the greedy policy π_Ui
–The policy loss of π_Ui is ||U^π_Ui − U*|| (the max norm difference of two vectors is the maximum amount by which they differ on any dimension)
[Figure: an MDP with 2 states and 2 actions; axes V(S_1) and V(S_2); labels P_1, P_2, P_3, P_4 and U*]

Evaluating a fixed policy gives n linear equations with n unknowns. We can either solve the linear eqns exactly, or solve them approximately by running the value iteration a few times (the update won’t have the “max” operation).
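A minimal sketch of policy iteration using the approximate route mentioned here: the policy is evaluated by repeated updates without the "max", and then improved greedily. The toy MDP, the discount factor, and the function names are assumptions for illustration only.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the slides).
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {"safe": [(1.0, 1, 1.0)], "risky": [(0.5, 1, 0.0), (0.5, 2, -5.0)]},
    1: {"safe": [(1.0, 2, 1.0)], "risky": [(0.5, 2, 0.0), (0.5, 0, -5.0)]},
}
SINK = 2
GAMMA = 0.9  # assumed discount factor


def expected_value(outcomes, values):
    return sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in outcomes)


def evaluate_policy(policy, transitions, sink, epsilon=1e-9):
    """Approximate solution of the n linear equations: updates with no 'max'."""
    values = {s: 0.0 for s in transitions}
    values[sink] = 0.0
    while True:
        change = 0.0
        for state in transitions:
            new_value = expected_value(transitions[state][policy[state]], values)
            change = max(change, abs(new_value - values[state]))
            values[state] = new_value  # in-place (asynchronous) update
        if change < epsilon:
            return values


def greedy_policy(values, transitions):
    """Pick, in every state, the action with the best one-step lookahead value."""
    return {
        state: max(actions, key=lambda a: expected_value(actions[a], values))
        for state, actions in transitions.items()
    }


def policy_iteration(transitions, sink, initial_action="risky"):
    policy = {s: initial_action for s in transitions}
    while True:
        values = evaluate_policy(policy, transitions, sink)
        improved = greedy_policy(values, transitions)
        if improved == policy:  # policy is stable, so we stop
            return policy, values
        policy = improved


policy, values = policy_iteration(TRANSITIONS, SINK)
```

An exact variant would replace `evaluate_policy` with a direct linear solve; the iterative version is used here only to stay within the standard library.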

Action CLK

MDPs and Deterministic Search
Problem solving agent search corresponds to what special case of MDP?
–Actions are deterministic; goal states are all equally valued, and are all sink states.
The Bellman update is like a regression step!
–But it regresses over all actions (and takes the max)
Is it worth solving the problem using MDPs?
–The construction of the optimal policy is overkill. The policy, in effect, gives us the optimal path from every state to the goal state(s)
–The value function, or its approximations, on the other hand, are useful. How? As heuristics for the problem solving agent’s search
This shows an interesting connection between the dynamic programming and “state search” paradigms
–DP solves many related problems on the way to solving the one problem we want
–State search tries to solve just the problem we want
–We can use DP to find heuristics to run state search.

Modeling Softgoal problems as deterministic MDPs
Consider the net-benefit problem, where actions have costs, goals have utilities, and we want a plan with the highest net benefit.
How do we model this as an MDP?
–(wrong idea): Make every state in which any subset of goals holds into a sink state with reward equal to the cumulative sum of utilities of the goals. Problem: what if achieving g1 ∧ g2 will necessarily lead you through a state where g1 is already true?
–(correct version): Make a new fluent called “done” and a dummy action called Done-Deal. It is applicable in any state and asserts the fluent “done”. All “done” states are sink states. Their reward is equal to the sum of rewards of the individual states.

Scaling up (FO)MDP Approaches
Value and Policy iteration are the bed-rock methods for solving MDPs. Both give optimality guarantees
–Both of them tend to be very inefficient for large (several thousand state) MDPs
Methods for improving them fall in 2 categories
–Improve Solution Techniques: either by restricting the type of MDP or by giving up optimality guarantees
–Improve Representation Techniques: factored representations for Actions, Reward Functions, Values and Policies; directly manipulating factored representations during the Bellman update

Other ways of solving MDPs
Value and Policy iteration are the bed-rock methods for solving MDPs. Both give optimality guarantees
Both of them tend to be very inefficient for large (several thousand state) MDPs
Many ideas are used to improve the efficiency while (sometimes) giving up optimality guarantees
–E.g. consider the part of the policy for more likely states (envelope extension method)
–Interleave “search” and “execution” (Real Time Dynamic Programming, RTDP)
 Do limited-depth analysis based on reachability to find the value of a state (and thereby the best action you should be doing, which is the action that is sending you to the best value)
 The values of the leaf nodes are set to be their immediate rewards
 If all the leaf nodes are terminal nodes, then the backed-up value will be the true optimal value. Otherwise, it is an approximation…

SSPP: Stochastic Shortest Path Problem (an MDP with Init and Goal states)
MDPs don’t have a notion of an “initial” and “goal” state (process orientation instead of “task” orientation)
–Goals are sort of modeled by reward functions. This allows pretty expressive goals (in theory)
–Normal MDP algorithms don’t use initial state information (since the policy is supposed to cover the entire search space anyway)
 Could consider “envelope extension” methods
 –Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to happen during execution
 –RTDP methods
SSPPs are a special case of MDPs where
–(a) the initial state is given
–(b) there are absorbing goal states
–(c) actions have costs; all states have zero rewards
A proper policy for an SSPP is a policy which is guaranteed to ultimately put the agent in one of the absorbing states
For SSPPs, it would be worth finding a partial policy that only covers the “relevant” states (states that are reachable from init and goal states on any optimal policy)
–Value/Policy Iteration don’t consider the notion of relevance
–Consider “heuristic state search” algorithms. The heuristic can be seen as the “estimate” of the value of a state.

AO* search for solving SSP problems
Main issues:
-- Cost of a node is the expected cost of its children
-- The And tree can have LOOPS, so cost backup is complicated
Intermediate nodes are given admissible heuristic estimates
--can be just the shortest paths (or their estimates)

LAO*--turning bottom-up labeling into a full DP

RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value. Do it {or simulate it}. Loop back.
Leaf nodes are evaluated by
–Using their “cached” values: if this node has been evaluated using RTDP analysis in the past, use its remembered value; else use the heuristic value
–If not, using heuristics to estimate: (a) immediate reward values, (b) reachability heuristics
Sort of like depth-limited game-playing (expectimax) --Who is the game against?
Can also do “reinforcement learning” this way: the M_ij are not known correctly in RL

Greedy “On-Policy” RTDP without execution
Using the current utility values, select the action with the highest expected utility (greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.
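A minimal sketch of this greedy trial loop, written here in the cost-minimizing (stochastic shortest path) form with all values initialized to a zero heuristic. The three-state problem, its costs, the seed, and the stopping thresholds are hypothetical and chosen only to keep the example small.

```python
import random

# Hypothetical SSP (an assumption for illustration): minimize expected cost to reach G.
# TRANSITIONS[state][action] is a list of (probability, next_state, cost).
TRANSITIONS = {
    "A": {"short": [(0.5, "G", 1.2), (0.5, "A", 1.2)], "long": [(1.0, "B", 1.0)]},
    "B": {"go": [(0.9, "G", 0.5), (0.1, "A", 0.5)]},
}
GOAL = "G"


def q_value(values, outcomes):
    """Expected cost of an action: step cost plus value of the sampled successor."""
    return sum(p * (cost + values[nxt]) for p, nxt, cost in outcomes)


def rtdp(transitions, init, goal, seed=0, epsilon=1e-9, max_trials=1000, max_steps=100):
    rng = random.Random(seed)
    values = {s: 0.0 for s in transitions}  # zero heuristic (admissible for costs)
    values[goal] = 0.0
    for _ in range(max_trials):
        state, biggest_change = init, 0.0
        for _ in range(max_steps):
            if state == goal:
                break
            actions = transitions[state]
            # Greedy action under the current values, then a Bellman backup.
            best = min(actions, key=lambda a: q_value(values, actions[a]))
            new_value = q_value(values, actions[best])
            biggest_change = max(biggest_change, abs(new_value - values[state]))
            values[state] = new_value
            # Simulate (rather than execute) the greedy action.
            outcomes = actions[best]
            state = rng.choices(
                [o[1] for o in outcomes], weights=[o[0] for o in outcomes]
            )[0]
        if biggest_change < epsilon:  # values along the path have stabilized
            break
    return values


values = rtdp(TRANSITIONS, "A", GOAL)
```

Only states actually visited on greedy trials get updated, which is the point of the method: states that are irrelevant from the initial state are never touched.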

Envelope Extension Methods For each action, take the most likely outcome and discard the rest. Find a plan (deterministic path) from Init to Goal state. This is a (very partial) policy for just the states that fall on the maximum probability state sequence. Consider states that are most likely to be encountered while traveling this path. Find policy for those states too. Tricky part is to show that we can converge to the optimal policy
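The first two steps of envelope extension can be sketched as follows: keep only the most likely outcome of each action, find a cheapest deterministic path from Init to Goal, and then list the off-path states the partial policy might still fall into. The state names, probabilities, costs, and the use of uniform-cost search are assumptions for this illustration.

```python
import heapq

# Hypothetical problem (an assumption for illustration).
# TRANSITIONS[state][action] is a list of (probability, next_state, cost).
TRANSITIONS = {
    "S": {"go": [(0.8, "M", 1.0), (0.2, "X", 1.0)], "detour": [(1.0, "G", 5.0)]},
    "M": {"go": [(0.9, "G", 1.0), (0.1, "X", 1.0)]},
    "X": {"recover": [(1.0, "S", 2.0)]},
}


def most_likely_plan(transitions, init, goal):
    """Uniform-cost search over the most-likely-outcome determinization.

    Returns a partial policy {state: action} for the states on the path,
    or None if the goal is unreachable in the determinized problem.
    """
    frontier = [(0.0, init, [])]
    visited = set()
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return dict(plan)
        if state in visited:
            continue
        visited.add(state)
        for action, outcomes in transitions.get(state, {}).items():
            # Keep the most likely outcome of the action and discard the rest.
            _, nxt, step_cost = max(outcomes, key=lambda o: o[0])
            heapq.heappush(frontier, (cost + step_cost, nxt, plan + [(state, action)]))
    return None


def fringe_states(transitions, policy, goal):
    """States reachable by executing the partial policy but not covered by it."""
    fringe = set()
    for state, action in policy.items():
        for _, nxt, _ in transitions[state][action]:
            if nxt != goal and nxt not in policy:
                fringe.add(nxt)
    return fringe


policy = most_likely_plan(TRANSITIONS, "S", "G")
fringe = fringe_states(TRANSITIONS, policy, "G")
```

Extending the envelope would then mean planning from each fringe state in turn and adding the resulting actions to the policy; this sketch stops before that step.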

What if you see this as a game? The expected value computation is fine if you are maximizing “expected” return. What if you are risk-averse (and think “nature” is out to get you)? Then V_2 = min(V_3, V_4). If you are a perpetual optimist, then V_2 = max(V_3, V_4).
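The three attitudes differ only in how the children's values are backed up to the parent. A small sketch, with made-up values for V_3 and V_4 and an assumed 50/50 outcome distribution:

```python
def backup(child_values, probabilities, attitude="expected"):
    """Back up child values to the parent node under a given attitude to risk."""
    if attitude == "expected":   # maximize expected return
        return sum(p * v for p, v in zip(probabilities, child_values))
    if attitude == "pessimist":  # risk-averse: nature picks the worst outcome
        return min(child_values)
    if attitude == "optimist":   # perpetual optimist: nature picks the best outcome
        return max(child_values)
    raise ValueError(f"unknown attitude: {attitude}")


# Hypothetical numbers: V_3 = 10, V_4 = -2, each outcome equally likely.
v3, v4 = 10.0, -2.0
v2_expected = backup([v3, v4], [0.5, 0.5], "expected")
v2_pessimist = backup([v3, v4], [0.5, 0.5], "pessimist")
v2_optimist = backup([v3, v4], [0.5, 0.5], "optimist")
```

The pessimistic backup is the minimax rule from adversarial game playing, with nature cast as the opponent.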

Factored Representations: Actions
Actions can be represented directly in terms of their effects on the individual state variables (fluents).
–Write a Bayes Network relating the values of fluents at the state before and after the action
 Bayes networks representing fluents at different time points are called “Dynamic Bayes Networks”. We look at 2TBNs (2-time-slice dynamic Bayes nets)
 The CPTs of the BNs can be represented compactly too!
Go further by using the STRIPS assumption
–Fluents not affected by the action are not represented explicitly in the model
–Called the Probabilistic STRIPS Operator (PSO) model

Factored Representations: Reward, Value and Policy Functions
Reward functions can be represented in factored form too. Possible representations include
–Decision trees (made up of fluents)
–ADDs (Algebraic decision diagrams)
Value functions are like reward functions (so they too can be represented similarly)
The Bellman update can then be done directly using factored representations.

SPUDDs use of ADDs

Direct manipulation of ADDs in SPUDD

Incomplete observability (the dreaded POMDPs)
To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when world states are not)
–Policy maps belief states to actions
In practice, this causes (humongous) problems
–The space of belief states is “continuous” (even if the underlying world is discrete and finite). {GET IT? GET IT??}
–Even approximate policies are hard to find (PSPACE-hard). Problems with a few dozen world states are hard to solve currently
–“Depth-limited” exploration (such as that done in adversarial games) is the only option…
Belief state = {s_1: 0.3, s_2: 0.4, s_4: 0.3}
[Figure: belief states change as we take actions; labels: 5 LEFTs, 5 UPs]
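How a belief state changes under actions can be sketched with a standard Bayes-filter update. The four-cell corridor, the 0.8 success probability for LEFT, and the sensor likelihoods below are hypothetical; the slide's own grid world is not reproduced here.

```python
STATES = [0, 1, 2, 3]  # hypothetical corridor cells; cell 0 is against the left wall


def left_model(state):
    """Return {next_state: probability} for LEFT (assumed to succeed 80% of the time)."""
    if state == 0:
        return {0: 1.0}  # already at the wall, so the agent stays put
    return {state - 1: 0.8, state: 0.2}


def predict(belief, model):
    """Action update: push the belief through the transition model."""
    new_belief = {s: 0.0 for s in belief}
    for state, prob in belief.items():
        for nxt, p in model(state).items():
            new_belief[nxt] += prob * p
    return new_belief


def observe(belief, likelihood):
    """Observation update: reweight by P(observation | state) and renormalize."""
    weighted = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}


# Start with no idea where we are, then take 5 LEFTs.
belief = {s: 1.0 / len(STATES) for s in STATES}
for _ in range(5):
    belief = predict(belief, left_model)

# Hypothetical wall sensor that fires with probability 0.9 at cell 0 and 0.1 elsewhere.
posterior = observe(belief, {0: 0.9, 1: 0.1, 2: 0.1, 3: 0.1})
```

After the five LEFTs almost all of the probability mass sits on cell 0, even though no observation was made. Every such belief is a point in a continuous space, which is what makes planning over beliefs expensive.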

Look up