5/6: Summary and Decision Theoretic Planning
- Last homework socket opened (two more problems to be added: Scheduling, MDPs)
- Project 3 due today
- Sapa homework points sent

Current Grades…

Sapa homework grades

[Figure: the planning landscape. "Classical Planning" assumes a static, deterministic, fully observable, instantaneous, propositional setting. Relaxing each assumption brings in a different technology:
- Dynamic worlds: replanning / situated plans
- Durative actions: temporal reasoning
- Continuous (numeric) quantities: constraint reasoning (LP/ILP)
- Stochastic actions: contingent/conformant plans with interleaved execution; MDP and Semi-MDP policies
- Partial observability: contingent/conformant plans with interleaved execution; POMDP policies]

All that water under the bridge…
- Actions, proofs, planning strategies (Week 2; 1/28, 1/30)
- More PO planning, dealing with partially instantiated actions, and the start of deriving heuristics (Week 3; 2/4, 2/6)
- Reachability heuristics contd. (2/11, 2/13)
- Heuristics for partial-order planning; Graphplan search (2/18, 2/20)
- EBL for Graphplan; solving the planning graph by compilation strategies (2/25, 2/27)
- Compilation to SAT, ILP and naive encodings (3/4, 3/6)
- Knowledge-based planners
- Metric-temporal planning: issues and representation
- Search techniques; heuristics
- Tracking multiple-objective heuristics (cost propagation); partialization; LPG
- Temporal constraint networks; scheduling
- Incompleteness and uncertainty; belief states; conformant planning (4/22, 4/24)
- Conditional planning (4/29, 5/1)
- Decision-theoretic planning…

Problems, Solutions, Success Measures: 3 orthogonal dimensions

Problems:
- Incompleteness in the initial state
- Un- (or partial) observability of states
- Non-deterministic actions
- Uncertainty in state or effects
- Complex reward functions (allowing degrees of satisfaction)

Solutions:
- Conformant plans: don't look, just do (sequences)
- Contingent/conditional plans: look, and based on what you see, do; look again (directed acyclic graphs)
- Policies: if in (belief) state S, do action a ((belief) state to action tables); this is the MDP/POMDP view

Success measures:
- Deterministic success: must reach a goal state with probability 1
- Probabilistic success: must succeed with probability >= k (0 <= k <= 1)
- Maximal expected reward: maximize the expected reward (an optimization problem)

The Trouble with Probabilities… Once we have probabilities associated with the action effects, as well as with the constituents of a belief state:
- The belief space size explodes: it becomes infinite. We may be able to find a plan if one exists, but exhaustively searching to prove that no plan exists is out of the question.
- Conformant probabilistic planning is known to be semi-decidable, so solving POMDPs is semi-decidable too.
- Probabilities introduce the notions of "partial satisfaction" and the "expected value" of a plan (rather than a 0-1 valuation).

MDPs are generalizations of Markov chains in which transitions are under the control of an agent; HMMs are correspondingly generalized to POMDPs. They are useful as normative modeling tools in a great many places: planning, (reinforcement) learning, multi-agent interactions, …

[aka action cost C(a,s)] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.

MDPs vs. Markov Chains
- Markov chains are transition systems in which transitions happen automatically.
- HMMs (hidden Markov models) are Markov chains in which the current state is only partially observable; they have been very useful in many different areas. Generalizing them in the same way leads to POMDPs.
- MDPs are generalizations of Markov chains where transitions are under the control of an agent.
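
To make the contrast concrete, here is a minimal sketch of how an MDP's ingredients (states, actions, transition probabilities, costs) might be written down in Python; the two-state example and all names are invented for illustration and are not taken from the lecture.

```python
# A tiny illustrative MDP: two states, two actions (names are made up).
# P[(s, a)] maps each successor s' to Pr(s' | s, a); C[(s, a)] is the action cost.
STATES  = ["s0", "s1"]
ACTIONS = ["stay", "move"]

P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},   # stochastic outcome
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}
C = {
    ("s0", "stay"): 1.0, ("s0", "move"): 2.0,
    ("s1", "stay"): 0.0, ("s1", "move"): 2.0,   # s1 acts as a zero-cost "goal"
}

# A Markov chain is the special case with a single action available in every
# state, so the agent makes no choices; the MDP adds the choice of action.
```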

Policies change with the rewards…

Why are the values coming down first? Why do some states reach their optimal value faster? Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.
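
For concreteness, here is a minimal synchronous value-iteration sketch over the illustrative toy MDP above (cost-minimizing form with a discount factor gamma; all identifiers are mine, not the lecture's). The asynchronous variant simply updates one state at a time, in any order that visits every state infinitely often.

```python
def value_iteration(states, actions, P, C, gamma=0.95, eps=1e-6):
    """Synchronous Bellman backups until the value function stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            # Bellman backup: best action by expected (cost + discounted future value)
            V_new[s] = min(
                C[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            return V
```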

Policies converge earlier than values. Given a utility vector U_i we can compute the greedy policy pi_{U_i}. The policy loss of pi is ||U^pi - U*||_inf (the max norm: the maximum amount by which the two vectors differ on any dimension). This motivates searching in the space of policies instead.
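
A small sketch of the greedy-policy extraction referred to here, i.e. computing pi_U from a utility/value vector by a one-step lookahead (same illustrative cost-based toy MDP as above; names are mine):

```python
def greedy_policy(V, states, actions, P, C, gamma=0.95):
    """In each state, pick the action that looks best one step ahead of V."""
    pi = {}
    for s in states:
        pi[s] = min(
            actions,
            key=lambda a: C[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
    return pi
```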

We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (the update won't have the max factor).
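
In other words, policy evaluation can be done exactly by solving the |S| linear equations V^pi = C_pi + gamma * P_pi * V^pi, or approximately by a few max-free sweeps. A sketch of both, using numpy and the illustrative toy dictionaries above (identifiers are assumptions for illustration):

```python
import numpy as np

def evaluate_policy_exact(pi, states, P, C, gamma=0.95):
    """Solve the |S| linear equations V = C_pi + gamma * P_pi * V exactly."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))
    c_pi = np.zeros(n)
    for s in states:
        a = pi[s]
        c_pi[idx[s]] = C[(s, a)]
        for s2, p in P[(s, a)].items():
            P_pi[idx[s], idx[s2]] = p
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
    return {s: V[idx[s]] for s in states}

def evaluate_policy_iterative(pi, states, P, C, gamma=0.95, sweeps=20):
    """Approximate evaluation: a few value-iteration-style sweeps without the max."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: C[(s, pi[s])] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
             for s in states}
    return V
```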

The Big Computational Issues in MDPs
- MDP models are quite easy to specify and to understand conceptually. The big issues are compactness and efficiency.
- Policy construction is polynomial in the size of the state space, which is bad news, since the state space is itself exponential in the number of state variables! For POMDPs, the state space is the belief space, which is infinite.
- Compact representations are needed for actions, the reward function, policies and value functions.
- Efficient methods are needed for policy/value updates.
- Representations that have been tried include decision trees, neural nets, Bayesian nets, and ADDs (algebraic decision diagrams, a generalization of BDDs in which the leaf nodes can carry real values instead of just T/F).

SPUDD: Using ADDs to Represent Actions, Rewards and Policies

MDPs and Planning Problems
- FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions.
- POMDPs (partially observable MDPs), a generalization of the MDP framework in which the current state can only be partially observed, are needed to handle planning problems with partial observability.
- POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (a continuous space).

SSP: Stochastic Shortest Path Problem, an MDP with initial and goal states
- MDPs by themselves don't have a notion of an "initial" and a "goal" state (a process orientation rather than a "task" orientation).
  -- Goals are, in effect, modeled by reward functions, which allows pretty expressive goals (in theory).
  -- Standard MDP algorithms don't use initial-state information, since the policy is supposed to cover the entire state space anyway. One could instead consider "envelope extension" methods: compute a "deterministic" plan, which gives the policy for some of the states, then extend the policy to other states that are likely to be encountered during execution. RTDP methods are in this spirit.
- SSPs are a special case of MDPs where (a) an initial state is given, (b) there are absorbing goal states, and (c) actions have costs, with goal states having zero cost.
  -- A proper policy for an SSP is one that is guaranteed to ultimately put the agent in one of the absorbing goal states.
  -- For an SSP it is worth finding a partial policy that covers only the "relevant" states (those reachable from the initial state under an optimal policy).
  -- Value/policy iteration do not exploit this notion of relevance, so we consider "heuristic state search" algorithms, where the heuristic is an "estimate" of the value of a state: (L)AO* or RTDP algorithms (or envelope extension methods).

AO* search for solving SSP problems. Main issues:
-- the cost of a node is the expected cost of its children;
-- the AND/OR graph can have LOOPS, which complicates cost backup.
Intermediate nodes are given admissible heuristic estimates; these can simply be shortest-path costs (or estimates of them).
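
Written out, the backup alluded to here is the usual expected-cost (Bellman) backup over an AND node's outcomes, with an admissible heuristic h supplying the values of unexpanded leaves:

```latex
f(n) = \min_{a \in A(n)} \Big[ C(n,a) + \sum_{n'} \Pr(n' \mid n, a)\, f(n') \Big],
\qquad f(\text{leaf}) = h(\text{leaf}).
```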

LAO*--turning bottom-up labeling into a full DP

RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S and update the stored value of S. Pick the action that leads to the best value, and do it (or simulate it). Loop back.
Leaf nodes are evaluated using their "cached" values: if a node has been evaluated during a past RTDP iteration, use its remembered value; otherwise use heuristics to estimate (a) immediate reward values and (b) reachability.
This is rather like depth-limited game playing (expectimax); who is the game against?
"Reinforcement learning" can also be done this way: in RL the M_ij are not known exactly.
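
Below is a compressed sketch of one such trial in the spirit of this slide: act greedily with respect to the current values, do a Bellman backup at each visited state, and sample a successor to simulate execution. It reuses the illustrative toy-MDP dictionaries from earlier; the goal set, step cap, and all identifiers are assumptions for illustration, not the lecture's notation.

```python
import random

def rtdp_trial(s, V, actions, P, C, goal, gamma=1.0, max_steps=100):
    """One RTDP trial: act greedily w.r.t. the current V, back up values along the way."""
    for _ in range(max_steps):
        if s in goal:
            break
        # Q-value of each action under the current (heuristically initialised) V
        q = {a: C[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
             for a in actions}
        a_best = min(q, key=q.get)
        V[s] = q[a_best]                      # Bellman backup at the visited state
        # Simulate (or execute) the action: sample a successor from Pr(s'|s,a)
        succ, probs = zip(*P[(s, a_best)].items())
        s = random.choices(succ, weights=probs)[0]
    return V
```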

Greedy "On-Policy" RTDP without execution: using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back, until the values stabilize.

Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest.
- Find a plan (a deterministic path) from the initial state to the goal state. This is a (very partial) policy, defined only for the states that fall on the maximum-probability state sequence.
- Consider the states that are most likely to be encountered while traveling this path, and find a policy for those states too.
- The tricky part is showing that this converges to the optimal policy.
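
A short sketch of the first step (most-likely-outcome determinization), written against the illustrative transition dictionary used earlier; the function name and data layout are assumptions, and the classical planning step itself is left to any off-the-shelf planner.

```python
def determinize_most_likely(P):
    """Keep only the most likely outcome of every (state, action) pair."""
    det = {}
    for (s, a), outcomes in P.items():
        s_ml = max(outcomes, key=outcomes.get)   # most likely successor
        det[(s, a)] = {s_ml: 1.0}
    return det

# A deterministic plan found on `det` (by any classical planner) gives the
# initial partial policy; envelope extension then adds policies for the states
# most likely to be encountered while following that plan.
```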

Incomplete observability (the dreaded POMDPs)
- To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when the world states are not). The policy then maps belief states to actions.
- In practice, this causes (humongous) problems:
  -- The space of belief states is "continuous" (even if the underlying world is discrete and finite).
  -- Even approximate policies are hard to find (PSPACE-hard).
  -- Problems with a few dozen world states are hard to solve currently.
  -- "Depth-limited" exploration (such as that done in adversarial games) is often the only option…
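
The belief states mentioned here are maintained by a Bayes filter over the underlying world states. A sketch, assuming an observation model O[(s', a)][o] giving Pr(o | s', a) alongside the earlier toy transition dictionary; all names are illustrative:

```python
def belief_update(b, a, o, states, P, O):
    """Bayes filter: b'(s') is proportional to Pr(o|s',a) * sum_s Pr(s'|s,a) * b(s)."""
    b_new = {}
    for s2 in states:
        pred = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in states)   # prediction step
        b_new[s2] = O[(s2, a)].get(o, 0.0) * pred                   # correction step
    z = sum(b_new.values())
    if z == 0:
        raise ValueError("observation has zero probability under this belief/action")
    return {s2: v / z for s2, v in b_new.items()}
```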