Summary of MDPs (until Now)

Finite-horizon MDPs
– Non-stationary policy
– Value iteration: compute V_0, …, V_k, …, V_T, the value functions for k stages to go
  V_k is computed in terms of V_{k-1}
  Policy π_k is the MEU (maximum expected utility) policy with respect to V_k

Infinite-horizon MDPs
– Stationary policy
– Value iteration: converges because of the contraction property of the Bellman operator
– Policy iteration

Indefinite-horizon MDPs -- Stochastic Shortest Path problems (with initial state given)
– Proper policies
– Can exploit the start state
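
The finite-horizon recursion above is easy to sketch in code. Below is a minimal, illustrative Python implementation assuming a small tabular MDP given as dictionaries; the function name and data layout (states, actions, T, R) are assumptions for illustration, not from the slides.

```python
# Minimal finite-horizon value iteration sketch (assumed tabular MDP layout):
# states: list of states; actions: list of actions
# T[s][a]: list of (next_state, probability) pairs; R[s][a]: immediate reward
def finite_horizon_vi(states, actions, T, R, horizon):
    V = {s: 0.0 for s in states}              # V_0: value with 0 stages to go
    policies = []                             # policies[0] is the policy for the full horizon
    for k in range(1, horizon + 1):
        V_new, pi_k = {}, {}
        for s in states:
            # Bellman backup: V_k(s) = max_a [ R(s,a) + sum_s' T(s,a,s') * V_{k-1}(s') ]
            q = {a: R[s][a] + sum(p * V[s2] for s2, p in T[s][a]) for a in actions}
            pi_k[s] = max(q, key=q.get)       # MEU action with k stages to go
            V_new[s] = q[pi_k[s]]
        V = V_new
        policies.insert(0, pi_k)              # non-stationary: one policy per stage to go
    return V, policies
```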

Ideas for Efficient Algorithms..

Use heuristic search (and reachability information)
– LAO*, RTDP
Use execution and/or simulation
– “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
– “Simulation”: simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
Use “factored” representations
– Factored representations for actions, reward functions, values and policies
– Directly manipulate factored representations during the Bellman update

Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)

VI and PI approaches use the dynamic programming update: set the value of a state in terms of the maximum expected value achievable by doing actions from that state. They do the update for every state in the state space
– Wasteful if we know the initial state(s) the agent is starting from
Heuristic search (e.g. A*/AO*) explores only the part of the state space that is actually reachable from the initial state. Even within the reachable space, heuristic search can avoid visiting many of the states
– Depending on the quality of the heuristic used..
But what is the heuristic?
– An admissible heuristic is a lower bound on the cost to reach the goal from any given state
– In other words, it is a lower bound on J*!

Real Time Dynamic Programming [Barto, Bradtke, Singh ’95]

Trial: simulate the greedy policy starting from the start state; perform Bellman backups on the visited states
RTDP: repeat trials until the cost function converges

RTDP was originally introduced for reinforcement learning
– For RL, instead of “simulate” you “execute”
– You also have to do “exploration” in addition to “exploitation”: with probability p follow the greedy policy, with probability 1-p pick a random action
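
A minimal sketch of a single RTDP trial in Python, assuming the same tabular layout as earlier but cost-based with a goal set (as in the SSP setting below); function and variable names are illustrative, not from the original slides.

```python
import random

# One RTDP trial: follow the greedy policy from s0, doing a Bellman backup
# (on costs) at each visited state.  J is a dict of current cost estimates.
def rtdp_trial(s0, goals, actions, T, C, J, max_steps=1000):
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Q(s,a) = expected immediate cost plus expected cost-to-go
        q = {a: sum(p * (C[s][a] + J[s2]) for s2, p in T[s][a]) for a in actions}
        a_greedy = min(q, key=q.get)      # greedy = minimum expected cost
        J[s] = q[a_greedy]                # Bellman backup on the visited state
        # simulate: sample a successor according to T(s, a_greedy, .)
        succs, probs = zip(*T[s][a_greedy])
        s = random.choices(succs, weights=probs)[0]
    return J
```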

[Figure: a Stochastic Shortest Path MDP example]

[Figure: RTDP trial — from s0, compute Q_{n+1}(s0, a) for actions a1, a2, a3 using the current estimates J_n of the successors, set J_{n+1}(s0) to the minimum, and follow the greedy action (here a_greedy = a2) toward the goal]

Greedy “On-Policy” RTDP without execution

Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.

Comments

Properties
– If all states are visited infinitely often, then J_n → J*
– Only relevant states will be considered. A state is relevant if the optimal policy could visit it
  (Notice the emphasis on “optimal policy”: just because a rough neighborhood surrounds the National Mall doesn’t mean you will need to know what to do in that neighborhood)
Advantages
– Anytime: more probable states are explored quickly
Disadvantages
– Complete convergence is slow!
– No termination condition
Do we care about complete convergence? Think Capt. Sullenberger

Labeled RTDP [Bonet & Geffner ’03]

Initialise J_0 with an admissible heuristic
– ⇒ J_n monotonically increases
Label a state as solved if the J_n for that state has converged (“converged” means its Bellman residual is less than ε)
Backpropagate the ‘solved’ labeling
Stop trials when they reach any solved state
Terminate when s_0 is solved

[Figures: if the remaining (high-Q-cost) alternatives cannot beat the best action, J(s) won’t change, so s can be labeled solved; when an unsolved state t lies between s and the goal, s and t get solved together]
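
A rough sketch of the core labeling check, under the same assumed tabular representation; it is only meant to illustrate the residual test and label propagation, not to reproduce Bonet & Geffner's implementation.

```python
# Label-checking sketch for Labeled RTDP (illustrative, not the original algorithm).
# A state is labeled solved when every state reachable under the current greedy
# policy has a Bellman residual below epsilon.
def check_solved(s, goals, actions, T, C, J, solved, epsilon=1e-3):
    open_list, closed = [s], set()
    consistent = True
    while open_list:
        x = open_list.pop()
        if x in closed or x in solved or x in goals:
            continue
        closed.add(x)
        q = {a: sum(p * (C[x][a] + J[s2]) for s2, p in T[x][a]) for a in actions}
        a_greedy = min(q, key=q.get)
        if abs(q[a_greedy] - J[x]) > epsilon:
            consistent = False            # residual too large: not solved yet
            J[x] = q[a_greedy]            # still perform the backup
        else:
            # expand greedy successors to keep checking consistency
            open_list.extend(s2 for s2, p in T[x][a_greedy] if p > 0)
    if consistent:
        solved.update(closed)             # label the whole checked envelope solved
    return consistent
```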

Properties

admissible J_0 ⇒ optimal J*
heuristic-guided – explores a subset of the reachable state space
anytime – focuses attention on more probable states
fast convergence – focuses attention on unconverged states
terminates in finite time

Recent Advances: Focused RTDP [Smith & Simmons ’06]
Similar to Bounded RTDP except
– a more sophisticated definition of priority that combines the gap and the probability of reaching the state
– adaptively increasing the max trial length

Recent Advances: Learning DFS [Bonet & Geffner ’06]
– The Iterative Deepening A* equivalent for MDPs
– Finds strongly connected components to check whether a state is solved

Other Advances

Ordering the Bellman backups to maximise information flow
– [Wingate & Seppi ’05], [Dai & Hansen ’07]
Partitioning the state space and combining value iterations from different partitions
– [Wingate & Seppi ’05], [Dai & Goldsmith ’07]
External-memory version of value iteration
– [Edelkamp, Jabbar & Bonet ’07]
…

Probabilistic Planning
-- The competition (IPPC)
-- The action language.. (PPDDL)

Factored Representations: Actions

Actions can be represented directly in terms of their effects on the individual state variables (fluents). The CPTs of the BNs can be represented compactly too
– Write a Bayes network relating the values of the fluents in the states before and after the action
– Bayes networks representing fluents at different time points are called “Dynamic Bayes Networks”; we look at 2TBNs (2-time-slice dynamic Bayes nets)
Go further by using the STRIPS assumption
– Fluents not affected by the action are not represented explicitly in the model
– Called the Probabilistic STRIPS Operator (PSO) model
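
To make the PSO idea concrete, here is a small, hypothetical Python encoding of one probabilistic action over boolean fluents; the exact data structure is an assumption for illustration, not PPDDL or any particular planner's format.

```python
# A Probabilistic STRIPS Operator (PSO) style encoding (illustrative only).
# Each outcome lists only the fluents the action changes; untouched fluents persist.
action_pickup = {
    "name": "pickup-block",
    "precondition": {"hand-empty": True, "block-on-table": True},
    "outcomes": [
        # (probability, effects as fluent -> new value)
        (0.8, {"holding-block": True, "hand-empty": False, "block-on-table": False}),
        (0.2, {}),                      # action slips: nothing changes
    ],
}

def apply(state, action, outcome_index):
    """Return the successor state for a chosen outcome (frame assumption: copy, then patch)."""
    _prob, effects = action["outcomes"][outcome_index]
    next_state = dict(state)            # fluents not mentioned keep their values
    next_state.update(effects)
    return next_state
```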

[Figure: dynamic Bayes net representation of an example action, including a CLK variable]

[Figure: an example MDP that is not ergodic]

How to compete?

Off-line policy generation
First compute the whole policy
– Get the initial state
– Compute the optimal policy given the initial state and the goals
Then just execute the policy
– Loop: do the action recommended by the policy; get the next state; until reaching a goal state
Pros: can anticipate all problems. Cons: may take too much time to start executing

Online action selection
Loop
– Compute the best action for the current state
– Execute it
– Get the new state
Pros: provides a fast first response. Cons: may paint itself into a corner..
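
The online loop is simple enough to sketch; the planner and simulator interfaces below (plan_action, simulator.reset, simulator.step) are hypothetical placeholders, not from any specific system.

```python
# Online action selection sketch (hypothetical simulator/planner interfaces).
def run_online(simulator, plan_action, goals, max_steps=200):
    state = simulator.reset()           # get the initial state
    for _ in range(max_steps):
        if state in goals:
            return True                 # reached a goal state
        action = plan_action(state)     # compute the best action for the current state
        state = simulator.step(action)  # execute it and observe the new state
    return False                        # ran out of steps (or painted into a corner)
```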

Two Models of Evaluating Probabilistic Planning

IPPC (Probabilistic Planning Competition)
– How often did you reach the goal under the given time constraints?
– e.g. FF-Replan, FF-HOP

Evaluate on the quality of the policy
– Converging to the optimal policy faster
– e.g. LRTDP, mGPT, Kolobov’s approach

1st IPPC & Post-Mortem..

IPPC competitors
– Most IPPC competitors used different approaches for offline policy generation
– One group implemented a simple online “replanning” approach in addition to offline policy generation:
  Determinize the probabilistic problem (most-likely vs. all-outcomes)
  Loop: get the state S; call a classical planner (e.g. FF) with [S,G] as the problem; execute the first action of the plan
– Umpteen reasons why such an approach should do quite badly..

Results and post-mortem
– To everyone’s surprise, the replanning approach wound up winning the competition
– Lots of hand-wringing ensued..
  Maybe we should require that the planners really, really use probabilities?
  Maybe the domains should somehow be made “probabilistically interesting”?
– Current understanding:
  No reason to believe that off-line policy computation must dominate online action selection
  The “replanning” approach is just a degenerate case of hindsight optimization

FF-Replan

Simple replanner
– Determinizes the probabilistic problem
– Solves for a plan in the determinized problem

[Figure: a plan a1–a2–a3–a4 from S to G, with replanning (a5) when execution leaves the planned path]

All-Outcome Replanning (FFR_A)

[Figure: an action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into two deterministic actions, Action 1 with Effect 1 and Action 2 with Effect 2]

Reducing calls to FF..

We can reduce calls to FF by memoizing successes
– If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0–a0–s1–a1–s2–a2–s3 … an–sG
– Then, in addition to sending the first action a0 to the simulator, we can memoize {si–ai} as a partial policy. Whenever a new state is given by the simulator, we can check whether it is already in the partial policy
Additionally, FF-Replan can consider every state in the partial policy table as a goal state (in that if it reaches one of them, it already knows how to get to the goal state..)
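
A small sketch of the memoized replanning loop described above; call_ff and the simulator interface are hypothetical stand-ins for the classical planner and the competition server.

```python
# Memoized FF-Replan sketch (call_ff and simulator are hypothetical interfaces).
def ff_replan_memoized(simulator, call_ff, goal, max_steps=500):
    partial_policy = {}                     # state -> action, memoized from past plans
    state = simulator.reset()
    for _ in range(max_steps):
        if state == goal:
            return True
        if state not in partial_policy:
            # call the deterministic planner and memoize the whole expected trajectory
            plan = call_ff(state, goal)     # [(s0, a0), (s1, a1), ...] on the expected path
            if plan is None:
                return False                # determinized problem unsolvable from here
            partial_policy.update(dict(plan))
        state = simulator.step(partial_policy[state])
    return False
```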

Hindsight Optimization for Anticipatory Planning/Scheduling

Consider a deterministic planning (scheduling) domain where goals arrive probabilistically
– Using up resources and/or doing greedy actions may preclude you from exploiting later opportunities
How do you select actions to perform?
– Answer: if you have a distribution of the goal arrivals, then
  Sample goals up to a certain horizon using this distribution
  Now we have a deterministic planning problem with known goals
  Solve it; do the first action from it
– Can improve accuracy with multiple samples
FF-HOP uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes).

Probabilistic Planning (goal-oriented)

[Figure: a two-step lookahead tree from initial state I with actions A1, A2 and their probabilistic outcomes (left outcomes are more likely); the objective is to maximize goal achievement while avoiding the dead end]

Probabilistic Planning: All-Outcome Determinization

[Figure: the same lookahead tree with each probabilistic action split into deterministic actions A1-1, A1-2, A2-1, A2-2 (one per outcome); the task becomes finding any path to the goal while avoiding the dead end]

Problems of FF-Replan and a better alternative: sampling

FF-Replan’s static determinizations don’t respect probabilities; we need “probabilistic and dynamic determinization”
– Sample future outcomes and determinize in hindsight
– Each future sample becomes a known-future deterministic problem

Hindsight Optimization (Online Computation of V_HS)

Pick the action a with the highest Q(s,a,H), where
– Q(s,a,H) = R(s,a) + Σ_s' T(s,a,s') V*(s',H-1)
Compute V* by sampling
– H-horizon future F_H for M = [S,A,T,R]: a mapping of state, action and time (h < H) to a state, S × A × h → S
– Common-random-number (correlated) vs. independent futures.. Time-independent vs. time-dependent futures
Value of a policy π for F_H: R(s, F_H, π)
– V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
– But this is still too hard to compute.. so let’s swap the max and the expectation:
  V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ]
– max_π R(s, F_{H-1}, π) is approximated by an FF plan

V_HS overestimates V*. Why?
– Intuitively, because V_HS can assume that it can use different policies in different futures, while V* needs to pick one policy that works best (in expectation) in all futures
– But then V_FFRa overestimates V_HS
– Viewed in terms of J*, V_HS is a more informed admissible heuristic..
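
A compact Python sketch of the online V_HS / Q computation described above: sample futures, solve each sampled (now deterministic) future, and average. The sample_future and solve_deterministic_future functions stand in for the future generator and the FF call and are hypothetical placeholders.

```python
# Hindsight-optimization action selection sketch (solve_deterministic_future is a
# hypothetical stand-in for running FF on one sampled, known-future problem).
def hop_choose_action(state, actions, sample_future, solve_deterministic_future,
                      horizon, num_samples=30):
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(num_samples):
            future = sample_future(horizon)            # fix all stochastic outcomes up front
            # reward of doing a now, then acting optimally in this known future:
            # R(s,a) + max_pi R(a(s), F_{H-1}, pi), the inner max approximated by FF
            total += solve_deterministic_future(state, a, future, horizon)
        q[a] = total / num_samples                     # Monte-Carlo estimate of Q(s,a,H)
    return max(q, key=q.get)
```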

Solving stochastic planning problems via determinizations

Quite an old idea (e.g. envelope extension methods)
What is new is the increasing realization that determinizing approaches provide state-of-the-art performance
– Even for probabilistically interesting domains
Should be a happy occasion..

Hindsight Optimization (Online Computation of V_HS)

H-horizon future F_H for M = [S,A,T,R]
– A mapping of state, action and time (h < H) to a state: S × A × h → S
– Each future is a deterministic problem
Value of a policy π for F_H: R(s, F_H, π)
Pick the action a with the highest Q(s,a,H), where
– Q(s,a,H) = R(s,a) + Σ_s' T(s,a,s') V*(s',H-1)
V*(s,H) = max_π E_{F_H} [ R(s, F_H, π) ]
Compare this with the hindsight value V_HS(s,H) = E_{F_H} [ max_π R(s, F_H, π) ]
– V_FFRa(s) = max_F V(s,F) ≥ V_HS(s,H) ≥ V*(s,H)
Q(s,a,H) = R(s,a) + E_{F_{H-1}} [ max_π R(a(s), F_{H-1}, π) ]
– In our proposal, the computation of max_π R(s, F_{H-1}, π) is done approximately by FF [Hoffmann and Nebel ’01]

Implementation: FF-Hindsight

– Constructs a set of futures
– Solves the planning problem for each H-horizon future using FF
– Sums the rewards of the resulting plans
– Chooses the action with the largest Q_HS value

Probabilistic Planning (goal-oriented)

[Figure: the same two-step lookahead tree from initial state I with actions A1, A2 (left outcomes are more likely), a goal state and a dead end]

Improvement Ideas

Reuse
– Generated futures that are still relevant
– Scoring for action branches at each step
– If expected outcomes occur, keep the plan
Future generation
– Not just probabilistic
– Somewhat even distribution over the space
Adaptation
– Dynamic width and horizon for sampling
– Actively detect and avoid unrecoverable failures on top of sampling

Hindsight Sample 1

[Figure: one sampled future of the lookahead tree (left outcomes more likely); in this sample A1 reaches the goal and A2 does not, so A1 scores 1 and A2 scores 0]

Exploiting Determinism

– Find the longest prefix shared by all the plans
– Apply the actions in the prefix repeatedly until one is not applicable
– Resume ZSL/OSL steps
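
A tiny sketch of the prefix idea: compute the shared deterministic prefix of the sampled plans and execute it directly before resuming sampling. Purely illustrative; the simulator and applicability check are assumed interfaces.

```python
# Longest-common-prefix sketch for exploiting determinism (illustrative only).
def common_prefix(plans):
    """plans: list of action sequences; return the longest shared prefix."""
    prefix = []
    for step_actions in zip(*plans):                # stops at the shortest plan
        if all(a == step_actions[0] for a in step_actions):
            prefix.append(step_actions[0])
        else:
            break
    return prefix

def execute_prefix(simulator, prefix, is_applicable):
    """Apply prefix actions until one is not applicable, then hand control back."""
    state = simulator.current_state()
    for a in prefix:
        if not is_applicable(state, a):
            break
        state = simulator.step(a)
    return state
```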

Exploiting Determinism

[Figure: three plans from S1 to G generated for the chosen action a*; the longest prefix of each plan is identified and executed without running ZSL, OSL or FF!]

Handling unlikely outcomes: All-Outcome Determinization

– Assign each possible outcome its own action
– Solve for a plan
– Combine the plan with the plans from the HOP solutions

Deterministic Techniques for Stochastic Planning No longer the Rodney Dangerfield of Stochastic Planning?

Determinizations

Most-likely-outcome determinization
– Inadmissible
– e.g. if the only path to the goal relies on a less likely outcome of an action
All-outcomes determinization
– Admissible, but not very informed
– e.g. a very unlikely outcome (turned into its own action) leads you straight to the goal
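
Using the PSO-style action encoding sketched earlier, both determinizations take only a few lines; again, the data layout is an assumption for illustration.

```python
# Determinization sketches over the PSO-style action dicts used earlier.
def most_likely_determinization(actions):
    """Keep only the single most probable outcome of each action (inadmissible)."""
    det = []
    for act in actions:
        _prob, effects = max(act["outcomes"], key=lambda o: o[0])
        det.append({"name": act["name"], "precondition": act["precondition"],
                    "effects": effects})
    return det

def all_outcomes_determinization(actions):
    """Turn every outcome into its own deterministic action (admissible, less informed)."""
    det = []
    for act in actions:
        for i, (_prob, effects) in enumerate(act["outcomes"]):
            det.append({"name": f"{act['name']}-{i}", "precondition": act["precondition"],
                        "effects": effects})
    return det
```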

Relaxations for Stochastic Planning

Determinizations can also be used as a basis for heuristics to initialize V for value iteration [mGPT, GOTH, etc.]
Heuristics come from relaxation. We can relax along two separate dimensions:
– Relax -ve interactions: consider +ve interactions alone using relaxed planning graphs
– Relax uncertainty: consider determinizations
– Or a combination of both!

Solving Determinizations

If we relax -ve interactions
– Then compute a relaxed plan
– Admissible if the optimal relaxed plan is computed; inadmissible otherwise
If we keep -ve interactions
– Then use a deterministic planner (e.g. FF/LPG)
– Inadmissible unless the underlying planner is optimal

Dimensions of Relaxation

[Figure: heuristics placed along two axes of increasing consideration, “Uncertainty” and “Negative Interactions”: the relaxed plan heuristic, McLUG, and FF/LPG]
Reducing uncertainty: bound the number of stochastic outcomes → stochastic “width”
Limited-width stochastic planning?

Dimensions of Relaxation

                              Uncertainty
-ve interactions    None            Some                                Full
None                Relaxed Plan    McLUG
Some
Full                FF/LPG          Limited-width stochastic planning

Expressiveness vs. Cost

[Figure: nodes expanded vs. heuristic computation cost for h = 0, FF-Replan (FF_R), FF, McLUG and limited-width stochastic planning]

Reducing Heuristic Computation Cost by Exploiting Factored Representations

The heuristics computed for a state might give us an idea about the heuristic values of other “similar” states
– Similarity can be determined in terms of the state structure
Exploit the overlapping structure of heuristics for different states
– E.g. the SAG idea for McLUG
– E.g. the triangle tables idea for plans (cf. Kolobov)

A Plan is a Terrible Thing to Waste

Suppose we have a plan
– s0–a0–s1–a1–s2–a2–s3 … an–sG
– We realize that this tells us not just the estimated value of s0, but also of s1, s2 … sn
– So we don’t need to compute the heuristic for them again
Is that all?
– If we have states and actions in factored representation, then we can explain exactly which aspects of si are relevant for the plan’s success
– The “explanation” is a proof of correctness of the plan
  » It can be based on regression (if the plan is a sequence) or on a causal proof (if the plan is partially ordered). The explanation will typically be just a subset of the literals making up the state
– That means the plan suffix from si may actually be relevant in many more states, namely all those consistent with that explanation
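
To illustrate the regression idea over the factored representation used above, here is a hypothetical sketch that regresses a deterministic plan suffix to find the subset of literals a state must satisfy for that suffix to succeed. It is a simplified sketch (it omits checks for conflicting effects), not the full proof machinery.

```python
# Regression sketch: compute, for each suffix of a deterministic plan, the set of
# fluent assignments a state must satisfy for the suffix to reach the goal.
# Actions use the deterministic dict form {"precondition": {...}, "effects": {...}}.
def regress(goal, plan):
    conditions = [dict(goal)]                 # conditions[-1]: what must hold at the end
    needed = dict(goal)
    for act in reversed(plan):
        # remove literals achieved by the action, keep the rest, add its preconditions
        needed = {f: v for f, v in needed.items() if act["effects"].get(f) != v}
        needed.update(act["precondition"])
        conditions.insert(0, dict(needed))
    return conditions                         # conditions[i]: explanation for the suffix from step i

def suffix_applies(state, explanation):
    """The memoized plan suffix is reusable in any state consistent with its explanation."""
    return all(state.get(f) == v for f, v in explanation.items())
```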

Triangle Table Memoization

Use triangle tables / memoization
[Figure: a blocks-world style example with blocks A, B, C; if the larger problem is solved, then we don’t need to call FF again for the smaller subproblem involving only A and B]

Explanation-based Generalization (of Successes and Failures)

Suppose we have a plan P that solves a problem [S, G]. We can first find out which aspects of S the plan actually depends on
– Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
– Now you can memoize this plan for just that subset of S

Factored Representations: Reward, Value and Policy Functions

Reward functions can be represented in factored form too. Possible representations include
– Decision trees (made up of fluents)
– ADDs (algebraic decision diagrams)
Value functions are like reward functions (so they too can be represented similarly)
The Bellman update can then be done directly on the factored representations..
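
As a toy illustration of a factored value function, here is a hypothetical decision-tree representation over boolean fluents; the specific fluents and values are invented for the example. Systems like SPUDD use ADDs, which additionally merge identical subgraphs for compactness.

```python
# Toy decision-tree value function over boolean fluents (illustrative only).
# A node is either a float (leaf value) or (fluent, low_subtree, high_subtree).
value_tree = ("has-package",
              ("at-office", 0.0, 2.5),       # no package: value depends on location
              10.0)                          # holding the package is worth 10

def evaluate(tree, state):
    """Walk the tree using the state's fluent values to get V(state)."""
    while not isinstance(tree, float):
        fluent, low, high = tree
        tree = high if state.get(fluent, False) else low
    return tree

print(evaluate(value_tree, {"has-package": False, "at-office": True}))  # 2.5
```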

SPUDD’s use of ADDs

Direct manipulation of ADDs in SPUDD