1
Probabilistic Planning (goal-oriented)
[Figure: a two-step tree rooted at initial state I with actions A1 and A2 at Time 1 and Time 2; probabilistic outcomes lead to goal states or dead ends, and left outcomes are more likely. Objective: maximize goal achievement.]
2
FF-Replan
–Simple replanner
–Determinizes the probabilistic problem
–Solves for a plan in the determinized problem
[Figure: a deterministic plan a1, a2, a3, a4 from S to G, with replanning (a5) when execution deviates from it.]
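A minimal sketch of this replanning loop. The planner, executor, and expected-outcome model are passed in as callables; none of these names come from the original FF-Replan code, they are placeholders for illustration.

```python
from typing import Callable, Hashable, List, Optional

State = Hashable
Action = str

def ff_replan(state: State,
              goal_test: Callable[[State], bool],
              plan_fn: Callable[[State], Optional[List[Action]]],
              expected_fn: Callable[[State, Action], State],
              execute_fn: Callable[[State, Action], State],
              max_steps: int = 1000) -> Optional[State]:
    """Replanning loop: follow a plan found on the determinized problem and
    replan whenever the observed outcome differs from the expected one."""
    plan = plan_fn(state)                      # classical plan (e.g. from FF)
    for _ in range(max_steps):
        if goal_test(state):
            return state
        if not plan:
            return None                        # dead end or planner failure
        action = plan.pop(0)
        expected = expected_fn(state, action)  # outcome under the determinization
        state = execute_fn(state, action)      # stochastic outcome actually observed
        if state != expected:
            plan = plan_fn(state)              # replan from the surprise state
    return None
```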
3
All Outcome Replanning (FFR_A)
[Figure: a probabilistic action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into deterministic Action 1 with Effect 1 and Action 2 with Effect 2.] (ICAPS-07)
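A sketch of the all-outcome determinization itself, under an assumed toy representation (states as dictionaries, effects as callables); the actual planners operate on (P)PDDL action schemas, so this only illustrates the split.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

StateDict = dict

@dataclass
class ProbAction:
    name: str
    precondition: Callable[[StateDict], bool]
    outcomes: List[Tuple[float, Callable[[StateDict], StateDict]]]  # (probability, effect)

@dataclass
class DetAction:
    name: str
    precondition: Callable[[StateDict], bool]
    effect: Callable[[StateDict], StateDict]

def all_outcome_determinize(actions: List[ProbAction]) -> List[DetAction]:
    """Create one deterministic action per probabilistic outcome,
    dropping the outcome probabilities (as FFR_A does)."""
    det: List[DetAction] = []
    for a in actions:
        for i, (_prob, effect) in enumerate(a.outcomes, start=1):
            det.append(DetAction(f"{a.name}-{i}", a.precondition, effect))
    return det
```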
4
Probabilistic Planning: All Outcome Determinization
[Figure: the same two-step tree, but each probabilistic action A1, A2 is replaced by deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome. Objective: find the goal; dead ends remain.]
6
Problems of FF-Replan and a better alternative: sampling
FF-Replan's static determinizations don't respect probabilities. We need probabilistic and dynamic determinization: sample future outcomes and determinize in hindsight. Each future sample becomes a known-future deterministic problem.
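A minimal sketch of what "sampling a future" can mean, under the simplifying assumption that the sampled outcome depends only on the action and the time step (the mathematical summary slide later defines a future as a mapping of state, action, and time to a state); names here are illustrative.

```python
import random
from typing import Dict, List, Tuple

Action = str

def sample_future(outcome_probs: Dict[Action, List[float]],
                  horizon: int,
                  rng: random.Random) -> Dict[Tuple[Action, int], int]:
    """Sample one future: for every action and time step, fix (by index) which
    of the action's outcomes will occur if it is applied at that step.
    Under this fixed future the planning problem is deterministic."""
    future: Dict[Tuple[Action, int], int] = {}
    for action, probs in outcome_probs.items():
        for t in range(horizon):
            # Draw the outcome index according to the action's outcome probabilities.
            future[(action, t)] = rng.choices(range(len(probs)), weights=probs)[0]
    return future
```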
7
Hindsight Optimization: Probabilistic Planning via Determinization in Hindsight
–Adds some probabilistic intelligence
–A kind of dynamic determinization of FF-Replan
8
Implementation: FF-Hindsight
–Constructs a set of futures
–Solves the planning problem for the H-horizon futures using FF
–Sums the rewards of each of the plans
–Chooses the action with the largest Q_hs value
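A compact sketch of this action-selection step. It assumes a helper `solve_in_future(s, a, f)` that applies action `a` in state `s`, runs FF on the deterministic problem induced by sampled future `f`, and returns the value (e.g., reward or goal-achievement indicator) of the plan it finds; all names are illustrative, not the original implementation's API.

```python
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = str
Future = int  # an identifier (e.g. a random seed) for one sampled future

def hop_choose_action(state: State,
                      actions: List[Action],
                      sample_future: Callable[[], Future],
                      solve_in_future: Callable[[State, Action, Future], float],
                      num_futures: int = 30) -> Action:
    """Hindsight-optimization action selection: estimate Q_hs(s, a) by solving
    each sampled (known-future) deterministic problem after applying `a`,
    then pick the action with the largest estimate."""
    futures = [sample_future() for _ in range(num_futures)]
    q_hs: Dict[Action, float] = {}
    for a in actions:
        # Sum (or average -- the argmax is the same) the hindsight plan values
        # returned by the deterministic planner over the sampled futures.
        q_hs[a] = sum(solve_in_future(state, a, f) for f in futures)
    return max(q_hs, key=q_hs.get)
```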
9
Probabilistic Planning (goal-oriented)
[Figure: the same two-step tree from state I with actions A1 and A2, goal states and dead ends; left outcomes are more likely. Objective: maximize goal achievement.]
10
Start Sampling
Note: sampling will reveal which is better at state I, A1 or A2. Sample time!
11
Hindsight Sample 1
[Figure: the tree under one sampled future; left outcomes are more likely. Running tally: A1: 1, A2: 0.]
12
Hindsight Sample 2
[Figure: the tree under a second sampled future. Running tally: A1: 2, A2: 1.]
13
Hindsight Sample 3
[Figure: the tree under a third sampled future. Running tally: A1: 2, A2: 1.]
14
Hindsight Sample 4
[Figure: the tree under a fourth sampled future. Final tally: A1: 3, A2: 1.]
15
Action Selection
With the tally A1: 3, A2: 1, we can now choose the action with the greatest Q_hs value (A1).
–Better action selection than FF-Replan
–Reflects the probabilistic outcomes of the actions
16
Constraints on FF-Hop
–Number of futures limits exploration
–Many plans need to be solved per action during action selection
–Max depth of the search is static and limited (horizon)
17
Improving Hindsight Optimization
Scaling Hindsight Optimization for Probabilistic Planning
–Uses three methods to improve FF-Hop:
  –Zero-step look-ahead (useful-action detection, sample and plan reuse)
  –Exploiting determinism
  –All-outcome determinization
–Significantly improves the scalability of FF-Hop by reducing the number of plans solved by FF
18
Zero-Step Look-ahead (ZSL)
–Generates a set of futures before the OSL (one-step look-ahead) step
–Solves the futures with FF
–Selects the first action of each plan as a 'useful action' when applying HOP
–The futures and plans are reused in the OSL step
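A sketch of the ZSL step under the same illustrative conventions as the earlier sketches: `ff_solve(state, future)` stands in for running FF on the deterministic problem induced by one sampled future and is an assumed helper, not a real API.

```python
from typing import Callable, Dict, Hashable, List, Optional, Set, Tuple

State = Hashable
Action = str
Future = int

def zsl_useful_actions(state: State,
                       sample_future: Callable[[], Future],
                       ff_solve: Callable[[State, Future], Optional[List[Action]]],
                       num_futures: int = 30
                       ) -> Tuple[Set[Action], Dict[Future, List[Action]]]:
    """Zero-step look-ahead: solve each sampled future from the current state
    and collect the first action of every plan as a 'useful action'.
    The futures and plans are returned so the OSL/HOP step can reuse them."""
    futures = [sample_future() for _ in range(num_futures)]
    plans: Dict[Future, List[Action]] = {}
    useful: Set[Action] = set()
    for f in futures:
        plan = ff_solve(state, f)
        if plan:                      # ignore futures FF could not solve
            plans[f] = plan
            useful.add(plan[0])
    return useful, plans
```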
19
[Figure: the ZSL step. From state S the possible actions are a1, a2, a3, …, an; the plans generated for the sampled futures begin with a1, a3, and a1, so the useful-action set is A_h = {a1, a3}.]
20
Exploiting Determinism
–Find the longest prefix common to all plans
–Apply the actions in the prefix continuously until one is not applicable
–Resume the ZSL/OSL steps
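A sketch of this prefix-execution idea; `applicable` and `execute` are assumed helpers standing in for the domain's applicability test and (stochastic) execution.

```python
from typing import Callable, Hashable, List

State = Hashable
Action = str

def common_prefix(plans: List[List[Action]]) -> List[Action]:
    """Longest prefix shared by all plans."""
    prefix: List[Action] = []
    for steps in zip(*plans):
        if all(s == steps[0] for s in steps):
            prefix.append(steps[0])
        else:
            break
    return prefix

def execute_prefix(state: State,
                   prefix: List[Action],
                   applicable: Callable[[State, Action], bool],
                   execute: Callable[[State, Action], State]) -> State:
    """Apply the prefix actions until one is not applicable, then hand
    control back to the ZSL/OSL steps."""
    for a in prefix:
        if not applicable(state, a):
            break
        state = execute(state, a)
    return state
```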
21
Exploiting Determinism
[Figure: plans generated for the chosen action a* from state S1; the longest prefix shared by the plans is identified and executed without running ZSL, OSL, or FF.]
22
All-outcome Determinization
–Assign each possible outcome its own action
–Solve for a plan
–Combine the plan with the plans from the HOP solutions
23
Further Improvement Ideas
Reuse
–Generated futures that are still relevant
–Scoring for action branches at each step
–If expected outcomes occur, keep the plan
Future generation
–Not just probabilistic
–Somewhat even distribution over the space
Adaptation
–Dynamic width and horizon for sampling
–Actively detect and avoid unrecoverable failures on top of sampling
24
Deterministic Techniques for Stochastic Planning
No longer the Rodney Dangerfield of Stochastic Planning?
25
Solving stochastic planning problems via determinizations
–Quite an old idea (e.g., envelope extension methods)
–What is new is the increasing realization that determinizing approaches provide state-of-the-art performance
  –Even for probabilistically interesting domains
–Should be a happy occasion…
26
Ways of using deterministic planning
–To compute the conditional branches
  –Robinson et al.
–To seed/approximate the value function
  –ReTrASE, Peng Dai, McLUG/POND, FF-Hop
–Use a single determinization
  –FF-Replan
  –ReTrASE (uses diverse plans for a single determinization)
–Use sampled determinizations
  –FF-Hop [AAAI 2008; with Yoon et al.]
  –Use relaxed solutions (for sampled determinizations): Peng Dai's paper, McLUG [AIJ 2008; with Bryce et al.]
Would be good to understand the tradeoffs…
Determinization = sampling the evolution of the world
27
Comparing approaches…
ReTrASE and FF-Hop seem closely related
–ReTrASE uses diverse deterministic plans for a single determinization; FF-Hop computes deterministic plans for sampled determinizations
–Is there any guarantee that syntactic (action) diversity is actually related to likely sampled worlds?
The cost of generating deterministic plans isn't exactly cheap…
–Relaxed-reachability-style approaches can compute multiple plans (for samples of the worlds)
–Would relaxation of the samples' plans be better or worse in convergence terms?
28
Science may never fully explain who killed JFK, but any explanation must pass scientific judgement. MDPs may never fully generate policies efficiently, but any approach that does must pass MDP judgement.
29
Mathematical Summary of the Algorithm
–An H-horizon future $F_H$ for $M = [S, A, T, R]$ is a mapping of state, action, and time ($h < H$) to a state: $S \times A \times h \rightarrow S$. Each future is a deterministic problem.
–Value of a policy $\pi$ for $F_H$: $R(s, F_H, \pi)$
–Hindsight value: $V_{HS}(s,H) = E_{F_H}[\max_{\pi} R(s, F_H, \pi)]$ (the inner maximization is done by FF)
–Compare this with the real value $V^*(s,H) = \max_{\pi} E_{F_H}[R(s, F_H, \pi)]$
–$V_{FFR_a}(s) = \max_{F} V(s,F) \ge V_{HS}(s,H) \ge V^*(s,H)$
–$Q(s,a,H) = R(a) + E_{F_{H-1}}[\max_{\pi} R(a(s), F_{H-1}, \pi)]$
–In our proposal, the computation of $\max_{\pi} R(s, F_{H-1}, \pi)$ is done approximately by FF [Hoffmann and Nebel '01]
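The chain of inequalities on this slide follows from exchanging the maximization over policies and the expectation over futures; a short justification in the slide's notation:

```latex
% For every fixed policy \pi and every future F_H,
%   R(s, F_H, \pi) \le \max_{\pi'} R(s, F_H, \pi') = V(s, F_H).
% Taking expectations over F_H and then maximizing over \pi on the left gives
% the first inequality; bounding the expectation by the best single future
% gives the second.
\[
V^*(s,H) = \max_{\pi} \mathbb{E}_{F_H}\!\big[R(s,F_H,\pi)\big]
\;\le\; \mathbb{E}_{F_H}\!\big[\max_{\pi} R(s,F_H,\pi)\big] = V_{HS}(s,H)
\;\le\; \max_{F} \max_{\pi} R(s,F,\pi) = V_{FFR_a}(s).
\]
```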