Probabilistic Planning (goal-oriented)

Presentation transcript:

Probabilistic Planning (goal-oriented)
[Figure: from initial state I, actions A1 and A2 each have probabilistic outcomes over Time 1 and Time 2; some branches reach the goal state and one leads to a dead end. The left outcomes are more likely. Objective: maximize goal achievement.]

FF-Replan
– Simple replanner
– Determinizes the probabilistic problem
– Solves for a plan in the determinized problem
[Figure: a plan a1, a2, a3, a4 from S to G in the determinized problem; when execution deviates from the plan, a new plan (e.g. via a5) to G is found.]
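
To make the replanning loop concrete, here is a minimal Python sketch (an illustration, not the authors' code). The callables determinize_fn, plan_fn, step_fn, expected_fn, and is_goal are assumptions of the sketch; plan_fn plays the role FF plays in the real system.

def ff_replan(initial_state, is_goal, determinize_fn, plan_fn, step_fn, expected_fn,
              max_steps=100):
    """Sketch of the FF-Replan loop: plan in the determinized problem, execute the
    plan action by action, and replan whenever the observed outcome differs from
    the one the determinization predicted.
    Assumed callables: determinize_fn() -> deterministic model;
    plan_fn(det_model, state) -> list of actions or None (FF's role);
    step_fn(state, action) -> sampled next state in the real, stochastic problem;
    expected_fn(det_model, state, action) -> next state the determinization predicts."""
    det_model = determinize_fn()
    state, plan = initial_state, []
    for _ in range(max_steps):
        if is_goal(state):
            return True
        if not plan:
            plan = plan_fn(det_model, state)
            if not plan:
                return False            # no plan found in the determinization (e.g. dead end)
        action = plan.pop(0)
        observed = step_fn(state, action)
        if observed != expected_fn(det_model, state, action):
            plan = []                   # unexpected outcome: discard the rest and replan
        state = observed
    return False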

All-Outcome Replanning (FFR_A)
[Figure: an action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into Action 1 with Effect 1 and Action 2 with Effect 2.]
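
A minimal sketch of this determinization (an illustration; the dictionary-based transition model is an assumption that the later sketches also use): each probabilistic outcome of an action becomes its own deterministic action, and the probabilities are simply dropped.

def all_outcome_determinize(transitions):
    """All-outcome determinization: for every (state, action) with outcomes
    [(p1, s1), (p2, s2), ...], create deterministic actions action-1, action-2, ...
    transitions: dict mapping (state, action) -> list of (probability, next_state)."""
    det = {}
    for (state, action), outcomes in transitions.items():
        for i, (_prob, next_state) in enumerate(outcomes, start=1):
            det[(state, f"{action}-{i}")] = next_state   # probabilities are discarded
    return det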

Probabilistic Planning: All-Outcome Determinization
[Figure: the same tree as before, with each action replaced by its deterministic all-outcome variants A1-1, A1-2, A2-1, A2-2 at every state; the dead end remains, and the objective becomes simply to find the goal.]

Problems of FF-Replan and a better alternative: sampling
– FF-Replan's static determinizations don't respect probabilities; we need probabilistic and dynamic determinization.
– Sample future outcomes and determinize in hindsight.
– Each sampled future becomes a known-future deterministic problem.

Hindsight Optimization
– Probabilistic planning via determinization in hindsight
– Adds some probabilistic intelligence
– A kind of dynamic determinization of FF-Replan

Implementation: FF-Hindsight
– Constructs a set of futures
– Solves the planning problem for each H-horizon future using FF
– Sums the rewards of each of the plans
– Chooses the action with the largest Q_hs value
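
A self-contained Python sketch of this loop (an illustration, not the authors' implementation): each sampled future is a known-future deterministic problem, solved here by exhaustive search where the real system calls FF; the action with the largest summed Q_hs value is returned. The transition-dictionary model is the same hypothetical one used in the determinization sketch above.

import random
from collections import defaultdict

def sample_future(transitions, horizon, rng):
    """Fix one outcome for every (state, action, step): a 'known future'.
    transitions: dict mapping (state, action) -> list of (probability, next_state)."""
    future = {}
    for (s, a), outcomes in transitions.items():
        probs, succs = zip(*outcomes)
        for h in range(horizon):
            future[(s, a, h)] = rng.choices(succs, weights=probs, k=1)[0]
    return future

def best_plan_value(future, state, actions, goals, horizon, step=0):
    """Value of the best plan in one known future: 1 if the goal is reachable
    within the horizon, else 0. The paper uses FF here; exhaustive search keeps
    the sketch self-contained."""
    if state in goals:
        return 1.0
    if step >= horizon:
        return 0.0
    return max((best_plan_value(future, future[(state, a, step)], actions, goals,
                                horizon, step + 1)
                for a in actions.get(state, [])), default=0.0)

def hop_action(state, actions, transitions, goals, horizon=3, num_futures=30, seed=0):
    """FF-Hindsight action selection: sum, over sampled futures, the value of the
    best plan that starts with each applicable action, then return the argmax."""
    rng = random.Random(seed)
    q_hs = defaultdict(float)
    for _ in range(num_futures):
        future = sample_future(transitions, horizon, rng)
        for a in actions[state]:
            successor = future[(state, a, 0)]
            q_hs[a] += best_plan_value(future, successor, actions, goals, horizon, step=1)
    return max(actions[state], key=lambda a: q_hs[a])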

Probabilistic Planning (goal-oriented)
[Figure: the same tree as before, repeated for the sampling walkthrough: from initial state I, actions A1 and A2 with probabilistic outcomes over Time 1 and Time 2, goal states, and a dead end; the left outcomes are more likely. Objective: maximize goal achievement.]

Start Sampling
Note: sampling will reveal which action, A1 or A2, is better at state I. Sample time!

Hindsight Sample 1
[Figure: one sampled future over the same tree. Running tally: A1: 1, A2: 0.]

Hindsight Sample 2
[Figure: another sampled future. Running tally: A1: 2, A2: 1.]

Hindsight Sample 3
[Figure: another sampled future. Running tally: A1: 2, A2: 1.]

Hindsight Sample 4
[Figure: another sampled future. Running tally: A1: 3, A2: 1.]

Action Selection
– We can now choose the action with the greatest Q_hs value: A1 (A1: 3, A2: 1).
– Better action selection than FF-Replan's, because it reflects the probabilistic outcomes of the actions.
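
As a usage example for the earlier hop_action sketch, here is a tiny hypothetical model in the spirit of the slides' figure (the exact states and probabilities are inventions of the sketch): A1's likely outcome leads toward the goal while A2's likely outcome leads to a dead end, so the sampled Q_hs tallies favor A1.

# Hypothetical toy model: states are strings, 'G' is the goal, 'D' is a dead end.
transitions = {
    ('I',  'A1'): [(0.8, 'S1'), (0.2, 'D')],   # the left (good) outcome is more likely
    ('I',  'A2'): [(0.8, 'D'),  (0.2, 'S1')],
    ('S1', 'A1'): [(0.8, 'G'),  (0.2, 'D')],
    ('S1', 'A2'): [(0.8, 'G'),  (0.2, 'D')],
}
actions = {'I': ['A1', 'A2'], 'S1': ['A1', 'A2']}
goals = {'G'}

print(hop_action('I', actions, transitions, goals, horizon=2, num_futures=20))
# Prints 'A1' for most seeds, mirroring the A1: 3 vs. A2: 1 tally on the slides.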

Constraints on FF-Hop
– The number of futures limits exploration.
– Many plans need to be solved per action during action selection.
– The maximum search depth (the horizon) is static and limited.

Improving Hindsight Optimization
– Scaling Hindsight Optimization for Probabilistic Planning
  – Uses three methods to improve FF-Hop:
    – Zero-step look-ahead (useful-action detection, sample and plan reuse)
    – Exploiting determinism
    – All-outcome determinization
  – Significantly improves the scalability of FF-Hop by reducing the number of plans solved by FF

Zero-Step Look-ahead (ZSL)
– Generates a set of futures before the one-step look-ahead (OSL)
– Solves the futures with FF
– Selects the first action of each plan to be used as a 'useful action' when applying HOP
– The futures and plans are reused in the OSL step
(A sketch follows the figure below.)

[Figure: ZSL step. Possible actions from S: a1, a2, a3, …, an. Plans from S to G are generated for each sampled future; the first actions of these plans form the useful-action set A_h = {a1, a3}.]
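
A minimal Python sketch of the ZSL step just described (an illustration; plan_fn stands in for FF, and the futures are assumed to come from a sampler like sample_future above): only the actions that start some plan need Q_hs estimates, and the solved plans are kept so the OSL/HOP step can reuse them.

def zsl_useful_actions(state, futures, plan_fn):
    """Zero-step look-ahead: solve each sampled future from the current state and
    keep the first action of every plan as a 'useful action'.
    plan_fn(future, state) returns a list of actions or None (FF in the paper)."""
    useful_actions, cached_plans = set(), []
    for future in futures:
        plan = plan_fn(future, state)
        if plan:
            useful_actions.add(plan[0])
            cached_plans.append((future, plan))   # reused later in the OSL step
    return useful_actions, cached_plans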

Exploiting Determinism
– Find the longest common prefix of all the plans.
– Apply the actions in the prefix one after another until one is not applicable.
– Then resume the ZSL/OSL steps.
(A sketch follows the figure below.)

Exploiting Determinism
[Figure: the plans generated for the chosen action a* share a common prefix; the longest prefix is identified and executed without running ZSL, OSL, or FF.]
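
A short Python sketch of this step (an illustration): compute the longest prefix shared by all cached plans for a*, then execute it action by action until some action no longer applies. The helpers applicable_fn and apply_fn are assumptions of the sketch.

def longest_common_prefix(plans):
    """Longest prefix of actions shared by every plan for the chosen action a*."""
    prefix = []
    for steps in zip(*plans):                    # i-th action of every plan, in lockstep
        if all(a == steps[0] for a in steps):
            prefix.append(steps[0])
        else:
            break
    return prefix

def execute_prefix(state, prefix, applicable_fn, apply_fn):
    """Apply the shared prefix until an action is not applicable, then hand control
    back to the ZSL/OSL loop. applicable_fn(state, action) -> bool and
    apply_fn(state, action) -> next_state are assumed helpers."""
    for action in prefix:
        if not applicable_fn(state, action):
            break
        state = apply_fn(state, action)
    return state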

All-Outcome Determinization
– Assign each possible outcome its own action.
– Solve for a plan in the resulting deterministic problem.
– Combine that plan with the plans from the HOP solutions.

Further Improvement Ideas
– Reuse
  – Generated futures that are still relevant
  – Scoring for action branches at each step
  – If the expected outcomes occur, keep the plan
– Future generation
  – Not just probabilistic
  – A somewhat even distribution over the space
– Adaptation
  – Dynamic width and horizon for sampling
  – Actively detect and avoid unrecoverable failures, on top of sampling

Deterministic Techniques for Stochastic Planning No longer the Rodney Dangerfield of Stochastic Planning?

Solving stochastic planning problems via determinizations
– Quite an old idea (e.g., envelope extension methods)
– What is new is the increasing realization that determinizing approaches provide state-of-the-art performance
  – Even for probabilistically interesting domains
– Should be a happy occasion…

Ways of using deterministic planning
– To compute the conditional branches
  – Robinson et al.
– To seed/approximate the value function
  – ReTrASE, Peng Dai, McLUG/POND, FF-Hop
– Use a single determinization
  – FF-Replan
  – ReTrASE (uses diverse plans for a single determinization)
– Use sampled determinizations
  – FF-Hop [AAAI 2008; with Yoon et al.]
  – Use relaxed solutions (for sampled determinizations)
    – Peng Dai's paper
    – McLUG [AIJ 2008; with Bryce et al.]
– Would be good to understand the tradeoffs…
Determinization = sampling an evolution of the world

Comparing approaches…
– ReTrASE and FF-Hop seem closely related
  – ReTrASE uses diverse deterministic plans for a single determinization; FF-Hop computes deterministic plans for sampled determinizations
  – Is there any guarantee that syntactic (action) diversity is actually related to likely sampled worlds?
– The cost of generating deterministic plans isn't exactly cheap
  – Relaxed-reachability style approaches can compute multiple plans (for samples of the worlds)
  – Would relaxing the samples' plans be better or worse in convergence terms?

Science may never fully explain who killed JFK, but any explanation must pass scientific judgment. MDPs may never be solved fully efficiently, but any approach that generates policies must pass the MDP judgment.

Mathematical Summary of the Algorithm
– An H-horizon future $F_H$ for $M = [S, A, T, R]$ is a mapping from state, action, and time ($h < H$) to a state: $F_H : S \times A \times h \rightarrow S$. Each future is a deterministic problem.
– Value of a policy $\pi$ for $F_H$: $R(s, F_H, \pi)$
– Hindsight value: $V_{HS}(s,H) = \mathbb{E}_{F_H}\!\left[\max_{\pi} R(s, F_H, \pi)\right]$
– Compare this with the real value: $V^{*}(s,H) = \max_{\pi} \mathbb{E}_{F_H}\!\left[R(s, F_H, \pi)\right]$
– $V_{FFR_a}(s) = \max_{F} V(s,F) \;\ge\; V_{HS}(s,H) \;\ge\; V^{*}(s,H)$
– $Q(s,a,H) = R(a) + \mathbb{E}_{F_{H-1}}\!\left[\max_{\pi} R(a(s), F_{H-1}, \pi)\right]$; the inner maximization is the part done by FF.
– In our proposal, the computation of $\max_{\pi} R(s, F_{H-1}, \pi)$ is done approximately by FF [Hoffmann and Nebel '01].
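
The sandwich inequality above is worth one explicit step (an addition, in the notation of the slide): for every sampled future, the hindsight-optimal plan does at least as well as any fixed policy, and an expectation over futures never exceeds the maximum over futures, so

\[
V^{*}(s,H) \;=\; \max_{\pi}\, \mathbb{E}_{F_H}\!\left[ R(s, F_H, \pi) \right]
\;\le\; \mathbb{E}_{F_H}\!\left[ \max_{\pi} R(s, F_H, \pi) \right] \;=\; V_{HS}(s,H)
\;\le\; \max_{F}\, \max_{\pi} R(s, F, \pi) \;=\; V_{FFR_a}(s).
\]

In other words, the hindsight estimate is an optimistic upper bound on the true value, and the all-outcome (FF-Replan) estimate is more optimistic still.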