Ames Research Center
Planning with Uncertainty in Continuous Domains
Richard Dearden (No fixed abode)
Joint work with: Zhengzhu Feng (U. Mass Amherst), Nicolas Meuleau, Dave Smith (NASA Ames), Richard Washington (Google)

Motivation
[Figure: candidate rover activities for several potential targets: panorama, image rock, dig trench.]
Problem: Scientists are interested in many potential targets. How to decide which to pursue?

Motivation
[Figure: the same candidate activities, now annotated with the questions that drive the choice: Time? Power? Likelihood of success? Targets of different value.]

Outline
- Introduction
- Problem Definition
- A Classical Planning Approach
- The Markov Decision Problem approach
- Final Comments

Problem Definition
- Aim: to select a "plan" that "maximises" the long-term expected reward received, given:
  - Limited resources (time, power, memory capacity).
  - Uncertainty about the resources required to carry out each action ("how long will it take to drive to that rock?").
  - Hard safety constraints on action applicability (must keep enough reserve power to maintain the rover).
  - Uncertain action outcomes (some targets may be unreachable, instruments may be impossible to place).
- Difficulties:
  - Continuous resources.
  - Actions have uncertain continuous outcomes.
  - Goal selection and optimization.
  - Also possibly concurrency, ...
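To make the setting concrete, here is a minimal sketch of how one action in such a problem might be encoded. The class, its field names, the Gaussian-usage assumption and the example instance are illustrative choices, not the representation used in the talk.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import random

@dataclass
class RoverAction:
    """One rover action with uncertain continuous resource usage (illustrative)."""
    name: str
    min_energy: float        # hard precondition: applicable only if E > min_energy (Ah)
    energy_mu: float         # mean energy usage (Ah)
    energy_sigma: float      # std. dev. of energy usage (Ah)
    time_mu: float           # mean duration (s)
    time_sigma: float        # std. dev. of duration (s)
    window: Optional[Tuple[float, float]] = None  # allowed start-time window (seconds)
    value: float = 0.0       # reward obtained if the action achieves a goal

    def sample_usage(self) -> Tuple[float, float]:
        """Draw one (energy, time) outcome, assuming Gaussian usage uncertainty."""
        energy = max(0.0, random.gauss(self.energy_mu, self.energy_sigma))
        duration = max(0.0, random.gauss(self.time_mu, self.time_sigma))
        return energy, duration

# Hypothetical instance in the spirit of the example problem on the next slides
nir = RoverAction("NIR", min_energy=0.12, energy_mu=0.1, energy_sigma=0.01,
                  time_mu=120.0, time_sigma=20.0, value=50.0)
```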

Possible Approaches
- Contingency Planning: Generate a single plan, but with branches. Branch based on the actual outcome of the actions performed so far in the plan. [Figure: a plan that branches on Power > 5 Ah vs. Power ≤ 5 Ah.]
- Policy-based Planning: A plan is now a policy: a mapping from states to actions. There is something to do no matter what the outcome of the actions so far. More general, but harder to compute.

An Example Problem
[Figure: an example rover plan with actions including Drive(-2), Dig(60), Visual servo(.2, -.15), Rock finder, NIR, Lo res and Hi res. Each action is annotated with an energy precondition (e.g. E > 10 Ah), the mean and standard deviation of its energy and time usage, and in some cases a start-time window (e.g. t ∈ [10:00, 14:00]); the goals carry values V = 100, 50, 10 and 5.]

Value Function
[Figure: expected value as a function of available power and start time.]

Value Function
[Figure: the same expected-value surface over power and start time, shown alongside the example plan it was computed for.]

Plans
- Contingency Planning: [Figure: the example plan with a contingency branch taken when Time > 13:40 or Power < 10.]
- Policy-based Planning: Regions of state space have corresponding actions, e.g. conditions on time and power mapping to VisualServo, Lo-Res or Hi-Res.

Contingency Planning
1. Seed plan
2. Identify best branch point:
   - Construct plangraph
   - Back-propagate value tables
   - Compute gain
3. Generate a contingency branch
4. Evaluate & integrate the branch
[Figure: a seed plan with candidate branch points, and the branch value V_b compared against the main-plan value V_m over remaining resource r.]
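The loop above can be summarised in a few lines. This is only a sketch: the four callbacks are hypothetical placeholders for the plangraph-based machinery described on the next slides, which identifies the branch point and estimates its gain.

```python
def incremental_contingency_planning(seed_plan, best_branch_point,
                                     plan_branch, integrate_branch,
                                     max_branches=5):
    """Sketch of the seed-plan / branch loop; all callbacks are hypothetical."""
    plan = seed_plan()                                  # 1. generate a seed plan
    for _ in range(max_branches):
        point, gain = best_branch_point(plan)           # 2. best branch point + expected gain
        if gain <= 0:
            break                                       #    no branch is worth adding
        branch = plan_branch(plan, point)               # 3. plan a contingency branch from there
        plan = integrate_branch(plan, point, branch)    # 4. evaluate and splice it in
    return plan
```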

Construct Plangraph
[Figure: a plangraph from the initial state to goals g1, g2, g3, g4.]

Add Resource Usages and Values
[Figure: the plangraph with resource usages attached to actions and values V1, V2, V3, V4 attached to goals g1 to g4.]

Value Graphs
[Figure: each goal g1 to g4 is given a value graph: its value Vi as a function of remaining resource r.]

Propagate Value Graphs
[Figure: the value-over-resource graphs are propagated backwards through the plangraph towards the initial state.]

Simple Back-propagation
[Figure: back-propagating a goal's value-over-resource table through an action that consumes resources, giving the value table for being before the action.]
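For intuition, here is a purely numerical sketch of back-propagating a value-over-resource table through one action using sampled resource usage. The Gaussian usage and the sampling are illustrative assumptions; the method in the talk manipulates the tables symbolically rather than by sampling.

```python
import numpy as np

def backpropagate_value(v_after, resource_grid, usage_mu, usage_sigma,
                        min_resource, n_samples=2000, seed=0):
    """Value of being *before* an action, given the value table *after* it.

    v_after[i] is the expected future value with resource_grid[i] resource
    remaining after the action.  Usage is assumed Gaussian (illustrative),
    and the action is inapplicable (value 0) when r <= min_resource.
    """
    rng = np.random.default_rng(seed)
    usages = rng.normal(usage_mu, usage_sigma, n_samples).clip(min=0.0)
    v_before = np.zeros_like(v_after, dtype=float)
    for i, r in enumerate(resource_grid):
        if r <= min_resource:
            continue                              # precondition violated: value 0
        remaining = r - usages                    # resource left after the action
        # look up v_after at each sampled remaining level (0 below the grid)
        v_before[i] = np.interp(remaining, resource_grid, v_after, left=0.0).mean()
    return v_before
```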

Constraints
[Figure: an action's resource precondition (e.g. r > 15) truncates the back-propagated value table: the value is zero wherever the constraint is violated.]

Conjunctions
[Figure: when an action has several preconditions achieved by different actions (e.g. p and q), the value tables for those conditions are combined, and the resulting pieces are labelled with the conditions ({q}, {t}) they depend on.]

Back-propagating Conditions
[Figures, two steps: the condition labels ({q}, {t}) on pieces of the value tables are carried along as the tables are back-propagated through earlier actions, so each piece records what must hold for its value to be available.]

Which Orderings?
[Figure: a partial order over actions A, B, C, D admits several linearizations: CDAB, CABD, CADB, ACBD, ACDB, ABCD. Not all of these orderings can be considered during back-propagation.]

Combining Tables: Max
[Figure: when only one of two goals can be pursued, their value tables v1(r) and v2(r) are combined by taking the pointwise maximum.]

Achieving Both Goals
[Figure: where enough resource remains to achieve both goals, their values add (v1 + v2); the combined table uses this sum where it applies and the single-goal maximum elsewhere.]
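A simplified numerical version of these two combination rules, assuming for clarity that each goal has a fixed, deterministic resource cost; the real tables also carry conditions and uncertain usage.

```python
import numpy as np

def combine_goal_tables(r_grid, v1, v2, cost1, cost2):
    """Combine two goals' value-over-resource tables (simplified sketch).

    - If only one goal can be pursued, take the pointwise max of the tables.
    - Where enough resource remains for both (r >= cost1 + cost2, assuming
      deterministic costs), their values add.
    """
    r = np.asarray(r_grid)
    one_goal = np.maximum(v1, v2)
    both = np.where(r >= cost1 + cost2, np.asarray(v1) + np.asarray(v2), 0.0)
    return np.maximum(one_goal, both)
```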

Estimating Branch Value
[Figure: the back-propagated value tables for goals 1 to 4 are combined (by max) into a single value-over-resource estimate at the candidate branch point.]

Estimating Branch Value
[Figure: at the branch point we have the main plan's value function V_m(r), the candidate branch's value function V_b(r), and the probability distribution P(r) over the resource remaining when the branch point is reached.]

Expected Branch Gain
Gain = ∫_0^∞ P(r) · max{0, V_b(r) - V_m(r)} dr
[Figure: P(r), V_b(r) and V_m(r) plotted over remaining resource r.]
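Evaluated numerically on a resource grid, the gain integral is just a weighted sum of the branch's improvement over the main plan. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def expected_branch_gain(r_grid, p_r, v_branch, v_main):
    """Gain = integral over r of P(r) * max{0, V_b(r) - V_m(r)}."""
    improvement = np.maximum(0.0, np.asarray(v_branch) - np.asarray(v_main))
    return np.trapz(np.asarray(p_r) * improvement, np.asarray(r_grid))
```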

Heuristic Guidance
- Plangraphs are generally used as heuristics; the plans they produce may not be executable:
  - Not all orderings are considered.
  - All the usual plangraph limitations:
    - Delete lists generally not considered.
    - No mutual-exclusion representation.
  - Discrete outcomes not (currently) handled:
    - Action uncertainty is only in resource usage, not in the resulting state.
- The output is used as heuristic guidance for a classical planner, given the start state and the goal(s) to achieve.
- The result is an executable plan of high value!
[Figure: the resulting plan: Drive(-1), Dig(5), Visual servo(.2, -.15), Rock finder, NIR, Lo res, Hi res.]

Evaluating the Final Plan
[Figure: the expected-value surface over power and start time for the final plan.]
- The plangraph gives a heuristic estimate of the value of the plan.
- A better estimate can be computed using Monte-Carlo techniques, but these are quite slow for a multi-dimensional continuous problem.
- The figure required 500 samples per point over 4000 x 2000 points, i.e. simulating every branch of the plan roughly 4 billion times. Slow!
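A sketch of the Monte-Carlo estimate for a single start state; `simulate_one_run` is a hypothetical callback that samples each action's duration and energy usage and follows the plan's branches. Repeating this over a grid of start times and power levels reproduces (slowly) the value surface in the figure.

```python
def monte_carlo_plan_value(simulate_one_run, start_time, start_power, n_samples=500):
    """Average the value obtained over n_samples simulated executions of the plan."""
    total = 0.0
    for _ in range(n_samples):
        total += simulate_one_run(start_time, start_power)
    return total / n_samples
```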

Outline
- Introduction
- Problem Definition
- A Classical Planning Approach
- The Markov Decision Problem approach
- Final Comments

MDP Approach: Motivation
[Figure: the expected-value surface over power and start time, with a large region where the value is constant.]
The value function is constant throughout the region. Wouldn't it be nice to compute that value only once?
Approach: Exploit the structure in the problem to find constant (or linear) regions.

Continuous MDPs
- States: X = {X_1, X_2, ..., X_n}
- Actions: A = {a_1, a_2, ..., a_m}
- Transition: P_a(X' | X)
- Reward: R_a(X)
- Dynamic programming (Bellman backup):
  V_{n+1}(X) = max_a [ R_a(X) + ∫ P_a(X' | X) V_n(X') dX' ]
- This cannot be computed in general without discretization.
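The naïve workaround mentioned on the slide is to discretize the state space and apply the backup on the grid. A minimal sketch, where the grid, the matrix shapes and the finite-horizon (undiscounted) form are assumptions for illustration:

```python
import numpy as np

def bellman_backup(V, P, R):
    """One Bellman backup on a discretized state space.

    V: (S,) current value of each grid cell.
    P: (A, S, S) transition probabilities per action, P[a, s, s'] (rows sum to 1).
    R: (A, S) immediate reward per action and cell.
    Returns V'(s) = max_a [ R_a(s) + sum_{s'} P_a(s'|s) * V(s') ].
    """
    q = R + np.einsum('ast,t->as', P, V)   # Q-values, shape (A, S)
    return q.max(axis=0)
```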

Symbolic Dynamic Programming
- Special representation of the transition, reward and value functions: MTBDDs for the discrete variables, kd-trees for the continuous ones.
- The representation makes problem structure (if any) explicit.
- Dynamic programming operates on both the value function and the structured representation.
- The idea is to perform all the operations of the Bellman equation in MTBDD/kd-tree form.

Ames Research Center  Requires rectangular transition, reward functions: Continuous State Abstraction  Transition probabilities remain constant (relative to current value) over region.  Transition function is discrete: approximate continuous functions by discretizing. Required so family of value functions is closed under the Bellman Equation.

Ames Research Center  Requires rectangular transition, reward functions: Continuous State Abstraction  Reward function piecewise constant or linear over region.  This, along with discrete transition function, ensures all value functions computed using Bellman equation are also piecewise constant or linear.  Approach is to compute exact solution to approximate model.

Value Iteration
- Theorem: If V_n is rectangular piecewise constant (piecewise linear), then V_{n+1} is rectangular piecewise constant (piecewise linear).
- Represent the rectangular partitions using kd-trees.
[Figure: a backup through P_a maps V_n to V_{n+1} on a refined rectangular partition.]
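A sketch of a piecewise-constant value function stored as a kd-tree over axis-aligned rectangles, enough to show how lookups exploit the partition; the actual algorithm also performs the backups directly on this representation, which is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KDNode:
    """kd-tree node over an axis-aligned rectangle of the continuous state space."""
    value: Optional[float] = None        # set at leaves (piecewise-constant case)
    dim: Optional[int] = None            # split dimension at internal nodes
    threshold: Optional[float] = None    # split point along that dimension
    low: Optional["KDNode"] = None       # subtree where x[dim] <  threshold
    high: Optional["KDNode"] = None      # subtree where x[dim] >= threshold

    def evaluate(self, x: List[float]) -> float:
        """Look up V(x) by descending to the leaf whose rectangle contains x."""
        if self.value is not None:
            return self.value
        child = self.low if x[self.dim] < self.threshold else self.high
        return child.evaluate(x)

# Example: V = 10 when power < 5 Ah, otherwise V = 100 (one split, two leaves)
v = KDNode(dim=0, threshold=5.0, low=KDNode(value=10.0), high=KDNode(value=100.0))
assert v.evaluate([3.0]) == 10.0 and v.evaluate([7.0]) == 100.0
```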

Partitioning
[Figure.]

Performance: 2 Continuous Variables
[Figure: results.]

Performance: 3 Continuous Variables
- For the naïve algorithm, we just discretize everything at the given input resolution.
- For the others, we discretize the transition functions at that resolution, but the algorithm may increase the resolution to represent the final value function accurately. This means the value function is actually more accurate than for the naïve algorithm.

Final Remarks
- Plangraph-based approach:
  - Produces "plans", which are easy for people to interpret.
  - Fast heuristic estimate of the value of a plan or plan fragment.
  - Needs an effective way to evaluate actual values to really know a branch is worthwhile.
  - Efficient representation for problems with many goals.
  - Still missing discrete action outcomes.
- MDP-based approach:
  - Produces optimal policies, the best you could possibly do.
  - Faster, more accurate value function computation (if there is structure).
  - Hard to represent some problems effectively (e.g. the fact that goals are worth something only before you reach them).
  - Policies are hard for humans to interpret.
- The two can be combined: use the MDP approach to evaluate the quality of plans and plan fragments.

Future Work
- We approximate by building an approximate model and then solving it exactly. One could instead approximately solve the exact model.
- The plangraph approach takes advantage of the current system state when planning, which narrows the search. The MDP policy probably includes value computation for many unreachable states.
- Preference elicitation is very important here: with many goals we need good estimates of their values.
- This is part of a greater whole, the rover planning problem. Is the policy encoded efficiently enough to transmit to the rover? How much more complex does the executive need to be to carry out a contingent plan?