TKK | Automation Technology Laboratory. AS Postgraduate Course in Automation Technology. Partially Observable Markov Decision Processes (Chapters 15 & 16). José Luis Peralta.


Contents
- POMDP
- Example POMDP
- Finite World POMDP Algorithm
- Practical Considerations
- Approximate POMDP Techniques

Partially Observable Markov Decision Processes (POMDP)
POMDP:
- uncertainty in measurements of the state,
- uncertainty in control effects.
We adapt the previous Value Iteration Algorithm (VIA) to this setting.

Partially Observable Markov Decision Processes (POMDP)
POMDP:
- The world cannot be sensed directly: measurements are incomplete, noisy, etc. (partial observability).
- The robot has to estimate a posterior distribution over possible world states.

Partially Observable Markov Decision Processes (POMDP)
POMDP:
- Algorithms that find the optimal control policy exist for the FINITE WORLD: the state space, action space, space of observations, and planning horizon are all finite.
- The computation is complex.
- For the continuous case there are approximations.

Partially Observable Markov Decision Processes (POMDP)
The algorithms we are going to study are all based on Value Iteration (VI), the same as before, except that the state is not observable. The robot has to make decisions in the BELIEF STATE:
- the robot's internal knowledge about the state of the environment,
- the space of posterior distributions over states.
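
The slide's formula did not survive the transcript; as a sketch, following the belief-space value iteration in [1] (Ch. 15), the backup now runs over beliefs b rather than states x:

V_T(b) = γ max_u [ r(b, u) + ∫ V_{T-1}(b′) p(b′ | u, b) db′ ]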

Partially Observable Markov Decision Processes (POMDP)
So the control policy becomes a function of the belief: π(b) → u.

Partially Observable Markov Decision Processes (POMDP)
Belief: b = bel(x). Each value in a POMDP is a function of an entire probability distribution. Problems:
- a finite state space yields a continuous belief space;
- a continuous state space yields an infinitely-dimensional belief continuum;
- there is also complexity in calculating the value function, because of the integral over all distributions.

Partially Observable Markov Decision Processes (POMDP)
In the end, an optimal solution exists for an interesting special case, the finite world: the state space, action space, space of observations, and planning horizon are all finite. The solutions of the value function are then piecewise linear functions over the belief space. This arises because expectation is a linear operation, and because of the ability to select different controls in different parts of the belief space.
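
In the standard alpha-vector notation (a general fact about finite POMDPs, not shown on the slides): each horizon-T value function can be written as a maximum over a finite set of linear functions,

V_T(b) = max_k Σ_x α_k(x) b(x)

where each α_k is one linear "alpha vector" and the maximum selects the best one at each belief b.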

Example POMDP
2 states: x1 and x2. 3 control actions: u1, u2, and u3 (names as in [1]).

Example POMDP
When the robot executes u1 or u2 it receives a payoff that is opposite in the two states. This is the dilemma: knowledge of the state translates directly into payoff.
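
The payoff table was a figure on the slide; for concreteness, the values in the book's version of this example [1] are:

r(x1, u1) = -100    r(x2, u1) = +100
r(x1, u2) = +100    r(x2, u2) = -50

so u1 is excellent in x2 but catastrophic in x1, and vice versa for u2.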

Example POMDP
To acquire knowledge, the robot has a third control that affects the state of the world in a non-deterministic manner; its small negative payoff is the cost of waiting, the cost of sensing, etc.
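
In the book's example [1], this action is u3, with payoff r(x1, u3) = r(x2, u3) = -1 and transition probabilities

p(x1′ | x1, u3) = 0.2    p(x2′ | x1, u3) = 0.8
p(x1′ | x2, u3) = 0.8    p(x2′ | x2, u3) = 0.2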

Example POMDP
The benefit: before each control decision, the robot can sense. By sensing, the robot gains knowledge about the state, makes better control decisions, and so raises its payoff expectation. With the control action u3 the robot senses without taking a terminal action.

Example POMDP
The measurement model is governed by the following probability distribution:
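
The distribution itself appeared as a figure; the values in the book's example [1] are:

p(z1 | x1) = 0.7    p(z2 | x1) = 0.3
p(z1 | x2) = 0.3    p(z2 | x2) = 0.7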

Example POMDP
This example is easy to graph over the belief space: with 2 states, the belief state b = (p1, p2) satisfies p2 = 1 - p1, so the belief space is the unit interval parameterized by p1 = b(x1).

Example POMDP
A control policy is then a function that maps the unit interval [0, 1] to the space of all actions.

Example POMDP – Control Choice
Control choice: when to execute what control? First consider the immediate payoff, which is now a function of the belief state. For a belief b = (p1, 1 - p1), the expected payoff of control u in a POMDP is

r(b, u) = p1 r(x1, u) + (1 - p1) r(x2, u)
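
With the example payoffs above (the book's numbers [1]), these expected payoffs are straight lines in p1:

r(b, u1) = -100 p1 + 100 (1 - p1) = 100 - 200 p1
r(b, u2) = +100 p1 - 50 (1 - p1) = 150 p1 - 50
r(b, u3) = -1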

Example POMDP – Control Choice
(The three slides here showed figures only: the individual payoff functions plotted over the belief space.)

Example POMDP – Control Choice
First we calculate the expected payoff of each action; the robot simply selects the action with the highest expected payoff. The result is a piecewise linear, convex function: the maximum of the individual payoff functions.
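
In symbols, this horizon-1 value function is the upper envelope

V1(b) = max { r(b, u1), r(b, u2), r(b, u3) }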

Example POMDP – Control Choice
First we calculate where each action has the highest expected payoff; a transition in the optimal policy occurs where two payoff lines intersect.
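
With the lines above (derived from the book's numbers [1]), the crossover between u1 and u2 lies where 100 - 200 p1 = 150 p1 - 50, i.e. at p1 = 3/7 ≈ 0.43: for p1 < 3/7 the best immediate action is u1, for p1 > 3/7 it is u2.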

Example POMDP – Sensing
Now we add perception: what if the robot can sense before it chooses a control, and how does that affect the optimal value function? Sensing yields information about the state and enables the robot to choose better control actions. In the previous example we computed the expected payoff without sensing; how much better will it be after sensing?

Example POMDP – Control Choice
The belief after sensing, as a function of the belief before sensing, is given by Bayes' rule.
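
Written out for the two-state example (using the book's measurement model [1]): after observing z1, the belief becomes

p1′ = p(x1 | z1) = p(z1 | x1) p1 / p(z1) = 0.7 p1 / (0.4 p1 + 0.3)

with normalizer p(z1) = 0.7 p1 + 0.3 (1 - p1) = 0.4 p1 + 0.3.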

Example POMDP – Control Choice
How does this affect the value function?

Example POMDP – Control Choice
Mathematically, that is just replacing the belief by the post-sensing belief in the value function.

Example POMDP – Control Choice
However, our interest is the complete expected value function after sensing, which also considers the probability of sensing the other measurement. This is given by:
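
Reconstructing the slide's equation from the surrounding text: the expected value after sensing averages the value at each posterior belief, weighted by the probability of that measurement,

V̄(b) = Σ_z p(z | b) V(b | z)

which in the two-measurement example is p(z1) V(b after z1) + p(z2) V(b after z2).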

Example POMDP – Control Choice
And this results in the curve shown on the original slide.

Example POMDP – Control Choice
Mathematically:

Example POMDP – Prediction
To plan at a horizon larger than one, we have to take the state transition caused by u3 into consideration and project our value function accordingly, according to our transition probability model. In between the transition points the expectation is linear.
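
Under the book's transition model for u3 [1], the projected belief is again linear in p1:

p1′ = 0.2 p1 + 0.8 (1 - p1) = 0.8 - 0.6 p1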

Example POMDP – Prediction
And this results in the projected value function shown on the original slide.

Example POMDP – Prediction
And adding the payoff lines for the other controls, we have:

Example POMDP – Prediction
Mathematically (accounting for the cost of executing u3):

Example POMDP – Pruning
A full backup is impractical: the number of linear functions grows exponentially with the planning horizon unless dominated ones are pruned. Efficient approximate POMDP techniques are needed.
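
As an illustrative sketch (my own code, not the slides'; it exploits the 2-state case, where every linear function is a line over p1 ∈ [0, 1]), pruning keeps only the lines that appear somewhere on the upper envelope:

```python
import numpy as np

def prune(lines, grid=1001):
    """Keep only the lines a + b*p1 that are maximal somewhere on [0, 1].

    lines: list of (a, b) pairs, each representing the value a + b * p1.
    """
    p1 = np.linspace(0.0, 1.0, grid)          # discretized belief space
    values = np.array([a + b * p1 for (a, b) in lines])
    winners = set(values.argmax(axis=0))      # index of the best line per point
    return [lines[i] for i in sorted(winners)]

# The three immediate-payoff lines from the book's example [1]:
lines = [(100.0, -200.0),   # r(b, u1) = 100 - 200 p1
         (-50.0,  150.0),   # r(b, u2) = 150 p1 - 50
         (-1.0,     0.0)]   # r(b, u3) = -1
print(prune(lines))         # u3's line never wins, so it is pruned
```

Exact pruning uses linear programs instead of a grid, but the grid version shows the idea.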

Finite World POMDP Algorithm
To understand this, read the mathematical derivation of POMDPs in [1] (the algorithm listings appeared as figures on the original slides).

Example POMDP – Practical Considerations
It looks easy, so let's try something more "real": the probabilistic robot "RoboProb".

Example POMDP – Practical Considerations
It looks easy, so let's try something more "real": the probabilistic robot "RoboProb". 11 states; 5 control actions, one of which is to sense without moving; and a transition model (given as a figure on the original slide).

Example POMDP – Practical Considerations
It looks easy, so let's try something more "real": the probabilistic robot "RoboProb". The "reward" is the payoff; the same payoff set is used for all control actions.

Example POMDP – Practical Considerations
It's getting kind of hard :S… For "RoboProb", the transition probabilities were given by example (as figures on the original slides).

Example POMDP – Practical Considerations
It's getting kind of hard :S… The measurement probability for "RoboProb" (given as a figure on the original slide).

Example POMDP – Practical Considerations
It's getting kind of hard :S… Belief states for "RoboProb": with 11 states, the belief space is a 10-dimensional simplex, impossible to graph!

Example POMDP – Practical Considerations
It's getting kind of hard :S… Each linear function results from executing some control u, followed by observing some measurement z, and then executing another control u′.

Example POMDP – Practical Considerations
Implementing "RoboProb" means:
- defining the measurement probability,
- defining the "reward" (payoff),
- defining the transition probability, and
- merging the transition (control) probabilities.

Example POMDP – Practical Considerations
Then: setting the beliefs, executing a control, sensing, and executing again.

Example POMDP – Practical Considerations
Now what…? For "RoboProb", calculating the value function runs into the real problem: computing the belief transition probability p(b′ | u, b).

Example POMDP – Practical Considerations
The real problem is to compute p(b′ | u, b). Given a belief and a control action, the outcome is a distribution over distributions: the next belief also depends on the next measurement, and the measurement itself is generated stochastically. The key factor in this update is the conditional probability p(b′ | u, b), which specifies a distribution over probability distributions.

Example POMDP – Practical Considerations
So we condition on the measurement: p(b′ | u, b) = Σ_z p(b′ | u, b, z) p(z | u, b). Given b, u, and z, the successor belief is uniquely determined by the Bayes filter, so p(b′ | u, b, z) contains only one non-zero term.

Example POMDP – Practical Considerations
Arriving at an update that just integrates over measurements instead of over successor beliefs. Because our space is finite, the integral becomes a finite sum over measurements.
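
A sketch of the resulting backup (standard finite-POMDP form, consistent with the derivation in [1]):

V_T(b) = γ max_u [ r(b, u) + Σ_z p(z | u, b) V_{T-1}(B(b, u, z)) ]

where B(b, u, z) denotes the Bayes-filter update of b under control u and measurement z.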

Example POMDP – Practical Considerations
In the end we have something, but this value iteration algorithm is far from practical: for any reasonable number of distinct states, measurements, and controls, the complexity of the value function is prohibitive, even for relatively benign planning horizons. Approximations are needed.

Approximate POMDP Techniques
Here we have three approximate probabilistic planning and control algorithms:
- QMDP
- AMDP
- MC-POMDP
They have varying degrees of practical applicability. All three rely on approximations of the POMDP value function; they differ in the nature of their approximations.

Approximate POMDP Techniques – QMDP
The QMDP framework considers uncertainty only for a single action choice:
- It assumes that after the immediate next control action, the state of the world suddenly becomes observable.
- Full observability makes it possible to use the MDP-optimal value function.
- QMDP generalizes the MDP value function to belief spaces through the mathematical expectation operator.
- Planning in QMDP is as efficient as in MDPs, but the value function generally overestimates the true value of a belief state.

Approximate POMDP Techniques – QMDP Algorithm
The QMDP framework considers uncertainty only for a single action choice.
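
To make the idea concrete, here is a minimal sketch of the QMDP rule (my own illustrative code, not the slides'; the tabular P/R setup and names are assumptions): solve the underlying MDP by value iteration, then score each action in a belief by its expected MDP Q-value.

```python
import numpy as np

def mdp_q_values(P, R, gamma=0.95, iters=500):
    """Value iteration for the fully observable MDP.

    P: transition tensor, P[u, x, y] = p(y | x, u)
    R: payoff matrix, R[x, u]
    Returns Q[x, u].
    """
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * np.einsum('uxy,y->xu', P, V)   # one Bellman backup
        V = Q.max(axis=1)
    return Q

def qmdp_action(Q, belief):
    """QMDP rule: pick argmax_u of sum_x b(x) Q(x, u)."""
    return int(np.argmax(belief @ Q))
```

Because Q assumes full observability after one step, this rule never selects an action purely to gather information, which is exactly the overestimation mentioned above.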

Approximate POMDP Techniques – AMDP
The Augmented MDP (AMDP) maps the belief into a lower-dimensional representation, over which it then performs exact value iteration. The "classical" representation consists of the most likely state under the belief, along with the belief entropy. AMDPs are like MDPs with one added dimension in the state representation, one that measures the global degree of uncertainty. To implement an AMDP, it is necessary to learn the state transition and the reward function in the low-dimensional belief space.

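A minimal sketch of this "classical" representation (illustrative code; the function name is mine):

```python
import numpy as np

def augmented_state(belief):
    """AMDP state: (most likely state under b, belief entropy H(b))."""
    b = np.asarray(belief, dtype=float)
    most_likely = int(np.argmax(b))               # arg max_x b(x)
    nz = b[b > 0.0]                               # avoid log(0)
    entropy = float(-(nz * np.log(nz)).sum())     # H(b) = -sum_x b(x) log b(x)
    return most_likely, entropy
```

Value iteration is then performed over this (state, entropy) space.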

Approximate POMDP Techniques – AMDP
The application of AMDPs to mobile robot navigation is called coastal navigation. It anticipates uncertainty and selects motions that trade off overall path length against the uncertainty accrued along a path. The resulting trajectories differ significantly from any non-probabilistic solution: being temporarily lost is acceptable if the robot can later re-localize with sufficiently high probability.

Approximate POMDP Techniques – AMDP Algorithm
(The algorithm listing and an example appeared as figures on the original slides.)

Approximate POMDP Techniques – MC-POMDP
The Monte Carlo POMDP (MC-POMDP) is the particle-filter version of POMDPs: it calculates a value function defined over sets of particles. MC-POMDP uses a local learning technique, a locally weighted learning rule combined with a proximity test based on KL-divergence. It then applies Monte Carlo sampling to implement an approximate value backup. The result is a full-fledged POMDP algorithm whose computational complexity and accuracy are both functions of the parameters of the learning algorithm.

Approximate POMDP Techniques – MC-POMDP
The belief is represented by a particle set, and the value function is defined over such particle sets.

Approximate POMDP Techniques – MC-POMDP Algorithm
(The algorithm listing and an example appeared as figures on the original slides.)

References
[1] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.

Exercise
Exercise 1 in [1], Chapter 15:
A person faces two doors. Behind one is a tiger, behind the other a reward of +10. The person can either listen or open one of the doors. When opening the door with a tiger, the person will be eaten, which has an associated cost of -20. Listening costs -1. When listening, the person will hear a roaring noise that indicates the presence of the tiger, but only with 0.85 probability will the person be able to localize the noise correctly. With 0.15 probability, the noise will appear as if it came from the door hiding the reward.
Your questions:
(a) Provide the formal model of the POMDP, in which you define the state, action, and measurement spaces, the cost function, and the associated probability functions.
(b) What is the expected cumulative payoff/cost of the open-loop action sequence "listen, listen, open door 1"? Explain your calculation.
(c) What is the expected cumulative payoff/cost of the open-loop action sequence "listen, then open the door for which we did not hear a noise"? Again, explain your calculation.