TKK | Automation Technology Laboratory
Partially Observable Markov Decision Processes (Chapters 15 & 16)
José Luis Peralta
AS Postgraduate Course in Automation Technology

Contents
- POMDP
- Example POMDP
- Finite World POMDP Algorithm
- Practical Considerations
- Approximate POMDP Techniques
Partially Observable Markov Decision Processes (POMDP)
- Uncertainty in measurements of the state
- Uncertainty in control effects
- Adapt the previous Value Iteration Algorithm (VIA)
Partially Observable Markov Decision Processes (POMDP)
- The world cannot be sensed directly
- Measurements are incomplete, noisy, etc.: partial observability
- The robot has to estimate a posterior distribution over the possible world states
Partially Observable Markov Decision Processes (POMDP)
- An algorithm exists that finds the optimal control policy for a FINITE world:
  - state space, action space, space of observations, and planning horizon all finite
- The computation is complex
- For the continuous case there are approximations
Partially Observable Markov Decision Processes (POMDP)
- The algorithms we are going to study are all based on Value Iteration (VI)
- The setting is the same as before, but the state is not observable
- The robot has to make its decisions in the BELIEF STATE:
  - the robot's internal knowledge about the state of the environment
  - the space of posterior distributions over states
Partially Observable Markov Decision Processes (POMDP)
- So the control policy is now defined over beliefs rather than states: it maps a belief b to a control u = π(b)
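As a minimal sketch of what a belief state is and how it evolves, the snippet below implements the two update steps of a Bayes filter for a two-state world. The numeric transition and measurement models are the ones used in the example of [1]; the names and structure are illustrative, not from the slides.

```python
# p(x' | x, u3): executing u3 flips the state with probability 0.8
P_TRANS = {("x1", "x1"): 0.2, ("x1", "x2"): 0.8,
           ("x2", "x1"): 0.8, ("x2", "x2"): 0.2}
# p(z | x): the sensor reports the correct state with probability 0.7
P_MEAS = {("z1", "x1"): 0.7, ("z2", "x1"): 0.3,
          ("z1", "x2"): 0.3, ("z2", "x2"): 0.7}

def predict(belief):
    """Project the belief through the transition model of control u3."""
    states = list(belief)
    return {x2: sum(P_TRANS[(x1, x2)] * belief[x1] for x1 in states)
            for x2 in states}

def correct(belief, z):
    """Bayes measurement update: b'(x) is proportional to p(z|x) b(x)."""
    unnorm = {x: P_MEAS[(z, x)] * belief[x] for x in belief}
    eta = sum(unnorm.values())
    return {x: v / eta for x, v in unnorm.items()}

b = {"x1": 0.5, "x2": 0.5}   # uniform prior belief
b = correct(b, "z1")         # sensing z1 shifts the belief toward x1
b = predict(b)               # executing u3 mostly flips it back
print(b)                     # b(x1) ≈ 0.38
```

The belief, not the hidden state, is what the controller sees; all later value functions in this chapter are defined over exactly this object.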
Partially Observable Markov Decision Processes (POMDP)
- Each value in a POMDP is a function of the entire probability distribution, the belief bel
- Problems:
  - a finite state space yields a continuous belief space
  - a continuous state space yields an infinite-dimensional belief continuum
- There is also complexity in calculating the value function, because of the integral over all distributions
Partially Observable Markov Decision Processes (POMDP)
- Still, an optimal solution exists for an interesting special case, the finite world: state space, action space, space of observations, and planning horizon all finite
- There the value functions are piecewise linear functions over the belief space. This arises because:
  - expectation is a linear operation
  - the robot can select different controls in different parts of the belief space
Example POMDP
- 2 states and 3 control actions (in the notation of [1]: states x1, x2 and controls u1, u2, u3)
Example POMDP
- When executing u1 or u2 the robot receives a payoff; in [1] the values are r(x1,u1) = -100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = -50
- Dilemma: opposite payoffs in each state
- Knowledge of the state translates directly into payoff
Example POMDP
- To acquire knowledge, the robot has a third control, u3, which affects the state of the world in a non-deterministic manner (in [1]: the state flips with probability 0.8 and stays with probability 0.2)
- Its payoff of -1 models the cost of waiting, the cost of sensing, etc.
Example POMDP
- Benefit: before each control decision, the robot can sense
- By sensing, the robot gains knowledge about the state and can make better control decisions, giving a higher payoff expectation
- In the case of control action u3, the robot senses without taking a terminal action
Example POMDP
- The measurement model is governed by the following probability distribution (values from [1]): p(z1|x1) = 0.7, p(z2|x1) = 0.3, p(z1|x2) = 0.3, p(z2|x2) = 0.7
Example POMDP
- This example is easy to graph over the belief space (2 states)
- The belief state is fully described by a single number, p1 = b(x1), since b(x2) = 1 - p1
Example POMDP
- A control policy is a function that maps the unit interval [0,1] (the belief space) to the space of all actions
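A control policy over this belief space is then just a function on [0,1]. The threshold policy below is a hypothetical illustration of the idea, not the optimal policy derived later:

```python
# A hypothetical threshold policy for the two-state world.
# p1 = b(x1); payoffs per [1]: u2 pays off in x1, u1 pays off in x2.

def policy(p1: float) -> str:
    """Map a belief p1 = b(x1) to one of the three controls."""
    if p1 > 0.6:
        return "u2"    # confident enough in x1: take the action that pays there
    if p1 < 0.4:
        return "u1"    # confident enough in x2
    return "u3"        # uncertain: wait and sense

print(policy(0.9), policy(0.1), policy(0.5))   # u2 u1 u3
```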
Example POMDP – Control Choice
- Control choice: when to execute what control?
- First consider the immediate payoff
- The payoff in POMDPs is a function of the belief state: for a belief b = (p1, 1-p1), the expected payoff is r(b,u) = p1 r(x1,u) + (1-p1) r(x2,u)
Example POMDP – Control Choice
- First we calculate r(b,u) for each control; the robot simply selects the action of highest expected payoff
- The result is a piecewise linear, convex function: the maximum of the individual payoff functions
Example POMDP – Control Choice
- The robot selects the action of highest expected payoff
- In the optimal policy, the transition between actions occurs where the corresponding payoff lines cross
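Using the payoff values of the example in [1], the choice of highest expected payoff can be sketched as follows; the crossing point of the u1 and u2 lines, p1 = 3/7, is the transition point of the horizon-1 policy:

```python
# Expected payoff lines r(b,u) = p1*r(x1,u) + (1-p1)*r(x2,u).
# Payoff values as in the example of [1].
R = {"u1": (-100.0, 100.0),   # (r(x1,u), r(x2,u))
     "u2": (100.0, -50.0),
     "u3": (-1.0, -1.0)}

def expected_payoff(p1, u):
    r1, r2 = R[u]
    return p1 * r1 + (1 - p1) * r2

def best_action(p1):
    """Action of highest expected payoff at belief p1."""
    return max(R, key=lambda u: expected_payoff(p1, u))

# u1 and u2 cross where 100 - 200*p1 = 150*p1 - 50, i.e. at p1 = 3/7
print(best_action(0.3))   # u1
print(best_action(0.6))   # u2
```

The horizon-1 value function is the upper envelope of these three lines: piecewise linear and convex, exactly as the slide states.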
Example POMDP – Sensing
- Now we have perception: what if the robot can sense before it chooses a control?
- How does this affect the optimal value function?
- Sensing information about the state enables a better choice of control action
- In the previous example we computed the expected payoff; how much better will it be after sensing?
Example POMDP – Sensing
- The belief after sensing, as a function of the belief before sensing, is given by Bayes rule
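With the measurement model of [1], this Bayes update can be written as a scalar function of p1, the belief before sensing:

```python
# Belief after sensing, as a function of the belief p1 before sensing:
#   after z1: p1' = 0.7*p1 / (0.7*p1 + 0.3*(1-p1))
#   after z2: p1' = 0.3*p1 / (0.3*p1 + 0.7*(1-p1))
# Measurement probabilities as in the example of [1].

def after_z1(p1: float) -> float:
    return 0.7 * p1 / (0.7 * p1 + 0.3 * (1 - p1))

def after_z2(p1: float) -> float:
    return 0.3 * p1 / (0.3 * p1 + 0.7 * (1 - p1))

print(after_z1(0.5))   # ≈ 0.7: z1 pushes the belief toward x1
print(after_z2(0.5))   # ≈ 0.3: z2 pushes it toward x2
```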
Example POMDP – Sensing
- How does this affect the value function?
Example POMDP – Sensing
- Mathematically, it amounts to replacing the belief before sensing by the updated belief in the value function
Example POMDP – Sensing
- However, our interest is the complete expected value function after sensing, which also considers the probability of sensing the other measurement
- It is given by the expectation over both measurements: the value at each updated belief, weighted by the measurement probability p(z|b)
- The result is again a piecewise linear, convex function over the belief space
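Putting the pieces together: the expected value function after sensing weighs the horizon-1 value at each updated belief by the measurement probability. A sketch, again with the payoff and measurement models of [1]:

```python
# Expected value after sensing: sum over z of p(z|b) * V1(belief after z).
# Payoff and measurement values as in the example of [1].
R = {"u1": (-100.0, 100.0), "u2": (100.0, -50.0), "u3": (-1.0, -1.0)}

def V1(p1):
    """Horizon-1 value: maximum over actions of the expected payoff."""
    return max(p1 * r1 + (1 - p1) * r2 for r1, r2 in R.values())

def expected_value_after_sensing(p1):
    pz1 = 0.7 * p1 + 0.3 * (1 - p1)   # p(z1 | b)
    pz2 = 1.0 - pz1
    b_z1 = 0.7 * p1 / pz1             # Bayes update for each measurement
    b_z2 = 0.3 * p1 / pz2
    return pz1 * V1(b_z1) + pz2 * V1(b_z2)

# Sensing never hurts in expectation:
print(expected_value_after_sensing(0.5) >= V1(0.5))   # True
```

At p1 = 0.5 the value rises from 25 to 47.5: the quantitative payoff of information.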
Example POMDP – Prediction
- To plan at a horizon larger than 1, we have to take the effect of the control on the state into consideration and project our value function accordingly, according to our transition probability model
- In between, the expectation is linear
Example POMDP – Prediction
- Projecting each linear piece of the value function through the transition model results in new linear pieces
- Adding the payoff of executing u3 and combining with the payoff lines for u1 and u2, we obtain the next value function
Example POMDP – Prediction
- Mathematically: project the value function through the transition model, then add the cost of the control
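Because expectation is linear, projecting the value function reduces to projecting each linear piece separately. A sketch of this step, assuming the transition probabilities of the example in [1] (stay 0.2, flip 0.8):

```python
# Project one linear piece of the value function through the transition
# model of control u3. A piece is given by its values at p1=1 and p1=0.
# Transition probabilities as in the example of [1].

def project_line(v1, v2, p_stay=0.2):
    """After u3: b'(x1) = p_stay*p1 + (1-p_stay)*(1-p1), so the piece
    p1*v1 + (1-p1)*v2 maps to a new piece with mixed endpoints."""
    new_v1 = p_stay * v1 + (1 - p_stay) * v2   # value at p1 = 1
    new_v2 = (1 - p_stay) * v1 + p_stay * v2   # value at p1 = 0
    return new_v1, new_v2

# The u1 payoff line (-100 at p1=1, +100 at p1=0) becomes:
print(project_line(-100.0, 100.0))   # (60.0, -60.0)
```

The projected piece is again linear in p1, which is why the piecewise-linear form survives the prediction step.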
Example POMDP – Pruning
- A full backup multiplies the number of linear pieces at every step: impractical!
- Efficient approximate POMDP techniques are needed
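A simple, approximate way to prune dominated pieces is to keep only those that attain the maximum somewhere on a fine grid of beliefs; exact pruning would use linear programming instead. The line values below reuse the example payoffs of [1] plus one obviously dominated piece:

```python
# Grid-based pruning of dominated linear pieces (an approximation; an
# exact prune would solve a small LP per piece).

def prune(pieces, grid_size=1000):
    """pieces: list of (v1, v2) pairs, the values at p1=1 and p1=0.
    Keep a piece only if it is the maximum at some grid belief."""
    keep = set()
    for i in range(grid_size + 1):
        p1 = i / grid_size
        values = [p1 * v1 + (1 - p1) * v2 for v1, v2 in pieces]
        keep.add(values.index(max(values)))
    return [pieces[i] for i in sorted(keep)]

pieces = [(-100.0, 100.0), (100.0, -50.0), (-1.0, -1.0), (-200.0, -200.0)]
print(len(prune(pieces)))   # 2: the u3 line and the dominated line are gone
```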
Finite World POMDP Algorithm
- To understand this, read the mathematical derivation of POMDPs in [1]
Example POMDP – Practical Considerations
- It looks easy, so let's try something more “real”: the probabilistic robot “RoboProb”
Example POMDP – Practical Considerations
- “RoboProb” has 11 states and 5 control actions
- The robot can sense without moving
- A transition model describes the effect of each control
Example POMDP – Practical Considerations
- “Reward” payoff: the same set for all control actions
Example POMDP – Practical Considerations
- It's getting kind of hard: a transition probability example for “RoboProb”
Example POMDP – Practical Considerations
- The measurement probability model for “RoboProb”
Example POMDP – Practical Considerations
- With 11 states, the belief space is a 10-dimensional simplex: impossible to graph!
Example POMDP – Practical Considerations
- Each linear function results from executing a control, followed by observing a measurement, and then executing another control
Example POMDP – Practical Considerations
- Defining the measurement probability
- Defining the “reward” payoff
- Defining the transition probability
- Merging the transition (control) probabilities
Example POMDP – Practical Considerations
- Setting the beliefs
- Executing a control, sensing, executing again
Example POMDP – Practical Considerations
- Now what? The real problem is to compute the value function backup
Example POMDP – Practical Considerations
- The real problem is to compute the belief-space update: given a belief and a control action, the outcome is a distribution over distributions
- This is because the next belief also depends on the next measurement, and the measurement itself is generated stochastically
- The key factor in this update is the conditional probability p(b'|u,b), which specifies a distribution over probability distributions
Example POMDP – Practical Considerations
- Conditioned on a particular measurement, however, the next belief is unique, so the distribution contains only one non-zero term
Example POMDP – Practical Considerations
- This lets us integrate over measurements instead of over next beliefs
- Because our space is finite, the integral becomes a finite sum over measurements, with each term weighted by the measurement probability p(z|u,b)
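The weighting factor can be sketched directly: p(z|u,b) marginalizes the measurement model over the predicted state, p(z|u,b) = Σ_x' p(z|x') Σ_x p(x'|u,x) b(x). The models below are again those of the two-state example in [1]:

```python
# Measurement likelihood p(z | u3, b) for the two-state example of [1].
P_TRANS = {("x1", "x1"): 0.2, ("x1", "x2"): 0.8,
           ("x2", "x1"): 0.8, ("x2", "x2"): 0.2}
P_MEAS = {("z1", "x1"): 0.7, ("z2", "x1"): 0.3,
          ("z1", "x2"): 0.3, ("z2", "x2"): 0.7}

def prob_z(z, belief):
    """p(z | u3, b): marginalize the sensor model over the predicted state."""
    states = list(belief)
    return sum(P_MEAS[(z, x2)] *
               sum(P_TRANS[(x1, x2)] * belief[x1] for x1 in states)
               for x2 in states)

b = {"x1": 1.0, "x2": 0.0}
print(prob_z("z1", b))                     # ≈ 0.38
print(prob_z("z1", b) + prob_z("z2", b))   # sums to 1, up to float rounding
```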
Example POMDP – Practical Considerations
- So this VIA is far from practical: for any reasonable number of distinct states, measurements, and controls, the complexity of the value function is prohibitive, even for relatively benign planning horizons
- Hence the need for approximations
Approximate POMDP Techniques
- Three approximate probabilistic planning and control algorithms: QMDP, AMDP, and MC-POMDP
- They have varying degrees of practical applicability
- All three rely on approximations of the POMDP value function; they differ in the nature of their approximations
Approximate POMDP Techniques – QMDP
- The QMDP framework considers uncertainty only for a single action choice: it assumes that after the immediate next control action, the state of the world suddenly becomes observable
- Full observability makes it possible to use the MDP-optimal value function
- QMDP generalizes the MDP value function to belief spaces through the mathematical expectation operator
- Planning in QMDP is as efficient as in MDPs, but the value function generally overestimates the true value of a belief state
Approximate POMDP Techniques – QMDP Algorithm
- The QMDP framework considers uncertainty only for a single action choice
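A minimal QMDP sketch for the two-state example of [1]: solve the underlying MDP by finite-horizon value iteration (as a simplifying assumption of this sketch, the terminal actions u1/u2 are modeled as one-shot rewards with no successor state), then score each action in a belief by the expectation of its Q-values:

```python
STATES = ["x1", "x2"]
ACTIONS = ["u1", "u2", "u3"]
# Payoffs of the example in [1]; u1 and u2 are terminal, u3 costs -1
R = {("x1", "u1"): -100.0, ("x2", "u1"): 100.0,
     ("x1", "u2"): 100.0, ("x2", "u2"): -50.0,
     ("x1", "u3"): -1.0, ("x2", "u3"): -1.0}
# p(x' | x, u3): u3 flips the state with probability 0.8
P_U3 = {("x1", "x1"): 0.2, ("x1", "x2"): 0.8,
        ("x2", "x1"): 0.8, ("x2", "x2"): 0.2}

def q_values(horizon=2):
    """Finite-horizon value iteration for the fully observable MDP."""
    V = {x: 0.0 for x in STATES}
    Q = {}
    for _ in range(horizon):
        for x in STATES:
            Q[(x, "u1")] = R[(x, "u1")]   # terminal: one-shot reward
            Q[(x, "u2")] = R[(x, "u2")]
            Q[(x, "u3")] = R[(x, "u3")] + sum(
                P_U3[(x, x2)] * V[x2] for x2 in STATES)
        V = {x: max(Q[(x, u)] for u in ACTIONS) for x in STATES}
    return Q

def qmdp_action(belief, Q):
    """Pick the action maximizing the expected Q-value under the belief."""
    return max(ACTIONS,
               key=lambda u: sum(belief[x] * Q[(x, u)] for x in STATES))

Q = q_values()
# QMDP overestimates: at b(x1) = 0.9 it still prefers to wait (u3),
# because it assumes the state becomes observable after one step.
print(qmdp_action({"x1": 0.9, "x2": 0.1}, Q))   # u3
```

The printed result illustrates the overestimation mentioned above: with full observability assumed one step ahead, waiting looks almost free, so QMDP chooses u3 even at fairly confident beliefs.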
Approximate POMDP Techniques – AMDP
- The Augmented MDP (AMDP) maps the belief into a lower-dimensional representation, over which it then performs exact value iteration
- The “classical” representation consists of the most likely state under a belief, along with the belief entropy
- AMDPs are like MDPs with one added dimension in the state representation that measures the global degree of uncertainty
- To implement an AMDP, it is necessary to learn the state transition and reward functions in the low-dimensional belief space
Approximate POMDP Techniques – AMDP
- The “classical” representation: the most likely state under the belief, along with the belief entropy
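The belief compression itself is a one-liner per statistic; the sketch below maps a finite-state belief to the (most likely state, entropy) pair used by the AMDP:

```python
import math

def compress(belief):
    """Map a belief (dict state -> prob) to (most likely state, entropy)."""
    most_likely = max(belief, key=belief.get)
    entropy = sum(-p * math.log(p) for p in belief.values() if p > 0)
    return most_likely, entropy

print(compress({"x1": 0.5, "x2": 0.5}))   # ('x1', 0.693...): max uncertainty
print(compress({"x1": 1.0, "x2": 0.0}))   # ('x1', 0.0): certainty
```

Value iteration then runs over this two-component statistic instead of the full simplex, which is what makes the approach tractable for the robot navigation problems described next.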
Approximate POMDP Techniques – AMDP
- The application of AMDPs to mobile robot navigation is called coastal navigation
- It anticipates uncertainty and selects motions that trade off overall path length against the uncertainty accrued along the path
- The resulting trajectories differ significantly from any non-probabilistic solution: being temporarily lost is acceptable if the robot can later re-localize with sufficiently high probability
Approximate POMDP Techniques – AMDP Algorithm
Approximate POMDP Techniques – MC-POMDP
- The Monte Carlo POMDP (MC-POMDP) is the particle-filter version of POMDPs: it calculates a value function defined over sets of particles
- MC-POMDPs use a local learning technique: a locally weighted learning rule combined with a proximity test based on KL divergence
- MC-POMDPs then apply Monte Carlo sampling to implement an approximate value backup
- The result is a full-fledged POMDP algorithm whose computational complexity and accuracy are both functions of the parameters of the learning algorithm
Approximate POMDP Techniques – MC-POMDP
- The belief is represented by a particle set; the value function is defined over such particle sets
Approximate POMDP Techniques – MC-POMDP Algorithm
References
[1] S. Thrun, W. Burgard, D. Fox. Probabilistic Robotics. MIT Press, 2005.
Exercise
Exercise 1 in [1], Chapter 15: A person faces two doors. Behind one is a tiger, behind the other a reward of +10. The person can either listen or open one of the doors. Opening the door with the tiger has an associated cost of -20 (the person is eaten). Listening costs -1. When listening, the person hears a roaring noise that indicates the presence of the tiger, but only with probability 0.85 will the person localize the noise correctly; with probability 0.15, the noise will appear to come from the door hiding the reward.
Questions:
(a) Provide the formal model of the POMDP, defining the state, action, and measurement spaces, the cost function, and the associated probability functions.
(b) What is the expected cumulative payoff/cost of the open-loop action sequence "listen, listen, open door 1"? Explain your calculation.
(c) What is the expected cumulative payoff/cost of the open-loop action sequence "listen, then open the door for which we did not hear a noise"? Again, explain your calculation.