CS594 Automated Decision Making, University of Illinois at Chicago


Planning and Acting in Partially Observable Stochastic Domains
Leslie P. Kaelbling, Michael L. Littman, Anthony R. Cassandra
CS594 Automated Decision Making course presentation
Professor: Piotr
Hui Huang, University of Illinois at Chicago, Oct 30, 2002

Hello everyone, and happy Halloween. Today I am here to introduce this famous paper.

Outline MDP & POMDP Basic POMDP  Belief Space MDP Policy Tree Piecewise Linear Value Function Tiger Problem Before we start, let’s look at the outline for this presentation. As we already know about MDP and POMDP basic knowledge from the previous lectures, and got to know Famous  vectors from Brad… and got to know some fresh ideas about piecewise linear value function from Mingqing … And here It is our time to go deeply to look at these concepts again based on the new interpretation: policy tree and think about them as a whole. First, let us quickly go over these knowledge again and get a more comprehensive understanding on these topics from authors’ point of views. Second, for complex POMDPs problem, this paper introduce an original idea for solving this problem. … For solving this belief MDP problem, we will first look at these 2 properties …, … and then… Finally, let us look at this very interesting problem and see how it will be solved using the above mechanism… Presentation: Planning and Acting in Partially Observable Stochastic Domains

MDP Model
Process: observe state s_t in S; choose action a_t in A_t; receive immediate reward r_t; the state changes to s_{t+1}.
MDP model: <S, A, T, R>.
(Figure: the agent and the environment exchanging states, actions, and rewards: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, ...)

As we already know, the MDP serves as the basis for solving the more complex POMDP problem we are ultimately interested in. The figure shows that an MDP is a model of an agent interacting synchronously with the world. The MDP model can be expressed in the mathematical form <S, A, T, R> that we have seen several times before. Keep in mind that the Markov property holds in this model: the state and reward at time t+1 depend only on the state and the action at time t.
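For concreteness, here is a minimal sketch of this interaction loop; the toy S, A, T, R values and the function names are illustrative placeholders, not anything from the paper.

```python
import random

# A minimal MDP sketch: states S, actions A, transitions T[(s, a)] -> dist over S,
# and rewards R[(s, a)]. All values here are illustrative placeholders.
S = ["s0", "s1"]
A = ["a0", "a1"]
T = {("s0", "a0"): {"s0": 0.9, "s1": 0.1},
     ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
     ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
     ("s1", "a1"): {"s0": 0.0, "s1": 1.0}}
R = {("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
     ("s1", "a0"): 2.0, ("s1", "a1"): -1.0}

def step(s, a):
    """Sample s_{t+1} ~ T(s, a, .) and return (s', r). Markov: depends only on (s, a)."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return s_next, R[(s, a)]

s = "s0"
for t in range(5):
    a = random.choice(A)          # a fixed or learned policy would go here
    s, r = step(s, a)
    print(t, a, s, r)
```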

Policy & Value Function Which action, a, should the agent take? In MDPs: Policy is a mapping from state to action, : S  A Value Function Vt(S) given a policy  The expected sum of reward gained from starting in state s executing non-stationary policy  for t steps. Relation Value function is evaluation for policy based no the long-run value that agent expects to gain from execute the policy. 1. In an MDP, our goal can be Finding optimal policy for all possible states or Finding the sequence of optimal value functions. As we have found, the concept a policy is closely related with the concept of value function. - Policy / Optimal Policy definition… - value function / optimal value function… which is a mapping from a state to a real number given a policy. 2. Relation: Each policy has an associated value function and the better the policy, the better its value function. Presentation: Planning and Acting in Partially Observable Stochastic Domains

Optimization (MDPs)
Recursively calculate the expected long-term reward for each state (or belief), then find the action that maximizes the expected reward.

Based on the policy and value function above, one can construct a value function given a policy, or construct a policy given a value function; the better the value function, the better the policy constructed from it. This leads us to the optimal policy and the sequence of optimal value functions for a given starting state, which is exactly our goal. As for algorithms, we are already familiar with value iteration, which is not shown here. It improves the value function iteratively; each iteration is referred to as a dynamic-programming update (DP update), and it stops when the current value function is sufficiently close to optimal.
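The DP update mentioned here, written out as a short value-iteration sketch for a discrete MDP in the dictionary conventions of the earlier sketch; the function name and tolerance are illustrative.

```python
def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    """Standard value iteration: repeat the DP update until the value function
    changes by less than eps, then read off a greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: max(R[(s, a)] + gamma * sum(p * V[s2]
                        for s2, p in T[(s, a)].items()) for a in A)
                 for s in S}
        delta = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if delta < eps:
            break
    policy = {s: max(A, key=lambda a: R[(s, a)] + gamma *
                     sum(p * V[s2] for s2, p in T[(s, a)].items()))
              for s in S}
    return V, policy
```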

POMDP: Uncertainty
Case 1: uncertainty about the action outcome.
Case 2: uncertainty about the world state due to imperfect (partial) information.

Now we enter the POMDP setting. There are two kinds of uncertainty in this problem. Case 1 is the same as in an MDP; case 2 is new: the agent cannot directly observe the world state.

A broad perspective
(Figure: the world plus the agent; observations feed a belief state inside the agent, which selects actions. Goal = selecting appropriate actions.)

In a POMDP, the current feedback does not provide sufficient information about the world state, so the question of how the agent should select actions arises. A better approach is for the agent to take information from previous steps (the history) into account, based on its observations of the world. All such information can be summarized by a probability distribution over the set of states; in the literature this distribution is referred to as a belief state. As the figure shows, the agent selects actions using only the belief state computed from its observations of the real world. What exactly is a belief state? How does the POMDP model work via belief states? How do we find an optimal policy? Answers to these questions are given in the later slides. In MDPs our goal is to find a mapping from states to actions; here it is instead to find a mapping from probability distributions over states to appropriate actions.

What are POMDPs?
Components: a set of states s ∈ S, a set of actions a ∈ A, and a set of observations o ∈ Ω.
(Figure: a three-state transition diagram with transition probabilities and per-state observation probabilities such as Pr(o1)=0.9, Pr(o2)=0.1.)
POMDP parameters:
Initial belief: b_0(s) = Pr(S = s)
Belief state update: b'(s') = Pr(s' | o, a, b)
Observation probabilities: O(s', a, o) = Pr(o | s', a)
Transition probabilities: T(s, a, s') = Pr(s' | s, a)
Rewards: R(s, a)

Formally, we define POMDPs on this slide. There are three basic components. Let us examine the parameters needed. For the belief state we need an initial belief and a belief-update function for computing the new belief state. O(s', a, o) is the probability of making observation o given that the agent took action a and landed in state s'. The transition probability T(s, a, s') measures how likely the agent is to move to the new state s' from state s after taking action a, and the reward R(s, a) specifies the immediate reward for taking action a in state s. The underlying state transitions, shown in the figure, are those of an MDP.

Belief state
A belief state is a probability distribution over the states of the underlying MDP.
The agent keeps an internal belief state, b, that summarizes its experience.
The agent uses a state estimator, SE, to update the belief state b' based on the last action a_{t-1}, the current observation o_t, and the previous belief state b.
The belief state is a sufficient statistic (it satisfies the Markov property).
(Figure: action and observation flow between the world and the agent; inside the agent, SE feeds the belief b to the policy π.)

Let us look more closely at the belief-state concept. From the figure we can see where the belief state comes from: the state estimator combines the previous belief, the action, and the observation. Note that the policy π is a function of the belief state rather than of the state of the world.
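A sketch of the state estimator SE in the same dictionary conventions as before: the new belief is proportional to O(s', a, o) times the one-step prediction Σ_s T(s, a, s') b(s). The layout of T and O here is an assumption carried over from the earlier sketch, not the paper's code.

```python
def state_estimator(b, a, o, S, T, O):
    """SE(b, a, o): Bayes update of the belief after taking action a and observing o.
    b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') * b(s)."""
    b_new = {}
    for s_next in S:
        pred = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in S)   # prediction step
        b_new[s_next] = O[(s_next, a)][o] * pred                   # correction step
    norm = sum(b_new.values())          # equals Pr(o | a, b); zero means o is impossible
    return {s: p / norm for s, p in b_new.items()}
```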

1D belief space for a 2-state POMDP
(Figure: the belief space as a line segment; from a starting belief (the large dot), each action/observation pair maps to a new belief (the smaller dots), with arcs showing the transformations.)

We have already seen this graphical representation of the belief space and belief-state updating in Mingqing's lecture. The belief space is the set of all possible probability distributions over states. For the simple case of a two-state POMDP we can represent a belief state with a single number: since a belief state is a probability distribution, the probabilities must sum to 1, so if the probability of being in one state is p, the probability of being in the other state must be 1 - p. The entire space of belief states can therefore be represented as a line segment. Now go back to the updating of the belief state discussed earlier. Assume we start with a particular belief state b, take action a1, and receive observation z1 after taking that action. Then our next belief state is fully determined. In fact, since there are a finite number of actions and a finite number of observations, a given belief state has only a finite number of possible next belief states, one for each combination of action and observation. The figure shows this process for a POMDP with two states (s1 and s2), two possible actions (a1 and a2), and three possible observations (z1, z2, and z3): the starting belief state is the large dot, the resulting belief states are the smaller dots, and the arcs represent the process of transforming the belief state. The formula for computing the new degree of belief in a state s' is

b'(s') = Pr(s' | a, o, b) = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b),

whose derivation (basically Bayes' rule plus some careful manipulation) was already shown in Martai's lecture.

POMDP  Continuous-Space Belief MDP a POMDP can be seen as a continuous-space “belief MDP”, as the agent’s belief is encoded through a continuous “belief state”. We may solve this belief MDP like before using value iteration algorithm to find the optimal policy in a continuous space. However, some adaptations for this algorithm are needed. S0 S1 A1 O1 S3 A3 O3 S2 A2 O2 So It turns out that the process of maintaining the belief state is Markovian; the next belief state depends only on the current belief state (and the current action and observation). In fact, we can convert a discrete POMDP problem into a continuous space Belief MDP problem where the continuous space is the belief space. The transitions of this new continuous space Belief MDP are easily derived from the transition and observation probabilities of the POMDP. And now What this means is that we are now back to solving a CO-MDP and we can use the value iteration (VI) algorithm. The Bellman equation…. However, we will need to adapt the algorithm some. Presentation: Planning and Acting in Partially Observable Stochastic Domains

Belief MDP
The policy of a POMDP maps the current belief state into an action. Since the belief state holds all relevant information about the past, the optimal policy of the POMDP is the solution of the (continuous-space) belief MDP.
A belief MDP is a tuple <B, A, ρ, P>:
B = the (infinite) set of belief states
A = the finite set of actions
ρ(b, a) = Σ_s b(s) R(s, a) (reward function)
P(b' | b, a) = Σ_o Pr(o | a, b) P(b' | b, a, o) (transition function),
where P(b' | b, a, o) = 1 if SE(b, a, o) = b', and P(b' | b, a, o) = 0 otherwise.

Now we can give the formal mathematical definition of the belief MDP model, as above.

Thinking: can we solve this belief MDP?
The Bellman equation for this belief MDP is

V_t(b) = max_{a ∈ A} [ ρ(b, a) + γ Σ_{o ∈ Ω} Pr(o | a, b) V_{t-1}(SE(b, a, o)) ].

In the general case, continuous-space MDPs are very hard to solve. Unfortunately, DP updates cannot be carried out directly, because there are uncountably many belief states and one cannot enumerate the value-function equation for every one of them. To conduct the DP steps implicitly, we first exploit two special properties that simplify the problem:
Policy trees
The piecewise-linear and convex property of the value function
and then find an algorithm that constructs the optimal t-step discounted value function over belief space by value iteration.

Value iteration is the standard approach for this problem. As we already know, it consists of two basic steps for computing a near-optimal value function: the dynamic-programming (DP) update and a termination test. The big problem with value iteration here is the continuous state space: the value function is now some arbitrary function over belief space, and if the POMDP has n states, the belief MDP has an n-dimensional continuous state space, so we cannot enumerate a value for every belief. Continuous MDPs are hard to solve in general, so the following slides study the properties of the value function that let us conduct the DP steps implicitly.

Policy Tree
With one step remaining, the agent must take a single action. With two steps to go, it takes an action, makes an observation, and then takes the final action. In general, an agent's t-step policy can be represented as a policy tree.
(Figure: a tree of depth t; the root action has T steps to go, each observation o1 ... ok leads to a subtree with T-1 steps to go, down to single actions with 1 step to go.)

An agent's non-stationary t-step policy can be represented as a policy tree, which is just a graphical representation of a policy, whether for an MDP or a POMDP. Why do we say that? For an MDP there are no observation probabilities: the agent has full information about where it is, so the subtrees under a root would represent only the possible next actions. For a POMDP the next action also depends on the observation, so there is uncertainty over actions, in addition to the uncertainty over states, which is not shown explicitly in this picture. The policy tree is a tree of depth t: the top node determines the first action to be taken; then, depending on the resulting observation, an arc is followed to a node on the next level, which determines the next action. Formally, a t-step policy tree has two components: an action a for the current time point and a (t-1)-step policy tree for each possible observation, covering the remaining t-1 time points. By simply altering the actions at the action nodes, one obtains different t-step policy trees.
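One plausible way to encode a policy tree in code (an illustrative structure, not the paper's implementation): a node stores the root action and one (t-1)-step subtree per observation.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    """t-step policy tree: an action at the root and one (t-1)-step subtree per
    observation. A 1-step tree has no children."""
    action: str
    children: Dict[str, "PolicyTree"] = field(default_factory=dict)  # observation -> subtree

# Example: a 2-step tree for the tiger problem, listen first, then act on what was heard.
listen_then_act = PolicyTree("listen", {
    "TL": PolicyTree("open-right"),
    "TR": PolicyTree("open-left"),
})
```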

Value function for a policy tree p
If p is a one-step policy tree, then the value of executing its action in state s is
V_p(s) = R(s, a(p)).
More generally, if p is a t-step policy tree, then
V_p(s) = R(s, a(p)) + γ · (expected value of the future)
       = R(s, a(p)) + γ Σ_{s'} T(s, a(p), s') Σ_{o ∈ Ω} O(s', a(p), o) V_{o(p)}(s'),
where o(p) is the (t-1)-step subtree of p associated with observation o.
Thus V_p(s) can be thought of as a vector associated with the policy tree p, since its dimension equals the number of states. We often use the notation α_p to refer to this vector.

Knowing this, what is the expected discounted value gained from executing a policy tree p? In the simplest case, a(p) is the action specified at the top node of p, and the value is just the immediate reward. More generally, the value function sums the immediate reward for performing the action at the root and the expected discounted reward the agent receives over the next t-1 time points if the world is currently in state s and the agent behaves according to p. Look at the equation for the expected value of the future: it first takes an expectation over the possible next states s' from the given start state s; then, since the value at each target state depends on which subtree is executed, which in turn depends on the observation, we take another expectation, over all possible observations, of the value of executing the associated subtree o(p) starting in state s'.
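The same recursion written as code, under the T/O/R dictionary conventions assumed in the earlier sketches; alpha_vector collects V_p(s) for every state into the vector α_p.

```python
def policy_tree_value(p, s, S, T, O, R, gamma):
    """V_p(s) = R(s, a(p)) + gamma * sum_{s'} T(s,a,s') * sum_o O(s',a,o) * V_{o(p)}(s')."""
    a = p.action
    v = R[(s, a)]
    if not p.children:                      # 1-step tree: just the immediate reward
        return v
    future = 0.0
    for s_next in S:
        for o, subtree in p.children.items():
            future += (T[(s, a)].get(s_next, 0.0) * O[(s_next, a)][o] *
                       policy_tree_value(subtree, s_next, S, T, O, R, gamma))
    return v + gamma * future

def alpha_vector(p, S, T, O, R, gamma):
    """The alpha vector of tree p: one V_p(s) entry per state, in the order of S."""
    return [policy_tree_value(p, s, S, T, O, R, gamma) for s in S]
```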

Value function over belief space
Since the exact world state cannot be observed, the agent must compute an expectation over world states of executing policy tree p from belief state b:
V_p(b) = Σ_s b(s) V_p(s).
If we let α_p = <V_p(s_1), ..., V_p(s_n)>, then V_p(b) = b · α_p.
To construct an optimal t-step policy, we must maximize over all t-step policy trees P:
V_t(b) = max_{p ∈ P} b · α_p.
Since V_p(b) is linear in b for each p ∈ P, V_t(b) is the upper surface of those functions; that is, V_t(b) is piecewise linear and convex.

Because the agent never knows the exact state of the world, it must be able to determine the value of executing a policy tree p from some belief state b; this is just an expectation over world states of executing p in each state. If we represent the belief state as a vector (the probability of each state), then the value of a belief point b for a given policy tree p is simply the dot product of the two vectors. To find an optimal t-step policy, we compute the maximum value by searching the set P of all possible t-step policy trees. Since the belief space is continuous and V_p(b) is linear in b for each p, V_t(b) is the upper surface of those linear functions, and is therefore piecewise linear and convex.
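A small numpy sketch of this upper surface: each α-vector defines a linear function b · α_p, and the value at a belief is their pointwise maximum (the function name is illustrative).

```python
import numpy as np

def value_at_belief(b, alphas):
    """V_t(b) = max_p  b . alpha_p : the upper surface over a list of alpha vectors."""
    b = np.asarray(b)
    values = [float(np.dot(b, np.asarray(alpha))) for alpha in alphas]
    best = int(np.argmax(values))
    return values[best], best        # the value and the index of the dominating tree
```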

Illustration: piecewise-linear value function
Let V_p1, V_p2, and V_p3 be the value functions induced by policy trees p1, p2, and p3. Each of these value functions has the form
V_{p_i}(b) = b · α_{p_i} = Σ_s b(s) V_{p_i}(s),
which is a linear function of b. Thus each value function can be shown as a line, plane, or hyperplane, depending on the number of states, and the optimal t-step value function can, in the case of two states (b(s1) = 1 - b(s2)), be illustrated as the upper surface in two dimensions.

To understand this piecewise-linear property of the value function, let us look at a simple example with V_p1, V_p2, and V_p3.

Picture: optimal t-step value function
(Figure: the three lines V_p1, V_p2, V_p3 plotted over the belief axis b(s0) from 0 (s = s1) to 1 (s = s0); their upper surface is highlighted in red.)

Plotting the value function of each policy tree p1, p2, p3 as a line, we find that the belief space is split into regions. Within each belief region the optimal t-step value is the maximum value obtained by selecting the best policy tree, so the final optimal t-step value function is the upper surface shown in red.

Optimal t-step policy
The optimal t-step policy is determined by projecting the optimal value function back down onto the belief space. The projection of the optimal t-step value function yields a partition of the belief space into regions; within each region there is a single policy tree p such that b · α_p is maximal over the entire region, and the optimal action in that region is a(p), the action at the root of the policy tree p.
(Figure: the regions of the belief axis b(s0), labeled with the root actions a(p1), a(p2), a(p3) of the dominating trees.)

Up to now we have examined two important properties of the belief MDP: policy trees and the piecewise-linear value function. Projecting the optimal value function down gives the optimal t-step policy, a mapping from belief states to actions. We do not present the detailed value-iteration algorithms for solving POMDPs here; you can find them in the paper.

First step of value iteration
One-step policy trees are just actions. To do a DP backup, we evaluate every possible 2-step policy tree.
(Figure: the two one-step trees a0 and a1, and the eight 2-step trees formed by choosing a root action and one one-step subtree for each observation o0, o1.)

Let us see the process of computing the value function for a POMDP using value iteration, with the same basic structure as in the discrete MDP case. One of the simplest algorithms is exhaustive enumeration followed by pruning. The basic idea is the following: V_{t-1}, the set of useful (t-1)-step policy trees, can be used to construct a superset V_t^+ of the useful t-step policy trees. A t-step policy tree is composed of a root node with an associated action a and |Ω| subtrees, each a (t-1)-step policy tree.
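A sketch of that exhaustive construction, reusing the PolicyTree structure from the earlier sketch; the candidate set V_t^+ has |A| · |V_{t-1}|^{|Ω|} trees, which is why pruning matters.

```python
from itertools import product

def enumerate_trees(actions, observations, prev_trees):
    """Build the superset V_t^+ of t-step policy trees from the useful (t-1)-step trees.
    Its size is |actions| * |prev_trees| ** |observations|."""
    new_trees = []
    for a in actions:
        for combo in product(prev_trees, repeat=len(observations)):
            new_trees.append(PolicyTree(a, dict(zip(observations, combo))))
    return new_trees
```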

Pruning
Some policy trees are dominated and are never useful; they are pruned and not considered on the next step. The key idea is to prune before evaluating (the Witness and Incremental Pruning algorithms do this).
(Figure: V_p2 lies everywhere below the upper surface of V_p1 and V_p3 over the belief axis b(s0), so p2 is dominated.)

A value function (policy tree) is dominated if it is never the best anywhere in the belief space. This gives a parsimonious representation of the value function: given a set of policy trees V, it is possible to define a unique minimal subset of V that represents the same value function, and a policy tree is useful if it is a component of this parsimonious representation. Although the algorithm keeps parsimonious representations of the value functions at each step, there is still much more work to do: even if V_t is very small, the algorithm goes through the step of generating V_t^+, whose size is always exponential in |Ω|. This is why the paper introduces a novel algorithm, the witness algorithm, which attempts to be more efficient than exhaustively generating V_t^+.
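For illustration only, here is the weakest form of pruning, a pointwise-dominance filter over α-vectors given as lists of per-state values; the witness and incremental-pruning algorithms mentioned above additionally use linear programs to remove vectors dominated by combinations of others, which this simple check cannot detect.

```python
def prune_pointwise(alphas):
    """Keep only alpha vectors (lists of floats) that are not pointwise-dominated by
    another vector. Weak but safe; LP-based pruning removes more."""
    kept = []
    for i, a in enumerate(alphas):
        dominated = False
        for j, b in enumerate(alphas):
            if j == i:
                continue
            # b dominates a if it is at least as good in every state
            # (ties between identical vectors are broken by index so one copy survives).
            if all(bk >= ak for ak, bk in zip(a, b)) and (b != a or j < i):
                dominated = True
                break
        if not dominated:
            kept.append(a)
    return kept
```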

Value iteration (belief MDP)
Keep doing backups until the value function doesn't change much anymore. In the worst case, you end up considering every possible policy tree, but hopefully you will converge before this happens.
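One simple, illustrative way to implement the stopping test is to compare the old and new upper surfaces on a sample of belief points; this is an assumption about how to operationalize "doesn't change much", not the paper's exact termination criterion.

```python
import numpy as np

def value_gap(alphas_old, alphas_new, belief_samples):
    """Max difference between the old and new upper surfaces over sampled beliefs."""
    gap = 0.0
    for b in belief_samples:
        v_old = max(float(np.dot(b, a)) for a in alphas_old)
        v_new = max(float(np.dot(b, a)) for a in alphas_new)
        gap = max(gap, abs(v_new - v_old))
    return gap

# e.g. stop the backups once value_gap(V_prev, V_curr, beliefs) < 1e-4
```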

A POMDP example: the tiger problem
States: S0 "tiger-left", S1 "tiger-right".
Observation model for listening: Pr(o=TL | S0, listen) = 0.85, Pr(o=TR | S0, listen) = 0.15; Pr(o=TL | S1, listen) = 0.15, Pr(o=TR | S1, listen) = 0.85.
Actions = {0: listen, 1: open-left, 2: open-right}.
Reward function: penalty for opening the wrong door: -100; reward for opening the correct door: +10; cost of the listening action: -1.
Observations: hear the tiger on the left (TL) or hear the tiger on the right (TR).

Finally, let us look at an interesting POMDP, the tiger problem, and examine the properties of POMDP policies discussed above. Imagine an agent standing in front of two closed doors. Behind one of the doors is a tiger, and behind the other is a large reward. If the agent opens the door with the tiger, a large penalty of -100 is received; conversely, the agent gets a reward of +10 for opening the correct door. Instead of opening one of the two doors, the agent can listen in order to gain some information about the location of the tiger. Unfortunately, listening is not free: it costs -1. We refer to the state of the world when the tiger is on the left as S0 and when the tiger is on the right as S1. The actions are listen, open the left door, and open the right door. There are only two observations: hearing the tiger on the left (TL) or hearing the tiger on the right (TR). Immediately after the agent opens a door and receives a reward or penalty, the problem resets, randomly relocating the tiger behind one of the two doors. The transition and observation models are described in detail on the next slides.
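The full tiger model in the dictionary conventions of the earlier sketches; the numbers come from the transition, observation, and reward tables on the next slides.

```python
S = ["tiger-left", "tiger-right"]
A = ["listen", "open-left", "open-right"]
Z = ["TL", "TR"]

# Transitions T[(s, a)] -> distribution over next states:
# listening leaves the world unchanged; opening a door resets the tiger at random.
T = {(s, "listen"): {s: 1.0} for s in S}
T.update({(s, a): {"tiger-left": 0.5, "tiger-right": 0.5}
          for s in S for a in ["open-left", "open-right"]})

# Observations O[(s', a)] -> distribution over observations:
# listening is correct 85% of the time; opening a door tells you nothing.
O = {("tiger-left", "listen"):  {"TL": 0.85, "TR": 0.15},
     ("tiger-right", "listen"): {"TL": 0.15, "TR": 0.85}}
O.update({(s, a): {"TL": 0.5, "TR": 0.5}
          for s in S for a in ["open-left", "open-right"]})

# Rewards R[(s, a)]: -1 to listen, +10 for the correct door, -100 for the tiger.
R = {(s, "listen"): -1.0 for s in S}
R.update({("tiger-left", "open-left"): -100.0, ("tiger-left", "open-right"): 10.0,
          ("tiger-right", "open-left"): 10.0, ("tiger-right", "open-right"): -100.0})
```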

Tiger problem (transition probabilities)

T(s, LISTEN, s'):
              s' = tiger-left   s' = tiger-right
tiger-left         1.0               0.0
tiger-right        0.0               1.0

T(s, OPEN-LEFT, s'):
              s' = tiger-left   s' = tiger-right
tiger-left         0.5               0.5
tiger-right        0.5               0.5

T(s, OPEN-RIGHT, s'):
              s' = tiger-left   s' = tiger-right
tiger-left         0.5               0.5
tiger-right        0.5               0.5

For the transition model T(s, a, s'): the LISTEN action does not change the state of the world, while OPEN-LEFT and OPEN-RIGHT cause a transition to state S0 with probability 0.5 and to state S1 with probability 0.5 (essentially resetting the problem).

Tiger problem (observation probabilities)

O(s', LISTEN, o):
              o = TL    o = TR
tiger-left     0.85      0.15
tiger-right    0.15      0.85

O(s', OPEN-LEFT, o):
              o = TL    o = TR
tiger-left     0.5       0.5
tiger-right    0.5       0.5

O(s', OPEN-RIGHT, o):
              o = TL    o = TR
tiger-left     0.5       0.5
tiger-right    0.5       0.5

For the observation model O(s', a, o): when the world is in state S0, the LISTEN action results in observation TL with probability 0.85 and observation TR with probability 0.15, and conversely for world state S1. No matter what state the world is in, the LEFT and RIGHT actions result in either observation with probability 0.5.

Tiger problem (immediate rewards)

R(s, LISTEN):      tiger-left  -1      tiger-right  -1
R(s, OPEN-LEFT):   tiger-left  -100    tiger-right  +10
R(s, OPEN-RIGHT):  tiger-left  +10     tiger-right  -100

The tiger problem: state tracking
(Figure: the belief vector over S1 "tiger-left" and S2 "tiger-right".)

From the current belief state, we listen according to the policy; we then reach the new state, make an observation, and compute the new belief state from the previous belief, the last action, and the observation.

The tiger problem: state tracking
(Figure: the updated belief vector over S1 "tiger-left" and S2 "tiger-right" after action = listen, observation = hear-tiger-left.)

The tiger problem: state tracking
(Figure: the updated belief vector over S1 "tiger-left" and S2 "tiger-right" after action = listen, observation = growl-left.)

Tiger example: optimal policy for t = 1
α-vectors, as values in states (tiger-left, tiger-right):
α0(1) = (-100.0, 10.0), α1(1) = (-1.0, -1.0), α2(1) = (10.0, -100.0).
Belief-space partition over b(S0) = P(tiger-left):
[0.00, 0.10] → open-left; [0.10, 0.90] → listen; [0.90, 1.00] → open-right.

Let us begin with the situation at step t = 1, when the agent gets to make only a single decision. If the agent believes with high probability that the tiger is on the left, then the best action is to open the right door; if it believes the tiger is on the right, it should open the left door. So we have the policy trees "left" and "right", which simply specify the action to take. But what if the agent is highly uncertain about the tiger's location? The best thing to do is listen. That is because guessing incorrectly incurs a penalty of -100, whereas guessing correctly yields a reward of +10: when the agent's belief has no bias either way, it will guess wrong as often as it guesses right, so its expected reward for opening a door is (-100 + 10)/2 = -45, while listening always has value -1, which is greater than the value of opening a door at random. The one-step optimal policy is shown above: each policy tree is shown as a node, below it the belief interval over which that tree dominates, and above it the corresponding α-vector. Together the intervals form a partition of the belief space.
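A quick numerical check of these regions from the α-vectors (illustrative code, not from the paper): sweeping the belief in tiger-left shows open-left winning up to about 0.10, open-right from about 0.90, and listen in between, matching the intervals above.

```python
import numpy as np

# Alpha vectors over (tiger-left, tiger-right) for the three 1-step policies.
alphas = {"open-left":  np.array([-100.0, 10.0]),
          "listen":     np.array([-1.0, -1.0]),
          "open-right": np.array([10.0, -100.0])}

for p_left in np.linspace(0.0, 1.0, 11):
    b = np.array([p_left, 1.0 - p_left])
    best = max(alphas, key=lambda name: float(np.dot(b, alphas[name])))
    print(f"b(tiger-left)={p_left:.1f}  best 1-step action: {best}")
# open-left is best up to b ~ 0.10, open-right from b ~ 0.90, listen in between.
```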

Tiger example: optimal policy for t = 2
Belief intervals over b(S0): [0.00, 0.02], [0.02, 0.39], [0.39, 0.61], [0.61, 0.98], [0.98, 1.00]; the root action in every interval is listen, and the second-step action (open-right, listen, or open-left) depends on the observation (TL/TR).

We now move to the case in which the agent can act for two steps. The policy for t = 2 is shown in this figure. We find an interesting property: the agent always chooses to listen first. There is a logical reason for this: if the agent were to open one of the doors at t = 2, then, because of the way the problem has been formulated, the tiger would be randomly placed behind one of the doors and the agent's belief state would be reset to [0.5, 0.5]. After opening the door, the agent would be left with no information about the tiger's location and with one action remaining, and we just saw that with one step to go and b = (0.5, 0.5) the best thing to do is listen. It is therefore a better strategy to listen at t = 2 in order to make a more informed decision on the last step. Another interesting feature is that there are multiple policy trees with the same action at the root: their vectors partition the belief space into pieces that have structural similarities.

References
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains.
A. R. Cassandra. Optimal Policies for Partially Observable Markov Decision Processes, 1995.
M. Hauskrecht. Planning and Control in Stochastic Domains with Imperfect Information.
J. Pineau. Hierarchical Methods for Planning under Uncertainty.
POMDP tutorial on Tony's POMDP Page: http://www.cs.brown.edu/research/ai/pomdp/
D. Bernstein. Partial Observability, 2002.

Thank you.