1
An Introduction to PO-MDP Presented by Alp Sardağ
2
MDP Components: – State – Action – Transition – Reinforcement Problem: – Choose the action that makes the right tradeoff between immediate rewards and future gains, to yield the best possible solution. Solution: – A policy, represented by a value function.
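As a concrete illustration of these components, here is a minimal sketch in Python; the two states, two actions, and every transition probability and reward below are made-up values for illustration only, not taken from the presentation.

```python
# Hypothetical two-state, two-action MDP used only to illustrate the components.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# Transition model T[s][a][s_next] = P(s_next | s, a); each row sums to 1.
T = {
    "s1": {"a1": {"s1": 0.9, "s2": 0.1}, "a2": {"s1": 0.2, "s2": 0.8}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 0.1, "s2": 0.9}},
}

# Reinforcement (immediate reward) R[s][a].
R = {
    "s1": {"a1": 1.0, "a2": 0.0},
    "s2": {"a1": 0.0, "a2": 1.5},
}

# A policy maps states to actions; a value function maps states to expected return.
policy = {"s1": "a1", "s2": "a2"}
value = {"s1": 0.0, "s2": 0.0}
```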
3
Definitions Horizon length; Value Iteration; Temporal Difference Learning: Q(x,a) ← Q(x,a) + α·(r + γ·max_b Q(y,b) - Q(x,a)), where α is the learning rate and γ is the discount rate. Adding PO to a CO-MDP is not trivial: – These methods require complete observability of the state. – PO clouds the current state.
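A minimal sketch of the temporal-difference update above, assuming a tabular Q function stored in a dictionary; the learning rate and discount rate values are illustrative.

```python
from collections import defaultdict

# Tabular Q function: Q[(state, action)] -> estimated value, default 0.
Q = defaultdict(float)

ALPHA = 0.1   # learning rate (alpha), illustrative value
GAMMA = 0.95  # discount rate (gamma), illustrative value

def td_update(x, a, r, y, actions):
    """Apply Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    best_next = max(Q[(y, b)] for b in actions)
    Q[(x, a)] += ALPHA * (r + GAMMA * best_next - Q[(x, a)])

# Example: after taking action "a1" in state "s1", receiving reward 1.0,
# and landing in state "s2":
td_update("s1", "a1", 1.0, "s2", actions=["a1", "a2"])
```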
4
PO-MDP Components: –States –Actions –Transitions –Reinforcement –Observations
5
Mapping in CO-MDP & PO-MDP In CO-MDPs, mapping is from states to actions. In PO-MDPs, mapping is from probability distributions (over states) to actions.
6
VI in CO-MDP & PO-MDP In a CO-MDP, – Track our current state – Update it after each action In a PO-MDP, – Maintain a probability distribution over states – Perform an action and make an observation, then update the distribution
7
Belief State and Space Belief State: a probability distribution over states. Belief Space: the entire probability space. Example: – Assume a two-state PO-MDP. – P(s1) = p and P(s2) = 1-p, so every belief lies on a line segment. – In higher dimensions the line becomes a hyper-plane (the probability simplex).
8
Belief Transform Assumptions: – Finite set of actions – Finite set of observations – Next belief state = T(cbf, a, o), where cbf is the current belief state, a the action, and o the observation. Hence there is only a finite number of possible next belief states.
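A minimal sketch of such a belief transform, assuming the transition model `trans[a][s][s_next]` = P(s_next | s, a) and the observation model `obs[a][s_next][o]` = P(o | s_next, a); the model and all numbers are illustrative assumptions, not from the slides.

```python
def belief_update(cbf, a, o, trans, obs):
    """Return the next belief state T(cbf, a, o) via a Bayes update.

    b'(s_next) is proportional to P(o | s_next, a) * sum_s P(s_next | s, a) * b(s).
    """
    n = len(cbf)
    new_b = []
    for s_next in range(n):
        predicted = sum(trans[a][s][s_next] * cbf[s] for s in range(n))
        new_b.append(obs[a][s_next][o] * predicted)
    total = sum(new_b)  # P(o | cbf, a); assumed non-zero here
    return [x / total for x in new_b]

# Illustrative two-state model with one action (index 0) and two observations.
trans = [[[0.7, 0.3], [0.4, 0.6]]]   # trans[a][s][s_next]
obs = [[[0.9, 0.1], [0.2, 0.8]]]     # obs[a][s_next][o]
print(belief_update([0.5, 0.5], 0, 0, trans, obs))
```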
9
PO-MDP into continuous CO-MDP The process is Markovian: the next belief state depends only on the – current belief state – current action – observation. The discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem where the continuous space is the belief space.
10
Problem Using VI in a continuous state space: there is no nice tabular representation as before.
11
PWLC Restrictions on the form of the solutions to the continuous-space CO-MDP: – The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length. – Each linear segment can be written as a vector, so the value of a belief point under a segment is simply the dot product of the belief state and that vector. GOAL: for each iteration of value iteration, find a finite number of linear segments that make up the value function.
12
Steps in VI Represent the value function for each horizon as a set of vectors. – This overcomes the problem of representing a value function over a continuous space. To evaluate a belief state, find the vector that has the largest dot product with it.
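A minimal sketch of this lookup, assuming the value function for a horizon is given as a set of vectors; the vectors and belief below are illustrative.

```python
def value_of_belief(belief, vectors):
    """Value of a belief under a PWLC value function: the largest dot product
    between the belief and any vector in the set."""
    return max(sum(b * v for b, v in zip(belief, vec)) for vec in vectors)

def best_vector(belief, vectors):
    """The vector that achieves that maximum."""
    return max(vectors, key=lambda vec: sum(b * v for b, v in zip(belief, vec)))

# Illustrative vector set for a two-state problem.
vectors = [(1.5, 0.2), (0.3, 1.1), (0.8, 0.8)]
belief = [0.4, 0.6]
print(round(value_of_belief(belief, vectors), 3))  # 0.8, from the vector (0.8, 0.8)
```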
13
PO-MDP Value Iteration Example Assumptions: – Two states – Two actions – Three observations Example: horizon length is 1, b = [0.25 0.75]. Immediate rewards: R(s1,a1) = 1, R(s2,a1) = 0, R(s1,a2) = 0, R(s2,a2) = 1.5. V(a1,b) = 0.25 x 1 + 0.75 x 0 = 0.25; V(a2,b) = 0.25 x 0 + 0.75 x 1.5 = 1.125, so a2 is the best action for this belief; for beliefs that put enough weight on s1, a1 is the best.
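The same horizon-1 computation, written as a short Python sketch using the belief and reward numbers from this example; each action's immediate rewards act as a vector and the value is its dot product with the belief.

```python
# Immediate rewards from the example, one vector per action: (R(s1,a), R(s2,a)).
rewards = {"a1": (1.0, 0.0), "a2": (0.0, 1.5)}
b = (0.25, 0.75)  # belief: P(s1) = 0.25, P(s2) = 0.75

values = {a: b[0] * r[0] + b[1] * r[1] for a, r in rewards.items()}
print(values)                       # {'a1': 0.25, 'a2': 1.125}
print(max(values, key=values.get))  # 'a2' is the best action for this belief
```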
14
PO-MDP Value Iteration Example The value of a belief state for horizon length 2, given b, a1, z1: – the immediate reward plus the value of the next action. – Find the best achievable value for the belief state that results from our initial belief state b when we perform action a1 and observe z1.
15
Find the value for all the belief points given this fixed action and observation. The transformed value function is also PWLC.
16
PO-MDP Value Iteration Example How do we compute the value of a belief state given only the action? The horizon-2 value of the belief state, given: – the value of the resulting belief state for each observation: z1: 0.7, z2: 0.8, z3: 1.2; – P(z1 | b,a1) = 0.6; P(z2 | b,a1) = 0.25; P(z3 | b,a1) = 0.15; is the probability-weighted sum 0.6 x 0.7 + 0.25 x 0.8 + 0.15 x 1.2 = 0.8.
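A short sketch of this probability-weighted sum in Python, using the observation probabilities and per-observation values from this example.

```python
# P(z | b, a1) for each observation, and the value of the resulting belief state.
obs_prob = {"z1": 0.6, "z2": 0.25, "z3": 0.15}
obs_value = {"z1": 0.7, "z2": 0.8, "z3": 1.2}

# Horizon-2 value of b when the action is fixed at a1.
value = sum(obs_prob[z] * obs_value[z] for z in obs_prob)
print(round(value, 3))  # 0.8
```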
17
Transformed Value Functions Each of these transformed functions partitions the belief space differently. The best next action to perform depends upon the initial belief state and the observation.
18
Best Value for Belief States The value of every single belief point is the sum of: – the immediate reward, and – the line segments from the S() functions for each observation's future strategy. Since adding lines gives you lines, the result is linear.
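A minimal sketch of how one such line is assembled for a fixed action, assuming the transformed vectors S(a, z) are already available for each observation; every number below is an illustrative placeholder.

```python
from itertools import product

def build_vectors(immediate, s_vectors):
    """For one action, combine the immediate-reward vector with each choice of
    future strategy (one transformed vector per observation). Adding vectors
    (lines) gives vectors (lines), so every result is again linear."""
    combined = []
    for choice in product(*s_vectors.values()):
        vec = list(immediate)
        for seg in choice:
            vec = [v + s for v, s in zip(vec, seg)]
        combined.append(tuple(vec))
    return combined

# Illustrative numbers for a two-state problem, action a1, observations z1..z3.
immediate_a1 = (1.0, 0.0)
S_a1 = {
    "z1": [(0.2, 0.1), (0.0, 0.3)],  # transformed vectors for observation z1
    "z2": [(0.1, 0.1)],
    "z3": [(0.0, 0.2), (0.3, 0.0)],
}
print(build_vectors(immediate_a1, S_a1))
```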
19
Best Strategy for Any Belief Point All the useful future strategies are easy to pick out.
20
Value Function and Partition For the specific action a1, the value function and corresponding partitions:
21
Value Function and Partition For the specific action a2, the value function and corresponding partitions:
22
Which Action to Choose? Put the value functions for each action together to see where each action gives the highest value.
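A sketch of that comparison, assuming each action's value function is given as its own set of vectors; the vector sets below are illustrative.

```python
def best_action(belief, action_vectors):
    """Pick the action whose best vector gives the highest value at this belief."""
    def action_value(vecs):
        return max(sum(b * v for b, v in zip(belief, vec)) for vec in vecs)
    return max(action_vectors, key=lambda a: action_value(action_vectors[a]))

# Illustrative vector sets for two actions in a two-state problem.
action_vectors = {
    "a1": [(1.3, 0.4), (1.0, 0.9)],
    "a2": [(0.5, 1.6)],
}
print(best_action([0.25, 0.75], action_vectors))  # 'a2' for this belief
```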
23
Compact Horizon 2 Value Function
24
Value Function for Action a1 with a Horizon of 3
25
Value Function for Action a2 with a Horizon of 3
26
Value Function for Both Actions with a Horizon of 3
27
Value Function for a Horizon of 3