Partially Observable Markov Decision Process and RL
Professor Shiyan Hu, Michigan Technological University
Markov Models
Markov Chain (MC): no observation uncertainty, no decision
Hidden Markov Model (HMM): with observation uncertainty, no decision
Markov Decision Process (MDP): no observation uncertainty, with decision
Partially Observable Markov Decision Process (POMDP): with both observation uncertainty and decision
Markov Chain
Definition: for a time series s[1], s[2], …, s[T−1] with s[i] ∈ S,
P(s[i+1] | s[i], s[i−1], …, s[0]) = P(s[i+1] | s[i]),  and  P(s[1], s[2], …, s[T−1]) > 0.
Components:
System states S = {s_0, s_1, …, s_{N−1}}, e.g., {cold, warm, hot}.
Transition probabilities, given as a matrix over the states (the slide shows an example matrix for {cold, warm, hot}); a small simulation sketch follows below.
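To make the transition matrix concrete, here is a minimal Octave/MATLAB simulation sketch. The matrix values below are illustrative only; the slide's example numbers for {cold, warm, hot} are not recoverable from the transcript.

% Minimal Markov-chain simulation sketch (Octave/MATLAB).
% The transition matrix P is illustrative, not the slide's example values.
states = {'cold', 'warm', 'hot'};
P = [0.6 0.3 0.1;    % P(next | cold)
     0.2 0.5 0.3;    % P(next | warm)
     0.1 0.4 0.5];   % P(next | hot)  -- each row sums to 1
T = 10;                      % length of the simulated trajectory
s = 1;                       % start in state 'cold'
traj = cell(1, T);
for t = 1:T
    traj{t} = states{s};
    s = find(rand <= cumsum(P(s, :)), 1);   % sample the next state from row s
end
disp(strjoin(traj, ' -> '));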
Hidden Markov Model (HMM)
States: s[1] → s[2] → s[3] → … → s[T−1]
Observations: o[1], o[2], o[3], …, o[T−1]
The system state is the underlying rule of the real world; it is not visible and can only be estimated from observations (which carry uncertainty).
Components:
State: s
Observation: o
State transition probability: s → s′
Observation probability: s → o
An Example of HMM
Estimate the climate from tree rings.
States: {H (hot), C (cold)}. Observations (tree-ring size): {L (large), M (medium), S (small)}.
State transition probabilities (rows = current state):
       H    C
  H   0.7  0.3
  C   0.4  0.6
Observation probabilities:
       S    M    L
  H   0.1  0.4  0.5
  C   0.7  0.2  0.1
Solving HMM Problems
State transition probabilities:
       H    C
  H   0.7  0.3
  C   0.4  0.6
Observation probabilities:
       S    M    L
  H   0.1  0.4  0.5
  C   0.7  0.2  0.1
The initial distribution of states is {P(H) = 0.6, P(C) = 0.4}.
Suppose that the observations over four years are {S, M, S, L}. What is the most likely climate sequence?
For example, P(HHCC) = P(H)·P(S|H)·P(H|H)·P(M|H)·P(C|H)·P(S|C)·P(C|C)·P(L|C)
                     = 0.6·0.1·0.7·0.4·0.3·0.7·0.6·0.1 ≈ 0.000212
We compute the probability corresponding to each possible state sequence, such as P(HHHH), P(HHCH), P(HCHC), …
Solving HMM by Enumeration
Among all state sequences, CCCH has the largest probability, so it is chosen as the estimated state sequence.
This approach computes the probability of |S|^T sequences, where T is the length of the sequence and |S| is the number of system states, so its complexity is exponential in T.
Stamp, Mark. "A Revealing Introduction to Hidden Markov Models."
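As a concrete illustration of the enumeration approach, here is a minimal Octave/MATLAB sketch. The variable names (A, B, pi0, obs) are mine; the numbers are the ones from the slides.

% Brute-force HMM decoding by enumeration (sketch).
% States: 1 = H, 2 = C.  Observations: 1 = S, 2 = M, 3 = L.
A   = [0.7 0.3; 0.4 0.6];          % A(i,j) = P(next = j | current = i)
B   = [0.1 0.4 0.5; 0.7 0.2 0.1];  % B(i,k) = P(observe k | state = i)
pi0 = [0.6 0.4];                   % initial state distribution
obs = [1 2 1 3];                   % observed sequence {S, M, S, L}
T = numel(obs); N = numel(pi0);
best_p = -inf; best_seq = [];
for idx = 0:(N^T - 1)                             % enumerate all |S|^T sequences
    seq = mod(floor(idx ./ N.^(0:T-1)), N) + 1;   % digits of idx in base N
    p = pi0(seq(1)) * B(seq(1), obs(1));
    for t = 2:T
        p = p * A(seq(t-1), seq(t)) * B(seq(t), obs(t));
    end
    if p > best_p
        best_p = p; best_seq = seq;
    end
end
fprintf('Most likely sequence: %s  (p = %.6g)\n', sprintf('%d', best_seq), best_p);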
Solving HMM by Dynamic Programming
First year, observation S: P(H) = 0.6·0.1 = 0.06, P(C) = 0.4·0.7 = 0.28.
Second year, observation M:
P(HH) = 0.06·0.7·0.4 = 0.0168
P(HC) = 0.06·0.3·0.2 = 0.0036
P(CH) = 0.28·0.4·0.4 = 0.0448
P(CC) = 0.28·0.6·0.2 = 0.0336
HH and HC are pruned, since those two sequences cannot appear in the optimal sequence.
At each step, we keep only the two sequences with the largest probabilities: one among those ending with H and one among those ending with C.
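The same pruning idea, written as a short Octave/MATLAB sketch (a Viterbi-style recursion over the slide's numbers; variable names are mine):

% Dynamic-programming (Viterbi-style) decoding sketch.
A   = [0.7 0.3; 0.4 0.6];          % state transition probabilities
B   = [0.1 0.4 0.5; 0.7 0.2 0.1];  % observation probabilities
pi0 = [0.6 0.4];                   % initial distribution {P(H), P(C)}
obs = [1 2 1 3];                   % {S, M, S, L}
T = numel(obs); N = numel(pi0);
delta = zeros(T, N);               % best probability of any sequence ending in state j at time t
psi   = zeros(T, N);               % back-pointer to the best predecessor state
delta(1, :) = pi0 .* B(:, obs(1))';
for t = 2:T
    for j = 1:N
        [delta(t, j), psi(t, j)] = max(delta(t-1, :) .* A(:, j)');
        delta(t, j) = delta(t, j) * B(j, obs(t));
    end
end
% Backtrack the best sequence (1 = H, 2 = C).
[~, seq(T)] = max(delta(T, :));
for t = T-1:-1:1
    seq(t) = psi(t+1, seq(t+1));
end
disp(seq)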
Markov Decision Process (MDP)
Given the current state and the state transition probability matrix, an MDP determines the best decision, i.e., the one leading to the maximum expected reward.
There is no observation or observation uncertainty.
Comparison of MDP and RL:
                                    MDP           RL
State transition probability?       Yes           No
Reward function?                    Specified     Not analytically given
Partially Observable Markov Decision Process (POMDP)
Given the current observation (with uncertainty) and the state transition probability matrix, a POMDP determines the best decision, i.e., the one leading to the maximum expected reward.
Model the past, model the present, and predict the future (probabilistic long-term reward).
Three-layer architecture: observation, state, action. POMDP models the interactions among them.
A Simple Example of POMDP
s_0, o_0: no hacking
s_1, o_1: smart meter 1 is hacked
s_2, o_2: smart meter 2 is hacked
s_3, o_3: both smart meters are hacked
S = {s_0, s_1, s_2, s_3},  O = {o_0, o_1, o_2, o_3},  A = {a_0, a_1}
a_0: no or negligible cyberattack; a_1: check and fix the hacked smart meters
Output of POMDP: Policy Transfer Graph
Policy: a set of actions in which there is a corresponding action for each possible state.
(The slide's figure is a policy transfer graph over s_0 and s_1 with actions a_0 and a_1, whose edges are labeled by the observations o_0 versus o_1, o_2, o_3.)
Modeling The Past: Probabilistic State Transition Diagram
Learn from historical observation data.
Calibrate the mapping from observation to state.
Apply conditional probability (Bayesian rule).
(The slide's figure is a state transition diagram over s_0, s_1, s_2, s_3 whose edges are labeled with action-conditioned probabilities of the form p|a_0, q|a_1; under a_1 every edge returns to s_0 with probability 1.)
Modeling The Present
Belief state: we know the current state only in a probabilistic sense.
The probability distribution over states [0.7, 0.15, 0.05, 0.1] is a belief state, meaning a 70% chance of being in s0, 15% in s1, 5% in s2, and 10% in s3.
Predict The Future: Account for the Future
Find a Series of Actions w/ Maximum Reward in Future
Associate a reward with each action and weight it differently at different time slots.
Find the series of actions leading to the maximum reward over the future k time slots.
After an action, the new belief state is b(s′) = Σ_{s∈S} T(s′, a, s) b(s) / P(a, b).
(The slide's figure is a lookahead tree rooted at belief state b: each level branches on actions a_0 and a_1, producing successor belief states b′, b″, b‴ and rewards R_0, R_1, R_2, R_3; with discount factor 0.5, the levels are weighted ×1 for 2pm, ×0.5 for 3pm, ×0.25 for 4pm, and ×0.125 for 5pm.)
The POMDP Formulation
A POMDP problem is formulated as the tuple (S, A, T, R, Ω, O):
S: the system state space.
A: the action space.
O: the space of observations of the system state.
T(s′, a, s): the state transition function, defined as the probability that the system transits from state s to s′ when action a is taken.
Ω(o, a, s): the observation function, defined as the probability that the observation is o when the state and action are s and a, respectively.
R(s′, a, s): the reward function, defined as the reward achieved by the decision maker for taking action a at state s, which transits the system to s′.
Belief-State MDP
Using the belief state, the POMDP problem is reduced to the tuple (B, A, ρ, τ):
B: the space of belief states.
Given a new observation, the belief state is updated as
b(s′) = P(s′ | o, a, s) = Ω(o, a, s′) Σ_{s∈S} T(s′, a, s) b(s) / P(o | a, b)
ρ(a, b): the intermediate reward for taking action a in belief state b (a MATLAB sketch is given below),
ρ(a, b) = Σ_{s∈S} Σ_{s′∈S} b(s) R(s′, a, s) T(s′, a, s)    (1)
τ(b′, a, b): the transition function between belief states,
τ(b′, a, b) = P(b′ | a, b) = Σ_{o∈O} P(b′ | b, a, o) P(o | a, b)    (2)
where P(b′ | b, a, o) = 1 if (b, a, o) ⇒ b′ and 0 otherwise.
Filtering (monitoring) is used to track belief states: stochastic and statistical filtering, e.g., the Kalman filter (optimal when belief states are Gaussian, the transition function is linear, and the MDP is still discrete time), the extended Kalman filter, or the particle filter.
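A one-line MATLAB sketch of Eqn. (1), using the indexing convention introduced later in the "Denotations in MATLAB" slide; the function name is mine.

% Sketch: intermediate reward rho(a, b) from Eqn. (1).
% Indexing follows the "Denotations in MATLAB" slide: T(i,j,a) = T(s_j, a, s_i),
% R(i,j,a) = R(s_j, a, s_i), i.e., row i is the current state, column j the next state.
function rho = intermediate_reward(R, T, b, a)
    % b is a 1-by-N row-vector belief state.
    % rho(a,b) = sum_s sum_s' b(s) R(s',a,s) T(s',a,s)
    rho = b * sum(R(:,:,a) .* T(:,:,a), 2);   % same expression used later for the instant reward
end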
Probabilistic State Transition Computation
When a_1 is taken, all the hacked smart meters are fixed:
T(s_i, a_1, s_j) = 1 if s_i = s_0, and 0 otherwise    (3)
Ω(o_i, a_1, s_j) = 1 if o_i = o_0, and 0 otherwise    (4)
Can we compute T(s_i, a_0, s_j) directly? The action a_0 does not change the state, so we can obtain the state transition from the observation transition:
T(s′, a_0, s) = Σ_{o∈O} Σ_{o′∈O} T(o′, a_0, o) P(s | a_0, o) P(s′ | a_0, o′)    (5)
P(s | a_0, o) = P(o | a_0, s) P(s) / Σ_{s′∈S} P(o | a_0, s′) P(s′)    (6)
P(s_i) is approximated by P(o_i). (A sketch of this computation is given below.)
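A minimal MATLAB sketch of Eqns. (5) and (6). The inputs T_obs (empirical observation-transition matrix), Omega_a0, and p_obs, as well as the function name, are assumptions; they would be estimated from historical observation traces.

% Sketch: estimate the state transition matrix under a_0 from observation data.
% Omega_a0(s, o) = P(o | a_0, s); p_obs(s) approximates P(s) by P(o); T_obs(o, o')
% is the empirical observation-transition probability.
function T_a0 = state_transition_from_obs(T_obs, Omega_a0, p_obs)
    N = size(Omega_a0, 1);               % number of states (= number of observations here)
    % Eqn. (6): P(s | a_0, o), one column per observation o.
    P_s_given_o = zeros(N, N);
    for o = 1:N
        w = Omega_a0(:, o) .* p_obs(:);          % P(o | a_0, s) P(s)
        P_s_given_o(:, o) = w / sum(w);          % normalize over states
    end
    % Eqn. (5): T(s', a_0, s) = sum_o sum_o' T_obs(o,o') P(s | a_0, o) P(s' | a_0, o').
    T_a0 = zeros(N, N);                  % T_a0(s, s'), row = current state s
    for o = 1:N
        for op = 1:N
            T_a0 = T_a0 + T_obs(o, op) * (P_s_given_o(:, o) * P_s_given_o(:, op)');
        end
    end
end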
Reward for Future
POMDP aims to maximize the expected long-term reward E[Σ_{t=0}^{∞} r_t γ^t] (Bellman optimality), where γ is a discount factor that reduces the importance of future events and r_t is the reward achieved in step t:
V*(b, t) = max E[Σ_{t=0}^{∞} r_t γ^t] = max_{a∈A} [ ρ(a, b) + γ Σ_{b′∈B} τ(b′, a, b) V*(b′, t+1) ]
Reward for each action:
R(s_i, a_0, s_j) = −C_{L1} if S_1* ≤ i < S_2*;  −C_{L1} − C_{L2} if S_2* ≤ i;  0 otherwise    (7)
R(s_i, a_1, s_j) = −C_I − (j − i)·C_R    (8)
Eqn. (7) captures the system loss when there is an undetected cyberattack; Eqn. (8) captures the labor cost due to detection.
The slide's flowchart gives the overall detection procedure:
1. Obtain the training data.
2. Estimate the state transition probability T(s_i, a_0, s_j) for action a_0 using T(o′, a_0, o), according to Eqn. (5) and Eqn. (6).
3. Set the state transition probability T(s_i, a_1, s_j) and the observation probability Ω(o_i, a_1, s_j) for a_1 from Eqn. (3) and Eqn. (4), respectively.
4. Obtain the reward functions according to Eqn. (7) and Eqn. (8), respectively.
5. Obtain the observation o and map it to a belief state b.
6. Compute the belief state transition τ(b′, a, b) according to Eqn. (2) and the intermediate reward function ρ(a, b) according to Eqn. (1).
7. Solve the optimization problem P to get the optimal action a*.
8. If a* = a_1, apply the single event defense technique on each smart meter to check the hacked smart meters and fix them; otherwise continue with the next observation. (A sketch of this loop is given below.)
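A minimal driver-loop sketch of steps 5-8. It assumes pomdp.m has the interface described later in the slides, with an assumed argument order; get_observation and apply_single_event_defense are hypothetical placeholders.

% Minimal online-detection loop sketch (assumed interface:
%   [table, b] = pomdp(gamma, O, R, T, A, ob, oc, oa)
% where table(a) is the expected reward of action a and b is the updated belief).
function detection_loop(O, R, T, A, gamma, num_slots)
    ob = [1 0 0 0];          % initial belief: assume the system starts un-hacked
    oa = 1;                  % previous action index (a_0)
    for t = 1:num_slots
        oc = get_observation(t);                            % hypothetical observation source
        [table, b] = pomdp(gamma, O, R, T, A, ob, oc, oa);  % assumed argument order
        [~, a_star] = max(table);                           % action with maximum expected reward
        if a_star == 2                                      % a_1: check and fix hacked smart meters
            apply_single_event_defense();                   % hypothetical defense routine
        end
        ob = b;  oa = a_star;                               % carry belief and action forward
    end
end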
POMDP Implementation
pomdp.m
(The slide shows the MATLAB source of pomdp.m; the listing is not reproduced in this transcript.)
recursive.m
(The slide shows the MATLAB source of recursive.m; the listing is not reproduced in this transcript.)
Input and Output of pomdp.m
Inputs: gamma is the discount factor, O is the observation function, R is the reward function, T is the state transition function, and A is the set of available actions. ob is the previous belief state, oc is the current observation (given), and oa is the previous action.
Outputs: table is the expected reward of each action, and b is the updated belief state.
Denotations in MATLAB
T(s_j, a, s_i) → T(i,j,a)
Ω(o_j, a, s_i) → O(i,j,a)
R(s_j, a, s_i) → R(i,j,a)
Belief State Update
b(s′) = P(s′ | o, a, s) = Ω(o, a, s′) Σ_{s∈S} T(s′, a, s) b(s) / P(o | a, b)
(A sketch of this update is given below.)
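Because the slide's code listing is only an image, the following is a reconstruction sketch of the belief-state update, not the original pomdp.m; the function name is mine.

% Sketch of the belief-state update.  Indexing follows the "Denotations in
% MATLAB" slide: T(i,j,a) = T(s_j, a, s_i), O(i,j,a) = Omega(o_j, a, s_i).
function b_new = update_belief(T, O, b, a, o)
    N = numel(b);
    b_new = zeros(1, N);
    for sp = 1:N                               % s' = s_sp
        pred = 0;
        for s = 1:N
            pred = pred + T(s, sp, a) * b(s);  % sum_s T(s', a, s) b(s)
        end
        b_new(sp) = O(sp, o, a) * pred;        % multiply by Omega(o, a, s')
    end
    b_new = b_new / sum(b_new);                % normalize: divide by P(o | a, b)
end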
Recursively Compute Expected Reward
gamma is the discount factor.
Input and Output of recursive.m
a is the action taken in the last step. gamma is the discount factor and r is the cumulative discount factor. The other inputs are defined as before. The output reward is the expected reward of the subtree.
Recursively Compute Expected Reward
Associate a reward with each action and weight it differently at different time slots; find the series of actions leading to the maximum reward over the future k time slots.
For each action, the belief state is predicted by b(s′) = Σ_{s∈S} T(s′, a, s) b(s) / P(a, b).
(The slide repeats the earlier lookahead tree: discount factor 0.5, with levels weighted ×1 for 2pm, ×0.5 for 3pm, ×0.25 for 4pm, and ×0.125 for 5pm.)
Recursively Compute Expected Reward of Subtrees
b(s′) = Σ_{s∈S} T(s′, a, s) b(s) / P(a, b)
Belief State Prediction
b(s′) = Σ_{s∈S} T(s′, a, s) b(s) / P(a, b)
bx = b*T(:,:,a), or equivalently, written with explicit loops:
for i = 1:N
    bx(i) = 0;
    for j = 1:N
        bx(i) = bx(i) + b(j)*T(j,i,a);   % bx(i) = sum_j T(s_i, a, s_j) b(j)
    end
end
Recursive Call
r*recursive(R,T,A,bx,i,gamma,r*gamma): compute the expected rewards of the subsequent subtrees, scaled by the cumulative discount factor r.
b*sum(R(:,:,a).*T(:,:,a),2): compute the instant reward, which is the expectation of the reward over all possible next states.
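Since recursive.m itself appears only as an image, the following is a minimal depth-limited reconstruction built around the two expressions quoted above. The signature, the explicit depth argument, and the belief normalization are assumptions; the actual recursive.m instead threads a cumulative discount factor r.

% Sketch of a depth-limited expected-reward recursion in the spirit of recursive.m.
function reward = recursive_sketch(R, T, A, b, a, gamma, depth)
    instant = b * sum(R(:,:,a) .* T(:,:,a), 2);     % expected instant reward of action a
    if depth == 0
        reward = instant;
        return;
    end
    bx = b * T(:,:,a);                              % predicted belief after taking action a
    bx = bx / sum(bx);                              % keep it a probability distribution
    best_future = -inf;
    for i = 1:numel(A)                              % branch on every next action
        best_future = max(best_future, recursive_sketch(R, T, A, bx, i, gamma, depth-1));
    end
    reward = instant + gamma * best_future;         % V = rho + gamma * max over subtrees
end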
Detection in Smart Home Systems
Initialization
Observations obtained from the smart home simulator
Call POMDP for smart home cyberattack detection
Bottleneck of POMDP Solving
The time complexity of the POMDP formulation is exponential in the number of states. There can even be an exponential number of states, and hence an exponentially large state transition probability matrix. Speedup techniques are therefore highly necessary.
Speedup is All About Mapping
Find a series of actions with maximum reward in the belief state space; the corresponding maximum reward is called the value function V*.
The value function is piecewise linear and convex.
Cast a discrete POMDP with uncertainty into an MDP defined on belief states, which is continuous and potentially easier to approximate.
It is all about the mapping between b and V*(b).
(The slide's figure plots the value function V* over the belief state space.)
Idea #1: ADP for Function and Value Approximation
Function approximation: approximate (round) V*(b).
Compute V*(b′) on a set of selected grid points b′ in the belief state space.
Perform regression to approximate the V*(b) function for all other b: polynomial, RBF, Fourier, or EMD based regression, or an RL/NN model.
Value approximation: round b.
Get a set of samples B and precompute V*(B).
Given a request b, compute its nearest neighbor b′ among the samples and return V*(b′), as sketched below.
(The slide's figure again plots the value function V* over the belief state space.)
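A minimal sketch of the "round b" value approximation. Bset (sampled belief states), Vset (their precomputed values), the function name, and the use of Euclidean distance are all assumptions.

% Nearest-neighbor value approximation sketch.
% Bset is M-by-N (one sampled belief per row), Vset is M-by-1 with V*(Bset).
function v = nearest_value(Bset, Vset, b)
    d = sum((Bset - b).^2, 2);     % squared Euclidean distance to each sample (implicit expansion, R2016b+)
    [~, k] = min(d);               % index of the nearest precomputed belief state
    v = Vset(k);                   % return its precomputed value V*(b')
end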
Idea #2: ADP for Policy Approximation
(The slide's figure shows the lookahead tree over belief states b, b′, b″, b‴ with actions a_0 and a_1: subtrees whose reward is too small are pruned, so only promising action sequences are expanded further.)
Simulation Results
Runtime versus number of states:
# States         100     200     300     400     500
Baseline         5.16s   17.66s   —       —       —
Policy Approx.   0.10s   0.23s   0.60s   1.50s   2.37s
(Baseline runtimes beyond 200 states, and the comparison over 1-9 time slots, appear only in the slide's figure.)