Markov Decision Process (MDP)

Presentation transcript:

1 Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University

2 Policy
A policy is similar to a plan in that it is generated ahead of time. Unlike a traditional plan, it is not a sequence of actions that the agent must execute: if there are failures in execution, the agent can simply continue to follow the policy. A policy prescribes an action for every state, and it maximizes expected reward rather than just reaching a goal state.

3 Utility and Policy
Utility is computed for every state: "What is the usefulness (utility) of this state for the overall task?"
A policy is a complete mapping from states to actions: "In which state should I perform which action?" In short, policy: state → action.

4 The optimal Policy
π*(s) = argmax_a ∑_s′ T(s, a, s′) U(s′)
T(s, a, s′) = probability of reaching state s′ from state s when taking action a
U(s′) = utility of state s′
If we know the utilities, we can easily compute the optimal policy. The problem is to compute the correct utilities for all states.
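To make the extraction step concrete, here is a minimal Python sketch of computing π* from a utility table. The representation of T as a dictionary mapping (state, action) pairs to lists of (probability, next_state) pairs is an assumption made for this illustration, not something fixed by the slides.

```python
def extract_policy(states, actions, T, U):
    """Greedy policy extraction: pi*(s) = argmax_a sum_s' T(s, a, s') U(s').

    Assumes T maps (s, a) to a list of (prob, s') pairs and U maps s to its utility.
    """
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(p * U[s2] for p, s2 in T[(s, a)]),
        )
    return policy
```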

5 Finding π*
Two methods: value iteration and policy iteration.

6 Value iteration
Process: calculate the utility of each state, then use these values to select an optimal action.

7 Bellman Equation
Bellman Equation: U(s) = R(s) + γ max_a ∑_s′ T(s, a, s′) U(s′)
For example, in the 4×3 grid world (START at (1,1), terminal states +1 and −1):
U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
0.9U(1,1) + 0.1U(1,2),   (Left)
0.9U(1,1) + 0.1U(2,1),   (Down)
0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)

8 Bellman Equation properties
U(s) = R(s) + γ max_a ∑_s′ T(s, a, s′) U(s′)
There are n equations, one for each state, in n variables (the utilities).
Problem: max is not a linear operator, so these are non-linear equations.
Solution: an iterative approach.

9 Value iteration algorithm
Start with arbitrary initial values for the utilities. Update the utility of each state from its neighbours:
U_{i+1}(s) ← R(s) + γ max_a ∑_s′ T(s, a, s′) U_i(s′)
This iteration step is called a Bellman update. Repeat until the values converge.

10 Value Iteration properties
The equilibrium is a unique solution. One can prove that the value iteration process converges to it. Exact values are not needed: approximate utilities are enough to extract a good policy.

11 Convergence
Value iteration is a contraction: a function of one argument that, when applied to two inputs, produces values that are "closer together". A contraction has only one fixed point, and each application brings the value closer to that fixed point. We will not prove the last point here: the iteration converges to the correct values.

12 Value Iteration Algorithm
function VALUE_ITERATION(mdp) returns a utility function
  inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
  local variables: U, U′, vectors of utilities for states in S, initially identical to R
  repeat
    U ← U′
    for each state s in S do
      U′[s] ← R[s] + γ max_a ∑_s′ T(s, a, s′) U[s′]
  until CLOSE-ENOUGH(U, U′)
  return U
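The pseudocode can be rendered as a short runnable Python sketch. The data structures are assumptions made for illustration: T as a dict of (state, action) → list of (probability, next_state) pairs, R as a dict of rewards, and a set of terminal states whose utilities stay fixed at their rewards (as in the 2x2 example that follows). The stopping test uses the bound from the "close enough" slides, falling back to a plain threshold when γ = 1.

```python
def value_iteration(states, actions, T, R, terminals, gamma=1.0, eps=1e-4):
    # U starts identical to R, as in the pseudocode
    U = dict(R)
    while True:
        U_new = {}
        for s in states:
            if s in terminals:
                U_new[s] = R[s]   # goal states keep U = R (assumption from the example)
            else:
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions
                )
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        # termination: ||U' - U|| < eps*(1-gamma)/gamma, or a plain eps when gamma == 1
        if delta < (eps * (1 - gamma) / gamma if gamma < 1 else eps):
            return U
```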

13 Example
A small version of our main example: a 2x2 world. The agent starts in (1,1). States (2,1) and (2,2) are goal states. If a move is blocked by a wall, the agent stays in place. The rewards are written on the board: R = -0.04 for (1,1) and (1,2), R = +1 for (2,2), R = -1 for (2,1).

14 Example (cont.)
First iteration (γ = 1). Initial utilities: U(1,1) = U(1,2) = -0.04, U(2,2) = +1, U(2,1) = -1.
U′(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
0.9U(1,1) + 0.1U(1,2),   (Left)
0.9U(1,1) + 0.1U(2,1),   (Down)
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
= -0.04 + max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
0.9×(-0.04) + 0.1×(-0.04),
0.9×(-0.04) + 0.1×(-1),
0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
= -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08
U′(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),   (Up)
0.9U(1,2) + 0.1U(1,1),   (Left)
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   (Down)
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
= -0.04 + max{ 0.9×(-0.04) + 0.1×1,
0.9×(-0.04) + 0.1×(-0.04),
0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
= -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752
The goal states keep their utilities.
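As a check, the 2x2 example can be encoded directly. The coordinate convention (column, row) and the 0.8/0.1/0.1 motion model below are read off the slides; one Bellman update from the initial utilities then reproduces the numbers above.

```python
# 2x2 example: states are (col, row); (2,1) and (2,2) are the goal states.
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): +1.0}
ACTIONS = ['Up', 'Left', 'Down', 'Right']
# 0.8 intended direction, 0.1 each perpendicular; bumping into a wall means staying put.
T = {
    ((1, 1), 'Up'):    [(0.8, (1, 2)), (0.1, (1, 1)), (0.1, (2, 1))],
    ((1, 1), 'Left'):  [(0.9, (1, 1)), (0.1, (1, 2))],
    ((1, 1), 'Down'):  [(0.9, (1, 1)), (0.1, (2, 1))],
    ((1, 1), 'Right'): [(0.8, (2, 1)), (0.1, (1, 1)), (0.1, (1, 2))],
    ((1, 2), 'Up'):    [(0.9, (1, 2)), (0.1, (2, 2))],
    ((1, 2), 'Left'):  [(0.9, (1, 2)), (0.1, (1, 1))],
    ((1, 2), 'Down'):  [(0.8, (1, 1)), (0.1, (2, 2)), (0.1, (1, 2))],
    ((1, 2), 'Right'): [(0.8, (2, 2)), (0.1, (1, 2)), (0.1, (1, 1))],
}
U_old = dict(R)                      # initial utilities equal the rewards
U_new = dict(U_old)                  # goal states keep their values
for s in [(1, 1), (1, 2)]:
    U_new[s] = R[s] + max(sum(p * U_old[s2] for p, s2 in T[(s, a)])
                          for a in ACTIONS)
print(U_new[(1, 1)], U_new[(1, 2)])  # ≈ -0.08 and 0.752, as computed above
```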

15 Example (cont.)
Second iteration. Current utilities: U(1,2) = 0.752, U(1,1) = -0.08, U(2,2) = +1, U(2,1) = -1.
U′(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
0.9U(1,1) + 0.1U(1,2),   (Left)
0.9U(1,1) + 0.1U(2,1),   (Down)
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
= -0.04 + max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
0.9×(-0.08) + 0.1×0.752,
0.9×(-0.08) + 0.1×(-1),
0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
= -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536
U′(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),   (Up)
0.9U(1,2) + 0.1U(1,1),   (Left)
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   (Down)
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
= -0.04 + max{ 0.9×0.752 + 0.1×1,
0.9×0.752 + 0.1×(-0.08),
0.8×(-0.08) + 0.1×1 + 0.1×0.752,
0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
= -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272

16 Example (cont.)
Third iteration. Current utilities: U(1,2) = 0.8272, U(1,1) = 0.4536, U(2,2) = +1, U(2,1) = -1.
U′(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
0.9U(1,1) + 0.1U(1,2),   (Left)
0.9U(1,1) + 0.1U(2,1),   (Down)
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
= -0.04 + max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
0.9×0.4536 + 0.1×0.8272,
0.9×0.4536 + 0.1×(-1),
0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
= -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } ≈ 0.5676
U′(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),   (Up)
0.9U(1,2) + 0.1U(1,1),   (Left)
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   (Down)
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
= -0.04 + max{ 0.9×0.8272 + 0.1×1,
0.9×0.8272 + 0.1×0.4536,
0.8×0.4536 + 0.1×1 + 0.1×0.8272,
0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
= -0.04 + max{ 0.8445, 0.7898, 0.5456, 0.9281 } = 0.8881

17 Example (cont.)
Continue to the next iteration… Finish when the change is "close enough".
Here the change in the last iteration was small enough, so we stop with U(1,2) = 0.8881, U(1,1) = 0.5676, U(2,2) = +1, U(2,1) = -1.

18 "close enough"
We will not go deeply into this issue. There are different ways to detect convergence. One is the RMS error: the root mean square error of the utility values compared with the correct values, requiring RMS(U, U′) < ε, where ε is the maximum error allowed in the utility of any state in an iteration.

19 "close enough" (cont.)
Another criterion uses the policy loss: the difference between the expected utility obtained by following the current policy and the expected utility obtained by the optimal policy. Stop when
||U_{i+1} − U_i|| < ε(1 − γ)/γ
where ||U|| = max_s |U(s)|, ε is the maximum error allowed in the utility of any state in an iteration, and γ is the discount factor.
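A sketch of this stopping test, assuming utilities are stored as dictionaries keyed by state; the plain-threshold fallback for γ = 1 is an assumption, since the bound is undefined there.

```python
def close_enough(U, U_next, eps, gamma):
    # max-norm difference between successive utility vectors
    diff = max(abs(U_next[s] - U[s]) for s in U)
    # bound from the slide; a plain eps threshold when gamma == 1
    bound = eps * (1 - gamma) / gamma if gamma < 1 else eps
    return diff < bound
```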

20 Finding the policy
Once the true utilities have been found, search for the optimal policy:
for each s in S do
  π[s] ← argmax_a ∑_s′ T(s, a, s′) U(s′)
return π

21 Example (cont.)
Find the optimal policy, using U(1,2) = 0.8881, U(1,1) = 0.5676, U(2,2) = +1, U(2,1) = -1:
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
0.9U(1,1) + 0.1U(1,2),   // Left
0.9U(1,1) + 0.1U(2,1),   // Down
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
= argmax_a { 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
0.9×0.5676 + 0.1×0.8881,
0.9×0.5676 + 0.1×(-1),
0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
= argmax_a { 0.6672, 0.5997, 0.4108, -0.6544 } = Up
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),   // Up
0.9U(1,2) + 0.1U(1,1),   // Left
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   // Down
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
= argmax_a { 0.9×0.8881 + 0.1×1,
0.9×0.8881 + 0.1×0.5676,
0.8×0.5676 + 0.1×1 + 0.1×0.8881,
0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
= argmax_a { 0.8993, 0.8561, 0.6429, 0.9456 } = Right

22 Summary – value iteration
1. The given environment  2. Calculate utilities  3. Extract the optimal policy  4. Execute actions
(Figure: the 4×3 grid world with terminal states +1 and −1, the utilities computed by value iteration for each state, and the resulting policy arrows.)
For this world, the Bellman equations form a set of 11 equations in 11 unknowns, one per state; because of the max operator they are non-linear, which is why value iteration solves them iteratively.

23 Example – convergence
(Figure: convergence of the utility estimates over iterations, for the error allowed.)

24 Policy iteration
Pick a policy, then calculate the utility of each state given that policy (the policy-evaluation step). Update the policy at each state using the utilities of the successor states. Repeat until the policy stabilizes.

25 Policy iteration
At each step, for each state:
Policy evaluation: given the policy π_i, calculate the utility U_i of each state if π_i were to be executed.
Policy improvement: calculate a new policy π_{i+1} based on U_i:
π_{i+1}[s] ← argmax_a ∑_s′ T(s, a, s′) U_{π_i}(s′)

26 Policy iteration Algorithm
function POLICY_ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, a vector of utilities for states in S, initially identical to R
                   π, a policy, a vector indexed by state, initially random
  repeat
    U ← POLICY-EVALUATION(π, mdp)
    unchanged? ← true
    for each state s in S do
      if max_a ∑_s′ T(s, a, s′) U[s′] > ∑_s′ T(s, π[s], s′) U[s′] then
        π[s] ← argmax_a ∑_s′ T(s, a, s′) U[s′]
        unchanged? ← false
  until unchanged?
  return π
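A runnable Python sketch in the shape of this pseudocode. Here policy evaluation is approximated by repeated backups under the fixed policy rather than by solving the linear system directly (an assumption made to keep the sketch short; the slides that follow solve the equations exactly), and the same (state, action) → [(probability, next_state), …] transition representation as in the earlier sketches is assumed.

```python
import random

def policy_iteration(states, actions, T, R, terminals, gamma=1.0, eval_sweeps=50):
    # start from a random policy over the non-terminal states
    pi = {s: random.choice(actions) for s in states if s not in terminals}
    U = dict(R)
    while True:
        # policy evaluation: repeated backups under the fixed policy pi
        for _ in range(eval_sweeps):
            U = {s: R[s] if s in terminals
                 else R[s] + gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
                 for s in states}
        # policy improvement: switch to the greedy action where it is strictly better
        unchanged = True
        for s in pi:
            q = {a: sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```

On the 2x2 example defined earlier, this sketch should converge to π(1,1) = Up, π(1,2) = Right, matching the result derived on the following slides.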

27 Example
Back to our 2x2 example. The agent starts in (1,1); states (2,1) and (2,2) are goal states; if a move is blocked by a wall, the agent stays in place. The rewards are R = -0.04 for (1,1) and (1,2), R = +1 for (2,2), R = -1 for (2,1). Initial policy: Up in every state.

28 Example (cont.)
First iteration – policy evaluation, with policy π(1,1) = Up, π(1,2) = Up and γ = 1:
U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.9U(1,2) + 0.1U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
Substituting the rewards:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
U(2,1) = -1
U(2,2) = 1
Rearranged as a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0U(1,1) - 0.1U(1,2) + 0U(2,1) + 0.1U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
Solving gives U(1,1) = 0.3778, U(1,2) = 0.6 (with U(2,1) = -1, U(2,2) = +1).
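These four equations can be solved directly; below is a sketch using NumPy, where the ordering of the unknowns is an assumption of this illustration.

```python
import numpy as np

# Unknowns ordered as [U(1,1), U(1,2), U(2,1), U(2,2)]; rows follow the
# four linear equations above (policy = Up everywhere, gamma = 1).
A = np.array([
    [-0.9,  0.8,  0.1,  0.0],   # 0.04 = -0.9*U11 + 0.8*U12 + 0.1*U21
    [ 0.0, -0.1,  0.0,  0.1],   # 0.04 = -0.1*U12 + 0.1*U22
    [ 0.0,  0.0,  1.0,  0.0],   # U21 = -1
    [ 0.0,  0.0,  0.0,  1.0],   # U22 = +1
])
b = np.array([0.04, 0.04, -1.0, 1.0])
U11, U12, U21, U22 = np.linalg.solve(A, b)
print(round(U11, 4), round(U12, 4))   # 0.3778 and 0.6, matching the slide
```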

29 Example (cont.)
First iteration – policy improvement, with U(1,2) = 0.6, U(1,1) = 0.3778, U(2,2) = +1, U(2,1) = -1 and current policy π(1,1) = Up, π(1,2) = Up:
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
0.9U(1,1) + 0.1U(1,2),   // Left
0.9U(1,1) + 0.1U(2,1),   // Down
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
= argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
0.9×0.3778 + 0.1×0.6,
0.9×0.3778 + 0.1×(-1),
0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
= argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up → no update needed
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),   // Up
0.9U(1,2) + 0.1U(1,1),   // Left
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   // Down
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
= argmax_a { 0.9×0.6 + 0.1×1,
0.9×0.6 + 0.1×0.3778,
0.8×0.3778 + 0.1×1 + 0.1×0.6,
0.8×1 + 0.1×0.6 + 0.1×0.3778 }
= argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update

30 Example (cont.)
Second iteration – policy evaluation, with policy π(1,1) = Up, π(1,2) = Right:
U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)
Substituting the rewards:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1
Rearranged as a linear system:
0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
0.04 = 0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
-1 = 0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
1 = 0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)
Result: U(1,1) = 0.5413, U(1,2) = 0.7843.

31 Example (cont.)
Second iteration – policy improvement, with U(1,2) = 0.7843, U(1,1) = 0.5413, U(2,2) = +1, U(2,1) = -1 and current policy π(1,1) = Up, π(1,2) = Right:
π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   // Up
0.9U(1,1) + 0.1U(1,2),   // Left
0.9U(1,1) + 0.1U(2,1),   // Down
0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   // Right
= argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
0.9×0.5413 + 0.1×0.7843,
0.9×0.5413 + 0.1×(-1),
0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
= argmax_a { 0.5816, 0.5656, 0.3872, -0.6674 } = Up → no update needed
π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),   // Up
0.9U(1,2) + 0.1U(1,1),   // Left
0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),   // Down
0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   // Right
= argmax_a { 0.9×0.7843 + 0.1×1,
0.9×0.7843 + 0.1×0.5413,
0.8×0.5413 + 0.1×1 + 0.1×0.7843,
0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
= argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → no update needed

32 Example (cont.)
No change in the policy was found, so we finish.
The optimal policy: π(1,1) = Up, π(1,2) = Right.
Policy iteration must terminate, since the number of possible policies is finite.

33 Simplified Policy iteration
We can focus on a subset of the states at each step, and either find their utilities by simplified value iteration:
U_{i+1}(s) = R(s) + γ ∑_s′ T(s, π(s), s′) U_i(s′)
or apply policy improvement to them. This is guaranteed to converge under certain conditions on the initial policy and utility values.
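A minimal sketch of the "focus on a subset of states" idea, assuming the same dictionary-based T, R, π and U as in the earlier sketches; the number of sweeps is an arbitrary parameter of the illustration.

```python
def simplified_evaluation(subset, T, R, pi, U, gamma=1.0, sweeps=5):
    """Apply U(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') U(s')
    repeatedly, but only to the chosen subset of states; all other
    utilities are left untouched."""
    U = dict(U)
    for _ in range(sweeps):
        for s in subset:
            U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
    return U
```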

34 Policy Iteration properties
The policy-evaluation equations are linear, hence easy to solve. Convergence is fast in practice, and the resulting policy is provably optimal.

35 Value vs. Policy Iteration
Which to use? Policy iteration is more expensive per iteration, but in practice it usually requires fewer iterations.

36 Reference
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. A Bradford Book, The MIT Press, Cambridge, Massachusetts; London, England.

