1
Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University
2
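Example
The agent is situated in a 4x3 grid environment.
At each step it has to choose an action. Possible actions: Up, Down, Left, Right.
The run terminates when the agent reaches a goal state.
There may be more than one goal state, and each goal state has a weight (here +1 and -1).
The environment is fully observable: the agent always knows where it is.
[Figure: 4x3 grid with the START cell and the goal cells +1 and -1]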
3
Example (cont.)
In a deterministic environment the solution is easy, for example:
[Up, Up, Right, Right, Right]
[Right, Right, Up, Up, Right]
[Figure: two 4x3 grids, one per route from START to +1]
4
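Example (cont.)
But actions are unreliable:
The intended effect is achieved only with probability 0.8.
With probability 0.1 the agent moves at a right angle to the right of the intended direction.
With probability 0.1 the agent moves at a right angle to the left of the intended direction.
If the agent bumps into a wall, it stays in place.
[Figure: intended direction 0.8, each sideways direction 0.1]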
5
Example (cont.)
Executing the sequence [Up, Up, Right, Right, Right]:
Chance of following the desired path: 0.8 * 0.8 * 0.8 * 0.8 * 0.8 = 0.8^5 = 0.32768
Chance of accidentally reaching the goal along the other path: 0.1 * 0.1 * 0.1 * 0.1 * 0.8 = 0.1^4 * 0.8^1 = 0.00008
Total chance of reaching the desired goal: only 0.32776
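The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the calculation; the constant and variable names are illustrative and do not come from the slides.

```python
# Probability of reaching +1 when blindly executing
# [Up, Up, Right, Right, Right] in the unreliable 4x3 world.
P_INTENDED = 0.8  # an action has its intended effect
P_SLIP = 0.1      # an action slips to one particular side

# Desired route: all five actions have their intended effect.
p_desired = P_INTENDED ** 5

# Accidental route around the other side: four slips, then one intended move.
p_accident = P_SLIP ** 4 * P_INTENDED

p_total = p_desired + p_accident
print(f"{p_desired:.5f} {p_accident:.5f} {p_total:.5f}")
# prints: 0.32768 0.00008 0.32776
```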
6
Transition Model
Specification of the outcome probabilities of each action in each possible state.
T(s, a, s’) = probability of reaching s’ if action a is done in state s.
Can be described as a 3-dimensional table.
Markov assumption: the probability of the next state depends only on the current state (and the action taken there), not on earlier states.
7
Reward
Positive or negative reward that the agent receives in state s.
Sometimes the reward is associated only with the state: R(S)
Sometimes the reward is associated with state and action: R(S, A)
Sometimes the reward is associated with state, action and destination state: R(S, A, J)
In our example:
R([4,3]) = +1
R([4,2]) = -1
R(s) = -0.04 for s ≠ [4,3] and s ≠ [4,2]
R(s) can be seen as the agent's desire to stay in the game (negative here, so the agent wants to finish quickly).
8
Environment history
The decision problem is sequential.
The utility function depends on a sequence of states (the environment history).
Here the utility function is the sum of the rewards received.
In our example: if the agent reaches (4,3) after 10 steps, then total utility = 1 + 10 * (-0.04) = 0.6
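A minimal sketch of the additive utility, using the rewards of the running example. The particular 10-step history below is hypothetical and chosen only so that the agent reaches (4,3) on its tenth step.

```python
STEP_REWARD = -0.04

def reward(state):
    """Reward of the running example: +1 and -1 at the goals, -0.04 elsewhere."""
    if state == (4, 3):
        return 1.0
    if state == (4, 2):
        return -1.0
    return STEP_REWARD

def utility(history):
    """Additive utility: the sum of the rewards of all visited states."""
    return sum(reward(s) for s in history)

# Hypothetical history: ten non-goal states, then the +1 goal.
history = [(1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (1, 3),
           (2, 3), (2, 3), (3, 3), (3, 3), (4, 3)]
print(round(utility(history), 2))  # prints: 0.6
```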
9
Markov Decision Process (MDP)
The specification of a sequential decision problem for a fully observable environment that satisfies the Markov assumption and yields additive rewards.
Defined as a tuple <S, A, P, R>:
S: states
A: actions
P: transition function, a table P(s’ | s, a) giving the probability of s’ given action a in state s
R: reward, R(s) = cost or reward of being in state s
10
In our example…
S: state of the agent on the grid. Note that a cell is denoted by (x,y).
A: actions of the agent, i.e. Up, Down, Left, Right
P: transition function, a table P(s’ | s, a) giving the probability of s’ given action a in state s
E.g., P((4,3) | (3,3), Up) = 0.1
E.g., P((3,3) | (3,2), Up) = 0.8
R: reward
R(3,3) = -0.04
R(4,3) = +1
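The transition table for the grid can be generated rather than written out by hand. The sketch below rests on assumptions the slides do not state: a blocked cell at (2,2), as in the usual textbook version of this grid, x running from 1 to 4 and y from 1 to 3, and no special handling of goal states. All names are illustrative.

```python
# Assumptions (not in the slides): blocked cell at (2, 2); x in 1..4, y in 1..3.
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# For each intended action, the two directions at right angles to it.
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Right", "Left"),
            "Left": ("Down", "Up"), "Right": ("Up", "Down")}

def move(state, direction):
    """Deterministic move; bumping into a wall or blocked cell stays in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if inside and nxt not in BLOCKED else state

def transition(state, action):
    """Return {s': P(s' | state, action)} for the 0.8 / 0.1 / 0.1 model."""
    first, second = SIDEWAYS[action]
    probs = {}
    for direction, p in [(action, 0.8), (first, 0.1), (second, 0.1)]:
        nxt = move(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((3, 3), "Up"))
# prints: {(3, 3): 0.8, (2, 3): 0.1, (4, 3): 0.1}
```

The printed entry for (4,3) matches the slide's P((4,3) | (3,3), Up) = 0.1. The 0.8 mass stays on (3,3) because Up from the top row bumps into the wall.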
11
Solution to MDP
In deterministic processes, the solution is a plan.
In observable stochastic processes, the solution is a policy.
Policy: a mapping from S to A.
A policy’s quality is measured by its expected utility (EU).
Notation:
π ≡ a policy
π(s) ≡ the recommended action in state s
π* ≡ the optimal policy (maximum expected utility)
12
Following a Policy
Procedure:
1. Determine the current state s
2. Execute action π(s)
3. Repeat 1-2
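The procedure can be sketched as a short loop. Everything here is illustrative: `follow_policy`, the `step` callback standing in for the environment, and the small one-dimensional corridor used as a demo are assumptions, and the demo environment is deterministic only so that the output is reproducible.

```python
def follow_policy(policy, step, state, is_terminal, max_steps=100):
    """Observe the state, execute policy[state], repeat until a terminal state."""
    history = [state]
    for _ in range(max_steps):
        if is_terminal(state):
            break
        action = policy[state]       # 2. execute the recommended action pi(s)
        state = step(state, action)  # the environment returns the next state
        history.append(state)        # 1. the new state becomes the current state
    return history

# Hypothetical demo: a corridor of cells 0..3 with the goal at cell 3.
policy = {0: "Right", 1: "Right", 2: "Right"}

def step(state, action):
    return state + 1 if action == "Right" else state - 1

history = follow_policy(policy, step, 0, lambda s: s == 3)
print(history)  # prints: [0, 1, 2, 3]
```

In a stochastic environment `step` would sample the next state from the transition model, which is why the agent has to look up its state again on every iteration instead of following a fixed plan.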
13
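Optimal policy for our example
R(4,3) = +1, R(4,2) = -1, R(s) = -0.04 for every other state.
The step cost of 0.04 is small relative to the -1 penalty, so the agent prefers to go the long way around rather than risk falling into -1.
[Figure: 4x3 grid showing the optimal policy, with START, +1 and -1]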
14
Optimal Policies in our example
[Figure: two 4x3 grids with START, +1 and -1, one optimal policy per reward range]
R(s) < … : life is so painful that the agent runs to the nearest exit.
… < R(s) < … : life is unpleasant; the agent tries to get to +1 and is willing to risk falling into -1.
15
Optimal Policies in our example
[Figure: two 4x3 grids with START, +1 and -1, one optimal policy per reward range]
… < R(s) < 0: life is nice; the agent takes no risks at all.
R(s) > 0: the agent wants to stay in the game.
16
Decision Epoch
Finite horizon
After a fixed time N the game is over; nothing that happens afterwards matters.
U_h([s_0, s_1, …, s_{N+k}]) = U_h([s_0, s_1, …, s_N]) for all k > 0
The optimal action in a given state might change over time.
Infinite horizon
There is no fixed deadline.
There is no reason to behave differently in the same state at different times.
We will discuss this case.
17
Example
If the agent is in (3,1), what will it do?
[Figure: 4x3 grid with +1, -1 and the agent's cell marked]
With a finite horizon where N = 3, it will go Up.
With an infinite horizon, it depends on the other parameters.
18
Assigning Utility to Sequences
Additive rewards
U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …
In our last example we used this method.
Discounted rewards
U_h([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + …
γ is the discount factor, 0 < γ < 1.
γ can be seen as the chance that the world continues to exist after each step.
We will assume discounted rewards.
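Both formulas fit in one small function, since the additive case is what the discounted sum gives at γ = 1. This is a sketch; the function name and the three-step reward sequence are illustrative.

```python
def discounted_utility(rewards, gamma):
    """R(s0) + gamma * R(s1) + gamma^2 * R(s2) + ... for a finite reward list."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical sequence: two ordinary steps, then the +1 goal.
rewards = [-0.04, -0.04, 1.0]
print(round(discounted_utility(rewards, 1.0), 3))  # additive: prints 0.92
print(round(discounted_utility(rewards, 0.9), 3))  # discounted: prints 0.734
```

With γ = 0.9 the final +1 is worth only 0.81 because it arrives two steps late, so later rewards count for less than earlier ones.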
19
Additive rewards: problem and solutions
Problem
With an infinite horizon, if the agent never reaches a terminal state, the utility will be infinite.
We can’t compare +∞ with +∞.
Solutions
With discounted rewards the utility of an infinite sequence is finite:
U_h([s_0, s_1, s_2, …]) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ Σ_{t=0}^{∞} γ^t R_max = R_max / (1 - γ), where R_max is an upper bound on the rewards
Proper policy: a policy that is guaranteed to reach a terminal state.
Compare infinite sequences in terms of average reward per time step.
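A quick numerical check of why discounting keeps the utility finite. The sketch assumes every reward is at most some bound, called `r_max` here (a name introduced for this example), so the discounted sum is a geometric series that can never exceed r_max / (1 - γ).

```python
def discounted_bound(r_max, gamma):
    """Upper bound r_max / (1 - gamma) on any discounted sum, for 0 <= gamma < 1."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return r_max / (1 - gamma)

r_max, gamma = 1.0, 0.9
bound = discounted_bound(r_max, gamma)

# Worst case: the agent receives r_max at every step of a very long run.
partial = sum(gamma ** t * r_max for t in range(1000))

print(round(bound, 6), round(partial, 6))  # prints: 10.0 10.0
```

With γ = 1 the same sum would grow without limit as the run gets longer, which is the problem with additive rewards described on this slide.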
20
Conclusion
Discounted reward is the best solution for the infinite horizon.
A policy π represents a group of possible state sequences, each occurring with a specific probability.
The value of a policy is the expected sum of discounted rewards over all possible state sequences.
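The value of a policy can be sketched as a probability-weighted average over the sequences it may produce. The two outcomes and their probabilities below are made up for illustration, and `policy_value` is a hypothetical helper, not something defined in the slides.

```python
def policy_value(outcomes, gamma):
    """Expected discounted reward sum over (probability, reward_sequence) pairs."""
    return sum(
        p * sum(gamma ** t * r for t, r in enumerate(rewards))
        for p, rewards in outcomes
    )

# Hypothetical policy with two possible outcomes:
# 80% of the time one step then +1, 20% of the time one step then -1.
outcomes = [(0.8, [-0.04, 1.0]), (0.2, [-0.04, -1.0])]
print(round(policy_value(outcomes, 1.0), 2))  # prints: 0.56
```

In a real MDP the set of sequences is usually too large to enumerate, so the expectation is computed from the transition model instead.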