1
MDP Reinforcement Learning
2
Markov Decision Process
[Diagram: a chain of states linked by the questions "Should you give money to charity?" and "Would you contribute?", ending in a $ (reward) state]
3
Charity MDP
State space: 3 states
Actions: "Should you give money to charity?", "Would you contribute?"
Observations: knowledge of the current state
Rewards: in the final state, a positive reward proportional to the amount of money gathered
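To make the later algorithms concrete, here is a minimal sketch of how this MDP might be encoded as lookup tables. The state names, the transition table, and the reward value of 10 are illustrative assumptions, not values given on the slides.

# Hypothetical encoding of the charity MDP (names and numbers are assumptions).
STATES = ["start", "asked_should_give", "done"]
ACTIONS = ["should_give", "would_contribute"]

# f(j, a): deterministic next-state table; missing pairs mean the action is unavailable.
TRANSITIONS = {
    ("start", "should_give"): "asked_should_give",
    ("start", "would_contribute"): "done",
    ("asked_should_give", "would_contribute"): "done",
}

# R(j): positive reward only in the final state, standing in for the money gathered.
REWARDS = {"start": 0.0, "asked_should_give": 0.0, "done": 10.0}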
4
So how can we raise the most money (i.e., maximize the reward)? In other words, what is the best policy?
Policy: the optimal action for each state
5
Lecture Outline 1. Computing the Value Function 2. Finding the Optimal Policy 3. Computing the Value Function in an Online Environment
6
Useful definitions
Define π to be a policy
π(j): the action to take in state j
R(j): the reward received in state j
f(j, a): the next state, starting from state j and performing action a
7
Computing the Value Function
When the rewards are known, we can compute the value function for a particular policy π.
V^π(j), the value function: the expected reward for being in state j and following policy π
8
Calculating V^π(j)
1. Set V_0(j) = 0 for all j
2. For i = 1 to Max_i: V_i(j) = R(j) + γ V_{i-1}(f(j, π(j)))
γ is the discount rate; it measures how much future rewards propagate back to earlier states.
The formula above depends on the rewards being known.
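A sketch of this loop in Python, assuming the MDP is described by the R(j) and f(j, a) tables from the earlier sketch, and that a policy is a dict mapping each state to an action:

def evaluate_policy(policy, rewards, transitions, gamma=0.5, max_i=3):
    """Iteratively compute V_i(j) = R(j) + gamma * V_{i-1}(f(j, policy(j)))."""
    V = {j: 0.0 for j in rewards}                 # step 1: V_0(j) = 0 for all j
    for _ in range(max_i):                        # step 2: i = 1 .. Max_i
        V_new = {}
        for j in rewards:
            nxt = transitions.get((j, policy.get(j)))
            future = V[nxt] if nxt is not None else 0.0   # the final state has no successor
            V_new[j] = rewards[j] + gamma * future
        V = V_new
    return V

# e.g. evaluate_policy({"start": "should_give", "asked_should_give": "would_contribute"},
#                      REWARDS, TRANSITIONS)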
9
Value Function for the Charity MDP
Fix γ at 0.5 and consider two policies: one that asks both questions, and one that cuts to the chase. What is V_3 if:
1. the reward at the final state is constant (everyone gives the same amount of money)?
2. the reward is 10 times higher if you ask whether one should give to charity?
10
Given the value function, how can we find the policy that maximizes the reward?
11
Policy Iteration
1. Set π_0 to be an arbitrary policy
2. Set i to 0
3. Compute V^{π_i}(j) for all states j
4. Compute π_{i+1}(j) = argmax_a V^{π_i}(f(j, a))
5. If π_{i+1} = π_i, stop; otherwise increment i and go back to step 3
What would this give for the charity MDP in the two cases above?
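A sketch of these steps, reusing the evaluate_policy function and the table-style MDP from the earlier sketches (both are illustrative assumptions):

def policy_iteration(states, actions, rewards, transitions, gamma=0.5, max_i=3):
    """Alternate policy evaluation (step 3) and greedy improvement (step 4) until stable."""
    # Step 1: pi_0 is an arbitrary policy; here, the first legal action in each state.
    policy = {}
    for j in states:
        legal = [a for a in actions if (j, a) in transitions]
        if legal:
            policy[j] = legal[0]
    while True:
        V = evaluate_policy(policy, rewards, transitions, gamma, max_i)   # step 3
        improved = {}
        for j in policy:                                                  # step 4
            legal = [a for a in actions if (j, a) in transitions]
            improved[j] = max(legal, key=lambda a: V[transitions[(j, a)]])
        if improved == policy:                                            # step 5
            return policy
        policy = improved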
12
Lecture Outline 1. Computing the Value Function 2. Finding the Optimal Policy 3. Computing the Value Function in an Online Environment
13
MDP Learning
So, when the rewards are known, we can compute the optimal policy using policy iteration. But what happens when we don't know the rewards?
14
Lecture Outline 1. Computing the Value Function 2. Finding the Optimal Policy 3. Computing the Value Function in an Online Environment
15
Deterministic vs. Stochastic Update
Deterministic: V_i(j) = R(j) + γ V_{i-1}(f(j, π(j)))
Stochastic: V(n) = (1 - α) V(n) + α [r + γ V(n')]
The difference is that the stochastic version averages over all visits to the state.
16
MDP Extensions
Probabilistic state transitions: how should you calculate the value function for the first state now?
[Diagram: asking "Would you like to contribute?" leads to a Mad state (reward -10) or a Happy state (reward +10), with transition probabilities 0.8 and 0.2]
17
Probabilistic Transitions
The online computation strategy works the same even when the state transitions are unknown: it does not require knowing what the transitions are.
18
Online V^π(j) Computation
1. For each state j, initialize V(j) = 0
2. Set n = the initial state
3. Set r = the reward in state n
4. Let n' = f(n, π(n))
5. V(n) = (1 - α) V(n) + α [r + γ V(n')]
6. Set n = n' and go back to step 3
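A sketch of this online loop, again using the illustrative tables as a stand-in for an environment that is observed one step at a time (alpha, gamma, the episode count, and the "start" state name are assumed values):

def online_value_computation(policy, rewards, transitions, episodes=100,
                             alpha=0.1, gamma=0.5):
    """Update V(n) from observed rewards one transition at a time."""
    V = {j: 0.0 for j in rewards}                    # step 1
    for _ in range(episodes):
        n = "start"                                  # step 2: assumed initial state
        while (n, policy.get(n)) in transitions:     # stop once the final state is reached
            r = rewards[n]                           # step 3: observed reward in state n
            n_next = transitions[(n, policy[n])]     # step 4: n' = f(n, pi(n))
            # step 5: V(n) = (1 - alpha) V(n) + alpha [r + gamma V(n')]
            V[n] = (1 - alpha) * V[n] + alpha * (r + gamma * V[n_next])
            n = n_next                               # step 6
    return V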
19
1-step Q-learning
1. Initialize Q(n, a) arbitrarily
2. Select π as the policy
3. Set n = the initial state, r = the reward, a = π(n)
4. Q(n, a) = (1 - α) Q(n, a) + α [r + γ max_{a'} Q(n', a')]
5. Set n = n' and go back to step 3
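A sketch of 1-step Q-learning over the same illustrative tables; the epsilon-greedy action choice stands in for "select π as the policy", and the numeric parameters and "start" state name are assumptions:

import random

def q_learning(states, actions, rewards, transitions, episodes=200,
               alpha=0.1, gamma=0.5, epsilon=0.1):
    """Learn Q(n, a) from sampled transitions, without needing the reward table up front."""
    Q = {(j, a): 0.0 for j in states for a in actions}        # step 1: initialize Q (here to zero)
    for _ in range(episodes):
        n = "start"                                           # assumed initial state
        legal = [a for a in actions if (n, a) in transitions]
        while legal:
            # steps 2-3: pick an action (epsilon-greedy exploration as the behaviour policy)
            if random.random() < epsilon:
                a = random.choice(legal)
            else:
                a = max(legal, key=lambda x: Q[(n, x)])
            r = rewards[n]                                    # observed reward
            n_next = transitions[(n, a)]
            next_legal = [x for x in actions if (n_next, x) in transitions]
            best_next = max((Q[(n_next, x)] for x in next_legal), default=0.0)
            # step 4: Q(n,a) = (1 - alpha) Q(n,a) + alpha [r + gamma max_a' Q(n',a')]
            Q[(n, a)] = (1 - alpha) * Q[(n, a)] + alpha * (r + gamma * best_next)
            n, legal = n_next, next_legal                     # step 5
    return Q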