1
Artificial Intelligence: Representation and Problem Solving
Sequential Decision Making (3): Passive Reinforcement Learning / 681
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Wean Hall 4126
12/8/2018
2
Recap
MDP: (𝑆, 𝐴, 𝑃, 𝑅)
Policy 𝜋(𝑠): 𝑆 → 𝐴 if deterministic policy
Find optimal policy: value iteration or policy iteration
You know exactly how the world works!
Fei Fang 12/8/2018
3
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
4
What is Reinforcement Learning
Reinforcement Learning: Learn an optimal policy for the environment without having a complete model
Learn through trial and error
We don't have a simulator! We have to actually learn what happens when we take an action in a state
Fei Fang 12/8/2018
5
What is Reinforcement Learning
Reinforcement Learning: Learn an optimal policy for the environment without having a complete model
MDP: (𝑆, 𝐴, 𝑃, 𝑅); Goal: Find policy 𝜋
RL: Don't know 𝑃 or 𝑅 (or they are just hard to enumerate); Goal: Find policy 𝜋
Fei Fang 12/8/2018
6
What is Reinforcement Learning
The agent can "sense" the environment (it knows the state) and has goals
Learning from interaction with the environment
Trial-and-error search
(Delayed) rewards (advisory signals ≠ error signals)
What actions to take
Exploration-exploitation dilemma
Fei Fang 12/8/2018
7
RL Applications / Examples
Bipedal Robot Learn to Walk The following video shows research highlights from Toddler, a simple 3D dynamic biped that was able to quickly and reliably learn to walk. The beginning of the video demonstrates the robot's ability to walk passively downhill on a treadmill with the computer turned off. The robot was then placed on flat terrain with the computer switched on, and tasked with acquiring the same gait without assistance from gravity, but rather by learning a feedback controller. This learning occurred in less than 20 minutes, using only trials implemented on the real robot (no simulations). The learning algorithm continues to quickly adapt as the robot walks over different terrain. Sebastian Seung Lab at MIT. Fei Fang 12/8/2018
8
RL Applications / Examples
Helicopter Manoeuvres Fei Fang 12/8/2018
9
RL Applications / Examples
Learn to Play Atari Games Fei Fang 12/8/2018
10
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
11
Passive RL
Passive Learning: The agent's policy 𝜋 is fixed; learn the state value 𝑈(𝑠) without knowing the transition model 𝑃(𝑠′|𝑠,𝑎) or the reward function 𝑅(𝑠) ahead of time
Recall that in policy iteration, we have already learned how to evaluate a policy, i.e., compute 𝑈(𝑠) given 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠)
Fei Fang 12/8/2018
12
Passive Reinforcement Learning
Two Approaches
Model-based: Build a model, then evaluate
Model-free: Directly evaluate without building a model
[Diagram: the agent follows the given policy 𝜋, sending actions to the environment and observing states and rewards; the transition model and reward model are unknown ("P(s′|s,a)=?, R(s)=?, …")]
Remember, we know 𝑆 and 𝐴, just not 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠).
Fei Fang 12/8/2018
13
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
14
Model-Based Passive Reinforcement Learning
Follow policy 𝜋, perform many trials/experiments to get sample sequences
Estimate the MDP model parameters 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠) from the observed transitions and rewards
If the sets of states and actions are finite, we can simply count and average the counts
Use the estimated MDP to evaluate policy 𝜋
Fei Fang 12/8/2018
15
Model-Based Passive Reinforcement Learning
You are in an environment (you may not know the details) You are given a policy 𝜋 Fei Fang 12/8/2018
16
Example Let’s start the trials! Environment Start at (1,1) Policy
Fei Fang 12/8/2018
17
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
18
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) Policy Fei Fang 12/8/2018
19
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
20
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) Tried but didn’t move Policy Fei Fang 12/8/2018
21
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Try again Policy Fei Fang 12/8/2018
22
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) Policy Fei Fang 12/8/2018
23
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
24
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) Fei Fang 12/8/2018
25
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Fei Fang 12/8/2018
26
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Fei Fang 12/8/2018
27
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tright) = 1/2 Fei Fang 12/8/2018
28
Example Let’s start the trials! Environment We can run more trials!
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01 Policy Fei Fang 12/8/2018
29
Example Let’s start the trials! Environment We can run more trials!
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 2: (1,1) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 3: (1,1) → (2,1) → (3,1) → (3,2) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Policy
Estimate 𝑃((2,1) | (2,1), tright) = 1/4
Estimate 𝑅((2,1)) = −0.01
Fei Fang 12/8/2018
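To see where these numbers come from (a quick count over just the three trials listed): tright is taken in state (2,1) four times (twice in Trial 1, once each in Trials 2 and 3), and exactly one of those attempts leaves the agent in (2,1), so
$$\hat{P}\big((2,1)\mid(2,1),\mathrm{tright}\big) = \frac{N\big((2,1),\mathrm{tright},(2,1)\big)}{N\big((2,1),\mathrm{tright}\big)} = \frac{1}{4}, \qquad \hat{R}\big((2,1)\big) = \frac{4\times(-0.01)}{4} = -0.01.$$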
30
Model-Based Passive Reinforcement Learning
Empirical estimates of the transition probability and reward:
$$\hat{P}(s'\mid s,a) = \frac{N(s_i=s,\ a_i=a,\ s_{i+1}=s')}{N(s_i=s,\ a_i=a)}, \qquad \hat{R}(s) = \frac{\sum_{i:\, s_i=s} r_i}{N(s_i=s)}$$
Do the trials give us all the parameters of the MDP?
Fei Fang 12/8/2018
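As a concrete illustration, here is a minimal Python sketch of this counting estimator (an illustrative implementation, not the course's code; it assumes each trial has been logged as a list of (s, a, r, s_next) tuples, with a = None and s_next = None at the terminal step):

```python
from collections import defaultdict

def estimate_model(trials):
    """Estimate P(s'|s,a) and R(s) by counting over observed transitions.

    trials: list of trajectories, each a list of (s, a, r, s_next) tuples,
            with a = None and s_next = None at the terminal step.
    """
    sa_counts = defaultdict(int)      # N(s_i = s, a_i = a)
    sas_counts = defaultdict(int)     # N(s_i = s, a_i = a, s_{i+1} = s')
    reward_sum = defaultdict(float)   # sum of rewards observed in state s
    visit_counts = defaultdict(int)   # N(s_i = s)

    for trajectory in trials:
        for (s, a, r, s_next) in trajectory:
            reward_sum[s] += r
            visit_counts[s] += 1
            if a is not None:         # skip the terminal step
                sa_counts[(s, a)] += 1
                sas_counts[(s, a, s_next)] += 1

    P_hat = {(s, a, s_next): n / sa_counts[(s, a)]
             for (s, a, s_next), n in sas_counts.items()}
    R_hat = {s: reward_sum[s] / visit_counts[s] for s in visit_counts}
    return P_hat, R_hat
```

Fed the three trials above (under this logging convention), it would reproduce the estimates 𝑃̂((2,1)|(2,1), tright) = 1/4 and 𝑅̂((2,1)) = −0.01.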
31
Model-Based Passive Reinforcement Learning
Environment Recall the first trial Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tup) = ? Never seen Fei Fang 12/8/2018
32
Model-Based Passive Reinforcement Learning
Environment Recall all trials
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 2: (1,1) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Policy
Trial 3: (1,1) → (2,1) → (3,1) → (3,2) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tup) = ? Never taken due to 𝜋
Fei Fang 12/8/2018
33
Model-Based Passive Reinforcement Learning
Given estimates of the transition model 𝑃 and reward model 𝑅, we can do MDP policy evaluation to compute the value of our policy
(Exact) Solve the linear equations $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
(Estimate) Run a few iterations of the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
Do missing values matter for computing the policy value?
Fei Fang 12/8/2018
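Below is one possible sketch of the "estimate" route in Python (illustrative, not the course's code), reusing the P_hat / R_hat dictionaries produced by the counting sketch earlier; policy is assumed to be a dict mapping each non-terminal state to its action. Note how states and transitions that were never observed simply contribute nothing, which is exactly where the "missing values" question bites.

```python
def evaluate_policy(P_hat, R_hat, policy, gamma, states, n_iters=100):
    """Approximate policy evaluation with the estimated model:
    U_{i+1}(s) <- R_hat(s) + gamma * sum_{s'} P_hat(s'|s, pi(s)) * U_i(s').

    States that were never visited keep R_hat = 0, and transitions that
    were never observed are implicitly treated as probability 0.
    """
    U = {s: 0.0 for s in states}
    for _ in range(n_iters):
        U_new = {}
        for s in states:
            a = policy.get(s)            # None for terminal states
            if a is None:
                U_new[s] = R_hat.get(s, 0.0)
                continue
            expected_next = sum(p * U.get(s_next, 0.0)
                                for (s0, a0, s_next), p in P_hat.items()
                                if s0 == s and a0 == a)
            U_new[s] = R_hat.get(s, 0.0) + gamma * expected_next
        U = U_new
    return U
```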
34
Exercise Environment If you complete 10 trials based on policy 𝜋, are the computed state values under policy 𝜋, i.e., 𝑈^𝜋(𝑠), likely to be correct? Policy Fei Fang 12/8/2018
35
Model-Based Passive Reinforcement Learning
Advantage: Makes good use of the data
Disadvantage: Requires building the actual MDP model; intractable if the state space is too large
Fei Fang 12/8/2018
36
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
37
Passive Reinforcement Learning
Two Approaches
Model-based: Build a model, then evaluate
Model-free: Directly evaluate without building a model
[Diagram: the agent follows the given policy 𝜋, observing states, actions, and rewards; the goal is to estimate 𝑈^𝜋(𝑠) directly, without learning the transition or reward model]
Remember, we know 𝑆 and 𝐴, just not 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠).
Fei Fang 12/8/2018
38
Model-Free Passive Reinforcement Learning
Environment Recall the first trial 𝛾=1 Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑈 𝜋 (𝑠): 𝑈 𝜋 (4,2) =? 𝑈 𝜋 (3,2) =? Fei Fang 12/8/2018
39
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
40
Direct Utility Estimation
Reward-to-go: Expected total reward from that state onward
When a trial hits a state, view it as a sample of the total reward from that state onward
Compute the average as the reward-to-go:
$$\hat{U}(s) = \frac{\sum_{i:\, s_i = s} (\text{discounted reward from } s_i \text{ onward})}{N(s_i = s)}$$
Fei Fang 12/8/2018
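A minimal Python sketch of this direct utility estimation (every-visit averaging of the discounted reward-to-go; an illustrative implementation assuming each trial is logged as a list of (state, reward) pairs ending at the terminal state):

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma):
    """Every-visit estimate of U^pi(s): average the discounted
    reward-to-go over all visits to s.

    trials: list of trajectories, each a list of (state, reward) pairs
            ending at the terminal state.
    """
    total_return = defaultdict(float)
    visit_count = defaultdict(int)

    for trajectory in trials:
        G = 0.0                      # reward-to-go, built backwards
        returns = []
        for (s, r) in reversed(trajectory):
            G = r + gamma * G        # G_t = r_t + gamma * G_{t+1}
            returns.append((s, G))
        for (s, G) in returns:
            total_return[s] += G
            visit_count[s] += 1

    return {s: total_return[s] / visit_count[s] for s in visit_count}
```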
41
Example Recall the first trial Environment 𝛾=1 Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑈 𝜋 ((2,1)): Fei Fang 12/8/2018
42
Example Environment Estimate 𝑈 𝜋 (2,1) given 𝛾=0 or 0.5? Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Fei Fang 12/8/2018
43
Direct Utility Estimation
Disadvantages:
Need to wait until you reach a terminal state
Estimates 𝑈(𝑠) and 𝑈(𝑠′) separately, ignoring the relation $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
Converges very slowly
Fei Fang 12/8/2018
44
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
45
Temporal-Difference Learning (TD Learning)
Key ideas:
Do not wait until the trial terminates; update 𝑈(𝑠) after each state transition
Use a running average, i.e., balance the previous estimate and the latest sample
Naturally, a more likely outcome 𝑠′ will contribute to the update more often
Recall how we estimate state values given the model and policy: using the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$, possibly updating only a subset of states
Fei Fang 12/8/2018
46
Temporal-Difference Learning
Suppose we are in one step of a trial: the agent is in state 𝑠, takes action 𝑎 = 𝜋(𝑠), gets reward 𝑟, and ends up in state 𝑠′
Can we update just 𝑈(𝑠) using the simplified Bellman update? No, because we don't know 𝑃(𝑠″|𝑠,𝑎)
However, 𝑠′ is sampled from the distribution 𝑃(𝑠″|𝑠,𝑎). Can we directly estimate 𝑈_{i+1}(𝑠) from this single sample? Not good: a single sample can have very high variance, so 𝑈_{i+1}(𝑠) could be very different from 𝑈_i(𝑠)
TD Learning: Make use of 𝑈_i(𝑠) to "smooth" the update (a running average)
Fei Fang 12/8/2018
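Written out, the running-average update introduced on the next slide is equivalent to nudging the old estimate toward the one-sample target by the "TD error":
$$\hat{U}(s) \leftarrow (1-\alpha)\,\hat{U}(s) + \alpha\big(r + \gamma\,\hat{U}(s')\big) \;=\; \hat{U}(s) + \alpha\big(r + \gamma\,\hat{U}(s') - \hat{U}(s)\big).$$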
47
Exponential Moving Average
$$\hat{U}_{i+1}(s) \leftarrow (1-\alpha)\,\hat{U}_i(s) + \alpha\big(r + \gamma\,\hat{U}_i(s')\big)$$
Let $x_0 = 0$ and $x_{i+1} = (1-\alpha)\,x_i + \alpha\,\hat{x}_{i+1}$, where $\hat{x}_{i+1}$ is the $(i+1)$-th sample.
Makes more recent samples more important. Forgets about the past.
Fei Fang 12/8/2018
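Unrolling the recursion (a short derivation to make the "forgetting" explicit): with $x_0 = 0$,
$$x_n = \alpha \sum_{i=1}^{n} (1-\alpha)^{\,n-i}\,\hat{x}_i,$$
so the most recent sample gets weight $\alpha$, the one before it $\alpha(1-\alpha)$, and so on, with older samples discounted geometrically.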
48
Temporal-Difference Learning
TD Algorithm
  Initialize the estimate of 𝑈(𝑠) as Û(𝑠) ← 0, ∀𝑠
  Repeat (for each episode/trial)
    Initialize state 𝑠
    Repeat (for each step of the episode)
      Take action 𝑎 = 𝜋(𝑠)
      Observe reward 𝑟 and next state 𝑠′
      Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
      𝑠 ← 𝑠′
    Until 𝑠 is a terminal state
    Observe reward 𝑟 of the terminal state
    Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼·𝑟
  Until 𝐾 episodes/trials have been run
  Return Û(𝑠)
𝛼 ∈ [0,1] is a fixed parameter determining how much weight we give to the old value Û(𝑠) versus the new sample 𝑟 + 𝛾 Û(𝑠′)
Fei Fang 12/8/2018
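For reference, here is a small Python sketch of this procedure (an illustrative implementation, not the course's code); because the policy is fixed, it simply replays logged trials, each assumed to be a list of (state, reward) pairs whose last entry is the terminal state:

```python
def td_policy_evaluation(trials, alpha, gamma):
    """TD(0) evaluation of a fixed policy from logged trials.

    trials: list of trajectories generated by following pi, each a list
            of (state, reward) pairs whose last entry is the terminal state.
    """
    U = {}
    for trajectory in trials:
        for t, (s, r) in enumerate(trajectory):
            U.setdefault(s, 0.0)
            if t + 1 < len(trajectory):
                s_next = trajectory[t + 1][0]
                U.setdefault(s_next, 0.0)
                # U(s) <- (1 - alpha) U(s) + alpha (r + gamma U(s'))
                U[s] = (1 - alpha) * U[s] + alpha * (r + gamma * U[s_next])
            else:
                # Terminal state: only its own reward is used.
                U[s] = (1 - alpha) * U[s] + alpha * r
    return U
```

With 𝛼 = 0.1 and 𝛾 = 1, replaying the first trial under this encoding reproduces the updates in the example that follows (e.g., Û((1,1)) becomes −0.001 after the first step).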
49
Example: Initialize Û(𝑠) to be 0; 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
Table of estimates, all initialized to 0: states (1,1), (1,2), (1,3), (4,1), (2,1), (2,3), (3,1), (3,2), (3,3)
Fei Fang 12/8/2018
50
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: all Û values are 0 before this update
Start at (1,1). 𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
Û((1,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
51
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
Û((2,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
52
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001, Û((2,1)) = −0.001; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(3,1)
Û((2,1)) = 0.9·(−0.001) + 0.1·(−0.01 + 1·0) = −0.0019
Fei Fang 12/8/2018
53
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001, Û((2,1)) = −0.0019; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(3,1)
𝑠=(3,1), action=tright; Reward=−0.01; end up at 𝑠′=(4,1)
Û((3,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
54
Quiz 1 𝛼=0.1,𝛾=1 𝑈 𝑠 ← 1−𝛼 𝑈 𝑠 +𝛼(𝑟+𝛾 𝑈 𝑠 ′ ) Environment After the first trial, what is 𝑈 4,1 ? Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate A: ; B: -0.01; C: -1; D: Fei Fang 12/8/2018
55
Temporal-Difference Learning
Impact of 𝛼
In practice, we often decrease the value of 𝛼 as learning progresses
Keep a counter 𝑁[𝑠] for each state, representing how many times the state has been visited
𝛼 changes as 𝑁[𝑠] increases, i.e., 𝛼 is a function of 𝑁[𝑠]:
Û(𝑠) ← (1 − 𝛼(𝑁[𝑠])) Û(𝑠) + 𝛼(𝑁[𝑠]) (𝑟 + 𝛾 Û(𝑠′))
Fei Fang 12/8/2018
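A tiny sketch of such a state-dependent learning rate (illustrative only; 𝛼(𝑁) = 1/𝑁 is just one simple choice, which turns the update into a running mean of the sampled targets):

```python
from collections import defaultdict

visit_count = defaultdict(int)       # N[s]: number of visits to state s

def alpha(n):
    # One simple decaying schedule; any schedule that decays appropriately
    # with the visit count can be substituted here.
    return 1.0 / n

def td_update(U, s, r, s_next, gamma):
    """One TD update with a learning rate that decays per state."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    visit_count[s] += 1
    a = alpha(visit_count[s])
    U[s] = (1 - a) * U[s] + a * (r + gamma * U[s_next])
```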
56
Temporal-Difference Learning
Also, we may set Û(𝑠) to be the sample reward when we see 𝑠 for the first time (see textbook)
TD Algorithm
  Initialize the estimate of 𝑈(𝑠) as Û(𝑠) ← 0, ∀𝑠
  Repeat (for each episode/trial)
    Initialize state 𝑠
    Repeat (for each step of the episode)
      Take action 𝑎 = 𝜋(𝑠)
      Observe reward 𝑟 and next state 𝑠′
      Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
      𝑠 ← 𝑠′
    Until 𝑠 is a terminal state
    Observe reward 𝑟 of the terminal state
    Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼·𝑟
  Until 𝐾 episodes/trials have been run
  Return Û(𝑠)
Fei Fang 12/8/2018
57
Summary
Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
58
Summary
Reinforcement Learning (RL)
Passive RL
  Model-based Passive RL: Estimate 𝑃 and 𝑅 through sampling
  Model-free Passive RL: Direct Utility Estimation, TD Learning
Fei Fang 12/8/2018
59
Next lecture: Active RL
Passive RL: Policy 𝜋 is given
However… the agent ultimately wants to learn how to act to gather high reward in the environment. Following a fixed deterministic policy gives the agent no experience with other actions (those not included in the policy)
Next lecture: Active RL, where the agent decides what action to take, with the goal of learning an optimal policy
Fei Fang 12/8/2018
60
Acknowledgment Some slides are borrowed from previous slides made by Tai Sing Lee and Zico Kolter Fei Fang 12/8/2018
61
Backup Slides Recap of Value Iteration and Policy Iteration Fei Fang 12/8/2018
62
Recap: Markov Decision Process
States 𝑆
Actions 𝐴 (sometimes different states have different available actions; denote these as 𝐴(𝑠))
Markovian transition model 𝑃(𝑠′|𝑠,𝑎)
Reward function 𝑅(𝑠), 𝑅(𝑠,𝑎), or 𝑅(𝑠,𝑎,𝑠′)
Discount factor 𝛾 ∈ [0,1]
Fei Fang 12/8/2018
63
Recap: Policy in MDP (which action to take at each state)
𝜋(𝑠): 𝑆 → 𝐴 if deterministic policy
𝜋(𝑠,𝑎): 𝑆 × 𝐴 → [0,1] if stochastic policy, satisfying $\sum_a \pi(s,a) = 1$
Fei Fang 12/8/2018
64
Recap
Bellman Equation for an MDP given policy 𝜋 (not necessarily the optimal policy):
$$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$$
Clearly 𝑈^𝜋(𝑠) exists when 𝛾 < 1; the equation above is just a necessary condition for 𝑈^𝜋(𝑠).
Fei Fang 12/8/2018
65
Recap: Bellman Optimality Equation
Recall $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
If 𝜋* is an optimal policy and 𝑈*(𝑠) is the expected reward from state 𝑠 following 𝜋*, then 𝑈*(𝑠) should satisfy
$$U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$$
(immediate reward plus discounted future reward)
Why? When the policy is optimal, one always chooses the action with the highest expected utility:
$$\pi^*(s) = \arg\max_a \Big( R(s) + \gamma \sum_{s'} P(s'\mid s,a)\, U^*(s') \Big) = \arg\max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$$
So $U^*(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi^*(s))\, U^*(s') = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$
Fei Fang 12/8/2018
66
Recap: Finding the optimal policy through Value Iteration
Recall $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$ and $U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$
Bellman Update: $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U_i(s')$
Theorem: Value iteration converges to 𝑈*
Fei Fang 12/8/2018
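For concreteness, a compact Python sketch of value iteration (illustrative only; it assumes the model is stored as nested dicts P[s][a][s2] for transition probabilities and R[s] for rewards, with actions[s] empty for terminal states):

```python
def value_iteration(states, actions, P, R, gamma, n_iters=1000, tol=1e-9):
    """Bellman update: U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(n_iters):
        U_new = {}
        for s in states:
            if not actions[s]:                       # terminal state
                U_new[s] = R[s]
                continue
            best = max(sum(p * U[s2] for s2, p in P[s][a].items())
                       for a in actions[s])
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new
    return U
```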
67
Recap
Bellman Optimality Equation: $U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$ (immediate reward plus discounted future reward)
Bellman Update (Value Iteration): $U_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U_i(s')$
How should the Bellman Update change if the reward function is given as 𝑅(𝑠,𝑎) or 𝑅(𝑠,𝑎,𝑠′)?
Given 𝑅(𝑠,𝑎): $U_{i+1}(s) \leftarrow \max_a \big\{ R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, U_i(s') \big\}$
Given 𝑅(𝑠,𝑎,𝑠′): $U_{i+1}(s) \leftarrow \max_a \big\{ \sum_{s'} P(s'\mid s,a)\, \big( R(s,a,s') + \gamma\, U_i(s') \big) \big\}$
Fei Fang 12/8/2018
68
Recap: Finding the optimal policy through Policy Iteration
Initialize policy 𝜋
Evaluate 𝜋 and compute or estimate 𝑈^𝜋
  (Exact) Solve the linear equations $U(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U(s')$
  (Estimate) Run a few iterations of the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$ (modified policy iteration)
  (Optional) Update for a subset of states only (asynchronous policy iteration)
Update policy 𝜋 to be the greedy policy w.r.t. 𝑈^𝜋: $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s'\mid s,a)\, U^\pi(s')$
Theorem: Policy iteration converges to 𝜋*
Fei Fang 12/8/2018
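And a matching Python sketch of policy iteration with approximate evaluation (illustrative only; same assumed model encoding as the value-iteration sketch above):

```python
def policy_iteration(states, actions, P, R, gamma, eval_iters=50):
    """Alternate approximate policy evaluation with greedy improvement."""
    # Start from an arbitrary policy: first available action in each state.
    pi = {s: (actions[s][0] if actions[s] else None) for s in states}
    while True:
        # Policy evaluation: a few sweeps of the simplified Bellman update.
        U = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            U_new = {}
            for s in states:
                if pi[s] is None:                    # terminal state
                    U_new[s] = R[s]
                else:
                    U_new[s] = R[s] + gamma * sum(
                        p * U[s2] for s2, p in P[s][pi[s]].items())
            U = U_new
        # Policy improvement: act greedily with respect to U.
        new_pi = {s: (max(actions[s],
                          key=lambda a: sum(p * U[s2]
                                            for s2, p in P[s][a].items()))
                      if actions[s] else None)
                  for s in states}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```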
69
Recap
Simplified Bellman Update (Policy Iteration): $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
How should the simplified Bellman Update change if the reward function is given as 𝑅(𝑠,𝑎) or 𝑅(𝑠,𝑎,𝑠′)?
Given 𝑅(𝑠,𝑎): $U_{i+1}(s) \leftarrow R(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
Given 𝑅(𝑠,𝑎,𝑠′): $U_{i+1}(s) \leftarrow \sum_{s'} P(s'\mid s,\pi(s))\, \big( R(s,\pi(s),s') + \gamma\, U_i(s') \big)$
Note that from the last update rule for 𝑅(𝑠,𝑎,𝑠′), you can easily derive the rule for 𝑅(𝑠) or 𝑅(𝑠,𝑎), as they are just special cases where 𝑅(𝑠,𝑎,𝑠′) is the same for different 𝑎 and/or 𝑠′
Fei Fang 12/8/2018