1
Artificial Intelligence: Representation and Problem Solving
Sequential Decision Making (3): Passive Reinforcement Learning / 681
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Wean Hall 4126
12/8/2018
2
Recap
MDP: (𝑆, 𝐴, 𝑃, 𝑅)
Policy 𝜋(𝑠): 𝑆 → 𝐴 if deterministic policy
Find optimal policy: value iteration or policy iteration
You know exactly how the world works!
Fei Fang 12/8/2018
3
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
4
What is Reinforcement Learning
Reinforcement Learning: Learn an optimal policy for the environment without having a complete model
Learn through trial and error
We don't have a simulator! We have to actually learn what happens when we take an action in a state
Fei Fang 12/8/2018
5
What is Reinforcement Learning
Reinforcement Learning: Learn an optimal policy for the environment without having a complete model
MDP: (𝑆, 𝐴, 𝑃, 𝑅); Goal: Find policy 𝜋
RL: Don't know 𝑃 or 𝑅 (or they are just hard to enumerate); Goal: Find policy 𝜋
Fei Fang 12/8/2018
6
What is Reinforcement Learning
The agent can "sense" the environment (it knows the state) and has goals
Learning from interaction with the environment
Trial-and-error search
(Delayed) rewards (advisory signals ≠ error signals)
What actions to take
Exploration-exploitation dilemma
Fei Fang 12/8/2018
7
RL Applications / Examples
Bipedal Robot Learn to Walk The following video shows research highlights from Toddler, a simple 3D dynamic biped that was able to quickly and reliably learn to walk. The beginning of the video demonstrates the robot's ability to walk passively downhill on a treadmill with the computer turned off. The robot was then placed on flat terrain with the computer switched on, and tasked with acquiring the same gait without assistance from gravity, but rather by learning a feedback controller. This learning occurred in less than 20 minutes, using only trials implemented on the real robot (no simulations). The learning algorithm continues to quickly adapt as the robot walks over different terrain. Sebastian Seung Lab at MIT. Fei Fang 12/8/2018
8
RL Applications / Examples
Helicopter Manoeuvres Fei Fang 12/8/2018
9
RL Applications / Examples
Learn to Play Atari Games Fei Fang 12/8/2018
10
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
11
Passive RL
Passive Learning: The agent's policy 𝜋 is fixed; learn the state value 𝑈(𝑠) without knowing the transition model 𝑃(𝑠′|𝑠,𝑎) or the reward function 𝑅(𝑠) ahead of time
Recall that in policy iteration, we have already learned how to evaluate a policy, i.e., compute 𝑈(𝑠) given 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠)
Fei Fang 12/8/2018
12
Passive Reinforcement Learning
Two Approaches
Model-based: Build a model, then evaluate
Model-free: Directly evaluate without building a model
[Diagram: the agent follows the given policy 𝜋, sending actions to the environment and observing states and rewards; the transition model and reward model are unknown ("P(s′|s,a)=?, R(s)=?, …")]
Remember, we know 𝑆 and 𝐴, just not 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠).
Fei Fang 12/8/2018
13
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
14
Model-Based Passive Reinforcement Learning
Follow policy 𝜋, perform many trials/experiments to get sample sequences
Estimate the MDP model parameters 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠) from the observed transitions and rewards
If the sets of states and actions are finite, we can simply count and average the counts
Use the estimated MDP to evaluate policy 𝜋
Fei Fang 12/8/2018
15
Model-Based Passive Reinforcement Learning
You are in an environment (you may not know the details) You are given a policy 𝜋 Fei Fang 12/8/2018
16
Example Let’s start the trials! Environment Start at (1,1) Policy
Fei Fang 12/8/2018
17
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
18
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) Policy Fei Fang 12/8/2018
19
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
20
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) Tried but didn’t move Policy Fei Fang 12/8/2018
21
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Try again Policy Fei Fang 12/8/2018
22
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) Policy Fei Fang 12/8/2018
23
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Fei Fang 12/8/2018
24
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) Fei Fang 12/8/2018
25
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Fei Fang 12/8/2018
26
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Fei Fang 12/8/2018
27
Example Let’s start the trials! Environment Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tright) = 1/2 Fei Fang 12/8/2018
28
Example Let’s start the trials! Environment We can run more trials!
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01 Policy Fei Fang 12/8/2018
29
Example Let’s start the trials! Environment We can run more trials!
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 2: (1,1) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 3: (1,1) → (2,1) → (3,1) → (3,2) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Policy
Estimate 𝑃((2,1) | (2,1), tright) = 1/4
Estimate 𝑅((2,1)) = −0.01
Fei Fang 12/8/2018
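To see where these numbers come from (a quick count over just the three trials listed): tright is taken in state (2,1) four times (twice in Trial 1, once each in Trials 2 and 3), and exactly one of those attempts leaves the agent in (2,1), so
$$\hat{P}\big((2,1)\mid(2,1),\mathrm{tright}\big) = \frac{N\big((2,1),\mathrm{tright},(2,1)\big)}{N\big((2,1),\mathrm{tright}\big)} = \frac{1}{4}, \qquad \hat{R}\big((2,1)\big) = \frac{4\times(-0.01)}{4} = -0.01.$$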
30
Model-Based Passive Reinforcement Learning
Empirical estimates of the transition probability and reward:
$$\hat{P}(s'\mid s,a) = \frac{N(s_i=s,\ a_i=a,\ s_{i+1}=s')}{N(s_i=s,\ a_i=a)}, \qquad \hat{R}(s) = \frac{\sum_{i:\, s_i=s} r_i}{N(s_i=s)}$$
Do the trials give us all the parameters of the MDP?
Fei Fang 12/8/2018
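As a concrete illustration, here is a minimal Python sketch of this counting estimator (an illustrative implementation, not the course's code; it assumes each trial has been logged as a list of (s, a, r, s_next) tuples, with a = None and s_next = None at the terminal step):

```python
from collections import defaultdict

def estimate_model(trials):
    """Estimate P(s'|s,a) and R(s) by counting over observed transitions.

    trials: list of trajectories, each a list of (s, a, r, s_next) tuples,
            with a = None and s_next = None at the terminal step.
    """
    sa_counts = defaultdict(int)      # N(s_i = s, a_i = a)
    sas_counts = defaultdict(int)     # N(s_i = s, a_i = a, s_{i+1} = s')
    reward_sum = defaultdict(float)   # sum of rewards observed in state s
    visit_counts = defaultdict(int)   # N(s_i = s)

    for trajectory in trials:
        for (s, a, r, s_next) in trajectory:
            reward_sum[s] += r
            visit_counts[s] += 1
            if a is not None:         # skip the terminal step
                sa_counts[(s, a)] += 1
                sas_counts[(s, a, s_next)] += 1

    P_hat = {(s, a, s_next): n / sa_counts[(s, a)]
             for (s, a, s_next), n in sas_counts.items()}
    R_hat = {s: reward_sum[s] / visit_counts[s] for s in visit_counts}
    return P_hat, R_hat
```

Fed the three trials above (under this logging convention), it would reproduce the estimates 𝑃̂((2,1)|(2,1), tright) = 1/4 and 𝑅̂((2,1)) = −0.01.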
31
Model-Based Passive Reinforcement Learning
Environment Recall the first trial Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tup) = ? Never seen Fei Fang 12/8/2018
32
Model-Based Passive Reinforcement Learning
Environment Recall all trials
Trial 1: (1,1) → (2,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Trial 2: (1,1) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Policy
Trial 3: (1,1) → (2,1) → (3,1) → (3,2) → (4,2). Terminal state with reward −1; intermediate steps return reward −0.01
Estimate 𝑃(𝑠′|𝑠,𝑎): 𝑃((2,1) | (2,1), tup) = ? Never taken due to 𝜋
Fei Fang 12/8/2018
33
Model-Based Passive Reinforcement Learning
Given estimates of the transition model 𝑃 and reward model 𝑅, we can do MDP policy evaluation to compute the value of our policy
(Exact) Solve the linear equations $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
(Estimate) Run a few iterations of the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
Do missing values matter for computing the policy value?
Fei Fang 12/8/2018
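Below is one possible sketch of the "estimate" route in Python (illustrative, not the course's code), reusing the P_hat / R_hat dictionaries produced by the counting sketch earlier; policy is assumed to be a dict mapping each non-terminal state to its action. Note how states and transitions that were never observed simply contribute nothing, which is exactly where the "missing values" question bites.

```python
def evaluate_policy(P_hat, R_hat, policy, gamma, states, n_iters=100):
    """Approximate policy evaluation with the estimated model:
    U_{i+1}(s) <- R_hat(s) + gamma * sum_{s'} P_hat(s'|s, pi(s)) * U_i(s').

    States that were never visited keep R_hat = 0, and transitions that
    were never observed are implicitly treated as probability 0.
    """
    U = {s: 0.0 for s in states}
    for _ in range(n_iters):
        U_new = {}
        for s in states:
            a = policy.get(s)            # None for terminal states
            if a is None:
                U_new[s] = R_hat.get(s, 0.0)
                continue
            expected_next = sum(p * U.get(s_next, 0.0)
                                for (s0, a0, s_next), p in P_hat.items()
                                if s0 == s and a0 == a)
            U_new[s] = R_hat.get(s, 0.0) + gamma * expected_next
        U = U_new
    return U
```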
34
Exercise Environment If you complete 10 trials based on policy 𝜋, are the computed state values under policy 𝜋, i.e., 𝑈^𝜋(𝑠), likely to be correct? Policy Fei Fang 12/8/2018
35
Model-Based Passive Reinforcement Learning
Advantage: Makes good use of the data
Disadvantage: Requires building the actual MDP model; intractable if the state space is too large
Fei Fang 12/8/2018
36
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
37
Passive Reinforcement Learning
Two Approaches
Model-based: Build a model, then evaluate
Model-free: Directly evaluate without building a model
[Diagram: the agent follows the given policy 𝜋, observing states, actions, and rewards; the goal is to estimate 𝑈^𝜋(𝑠) directly, without learning the transition or reward model]
Remember, we know 𝑆 and 𝐴, just not 𝑃(𝑠′|𝑠,𝑎) and 𝑅(𝑠).
Fei Fang 12/8/2018
38
Model-Free Passive Reinforcement Learning
Environment Recall the first trial 𝛾=1 Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑈 𝜋 (𝑠): 𝑈 𝜋 (4,2) =? 𝑈 𝜋 (3,2) =? Fei Fang 12/8/2018
39
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
40
Direct Utility Estimation
Reward-to-go: Expected total reward from that state onward
When a trial hits a state, view it as a sample of the total reward from that state onward
Compute the average as the reward-to-go:
$$\hat{U}(s) = \frac{\sum_{i:\, s_i = s} (\text{discounted reward from } s_i \text{ onward})}{N(s_i = s)}$$
Fei Fang 12/8/2018
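A minimal Python sketch of this direct utility estimation (every-visit averaging of the discounted reward-to-go; an illustrative implementation assuming each trial is logged as a list of (state, reward) pairs ending at the terminal state):

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma):
    """Every-visit estimate of U^pi(s): average the discounted
    reward-to-go over all visits to s.

    trials: list of trajectories, each a list of (state, reward) pairs
            ending at the terminal state.
    """
    total_return = defaultdict(float)
    visit_count = defaultdict(int)

    for trajectory in trials:
        G = 0.0                      # reward-to-go, built backwards
        returns = []
        for (s, r) in reversed(trajectory):
            G = r + gamma * G        # G_t = r_t + gamma * G_{t+1}
            returns.append((s, G))
        for (s, G) in returns:
            total_return[s] += G
            visit_count[s] += 1

    return {s: total_return[s] / visit_count[s] for s in visit_count}
```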
41
Example Recall the first trial Environment 𝛾=1 Start at (1,1)
𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Estimate 𝑈 𝜋 ((2,1)): Fei Fang 12/8/2018
42
Example Environment Estimate 𝑈 𝜋 (2,1) given 𝛾=0 or 0.5? Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate Fei Fang 12/8/2018
43
Direct Utility Estimation
Disadvantages:
Need to wait until you reach a terminal state
Estimates 𝑈(𝑠) and 𝑈(𝑠′) separately, ignoring the relation $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
Converges very slowly
Fei Fang 12/8/2018
44
Outline
What is Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
45
Temporal-Difference Learning (TD Learning)
Key ideas:
Do not wait until the trial terminates; update 𝑈(𝑠) after each state transition
Use a running average, i.e., balance the previous estimate and the latest sample
Naturally, a more likely outcome 𝑠′ will contribute to the update more often
Recall how we estimate state values given the model and policy: using the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$, possibly updating only a subset of states
Fei Fang 12/8/2018
46
Temporal-Difference Learning
Suppose we are in one step of a trial: the agent is in state 𝑠, takes action 𝑎 = 𝜋(𝑠), gets reward 𝑟, and ends up in state 𝑠′
Can we update just 𝑈(𝑠) using the simplified Bellman update? No, because we don't know 𝑃(𝑠″|𝑠,𝑎)
However, 𝑠′ is sampled from the distribution 𝑃(𝑠″|𝑠,𝑎). Can we directly estimate 𝑈_{i+1}(𝑠) from this single sample? Not good: a single sample can have very high variance, so 𝑈_{i+1}(𝑠) could be very different from 𝑈_i(𝑠)
TD Learning: Make use of 𝑈_i(𝑠) to "smooth" the update (a running average)
Fei Fang 12/8/2018
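Written out, the running-average update introduced on the next slide is equivalent to nudging the old estimate toward the one-sample target by the "TD error":
$$\hat{U}(s) \leftarrow (1-\alpha)\,\hat{U}(s) + \alpha\big(r + \gamma\,\hat{U}(s')\big) \;=\; \hat{U}(s) + \alpha\big(r + \gamma\,\hat{U}(s') - \hat{U}(s)\big).$$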
47
Exponential Moving Average
$$\hat{U}_{i+1}(s) \leftarrow (1-\alpha)\,\hat{U}_i(s) + \alpha\big(r + \gamma\,\hat{U}_i(s')\big)$$
Let $x_0 = 0$ and $x_{i+1} = (1-\alpha)\,x_i + \alpha\,\hat{x}_{i+1}$, where $\hat{x}_{i+1}$ is the $(i+1)$-th sample.
Makes more recent samples more important. Forgets about the past.
Fei Fang 12/8/2018
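Unrolling the recursion (a short derivation to make the "forgetting" explicit): with $x_0 = 0$,
$$x_n = \alpha \sum_{i=1}^{n} (1-\alpha)^{\,n-i}\,\hat{x}_i,$$
so the most recent sample gets weight $\alpha$, the one before it $\alpha(1-\alpha)$, and so on, with older samples discounted geometrically.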
48
Temporal-Difference Learning
TD Algorithm
  Initialize the estimate of 𝑈(𝑠) as Û(𝑠) ← 0, ∀𝑠
  Repeat (for each episode/trial)
    Initialize state 𝑠
    Repeat (for each step of the episode)
      Take action 𝑎 = 𝜋(𝑠)
      Observe reward 𝑟 and next state 𝑠′
      Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
      𝑠 ← 𝑠′
    Until 𝑠 is a terminal state
    Observe reward 𝑟 of the terminal state
    Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼·𝑟
  Until 𝐾 episodes/trials have been run
  Return Û(𝑠)
𝛼 ∈ [0,1] is a fixed parameter determining how much weight we give to the old value Û(𝑠) versus the new sample 𝑟 + 𝛾 Û(𝑠′)
Fei Fang 12/8/2018
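For reference, here is a small Python sketch of this procedure (an illustrative implementation, not the course's code); because the policy is fixed, it simply replays logged trials, each assumed to be a list of (state, reward) pairs whose last entry is the terminal state:

```python
def td_policy_evaluation(trials, alpha, gamma):
    """TD(0) evaluation of a fixed policy from logged trials.

    trials: list of trajectories generated by following pi, each a list
            of (state, reward) pairs whose last entry is the terminal state.
    """
    U = {}
    for trajectory in trials:
        for t, (s, r) in enumerate(trajectory):
            U.setdefault(s, 0.0)
            if t + 1 < len(trajectory):
                s_next = trajectory[t + 1][0]
                U.setdefault(s_next, 0.0)
                # U(s) <- (1 - alpha) U(s) + alpha (r + gamma U(s'))
                U[s] = (1 - alpha) * U[s] + alpha * (r + gamma * U[s_next])
            else:
                # Terminal state: only its own reward is used.
                U[s] = (1 - alpha) * U[s] + alpha * r
    return U
```

With 𝛼 = 0.1 and 𝛾 = 1, replaying the first trial under this encoding reproduces the updates in the example that follows (e.g., Û((1,1)) becomes −0.001 after the first step).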
49
Example: Initialize Û(𝑠) to be 0; 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
Table of estimates, all initialized to 0: states (1,1), (1,2), (1,3), (4,1), (2,1), (2,3), (3,1), (3,2), (3,3)
Fei Fang 12/8/2018
50
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: all Û values are 0 before this update
Start at (1,1). 𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
Û((1,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
51
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
Û((2,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
52
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001, Û((2,1)) = −0.001; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(3,1)
Û((2,1)) = 0.9·(−0.001) + 0.1·(−0.01 + 1·0) = −0.0019
Fei Fang 12/8/2018
53
Example: 𝛼 = 0.1, 𝛾 = 1
Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
Environment / Policy (grids as before)
State table: Û((1,1)) = −0.001, Û((2,1)) = −0.0019; all other entries 0
𝑠=(1,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(2,1)
𝑠=(2,1), action=tright; Reward=−0.01; end up at 𝑠′=(3,1)
𝑠=(3,1), action=tright; Reward=−0.01; end up at 𝑠′=(4,1)
Û((3,1)) = 0.9·0 + 0.1·(−0.01 + 1·0) = −0.001
Fei Fang 12/8/2018
54
Quiz 1 𝛼=0.1,𝛾=1 𝑈 𝑠 ← 1−𝛼 𝑈 𝑠 +𝛼(𝑟+𝛾 𝑈 𝑠 ′ ) Environment After the first trial, what is 𝑈 4,1 ? Start at (1,1) 𝑠=(1,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(2,1) 𝑠=(2,1) action=tright (try going right) based on 𝜋 Reward=−0.01; End up at 𝑠′=(3,1) 𝑠=(3,1) action=tright (try going right) based on 𝜋 Policy Reward=−0.01; End up at 𝑠′=(4,1) 𝑠=(4,1) action=tup (try going up) based on 𝜋 Reward=−0.01; End up at 𝑠′=(4,2) 𝑠=(4,2) No action available. Reward=−1; Terminate A: ; B: -0.01; C: -1; D: Fei Fang 12/8/2018
55
Temporal-Difference Learning
Impact of 𝛼
In practice, we often decrease the value of 𝛼 as learning progresses
Keep a counter 𝑁[𝑠] for each state, representing how many times the state has been visited
𝛼 changes as 𝑁[𝑠] increases, i.e., 𝛼 is a function of 𝑁[𝑠]:
Û(𝑠) ← (1 − 𝛼(𝑁[𝑠])) Û(𝑠) + 𝛼(𝑁[𝑠]) (𝑟 + 𝛾 Û(𝑠′))
Fei Fang 12/8/2018
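A tiny sketch of such a state-dependent learning rate (illustrative only; 𝛼(𝑁) = 1/𝑁 is just one simple choice, which turns the update into a running mean of the sampled targets):

```python
from collections import defaultdict

visit_count = defaultdict(int)       # N[s]: number of visits to state s

def alpha(n):
    # One simple decaying schedule; any schedule that decays appropriately
    # with the visit count can be substituted here.
    return 1.0 / n

def td_update(U, s, r, s_next, gamma):
    """One TD update with a learning rate that decays per state."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    visit_count[s] += 1
    a = alpha(visit_count[s])
    U[s] = (1 - a) * U[s] + a * (r + gamma * U[s_next])
```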
56
Temporal-Difference Learning
Also, we may set Û(𝑠) to be the sample reward when we see 𝑠 for the first time (see textbook)
TD Algorithm
  Initialize the estimate of 𝑈(𝑠) as Û(𝑠) ← 0, ∀𝑠
  Repeat (for each episode/trial)
    Initialize state 𝑠
    Repeat (for each step of the episode)
      Take action 𝑎 = 𝜋(𝑠)
      Observe reward 𝑟 and next state 𝑠′
      Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼 (𝑟 + 𝛾 Û(𝑠′))
      𝑠 ← 𝑠′
    Until 𝑠 is a terminal state
    Observe reward 𝑟 of the terminal state
    Update Û(𝑠): Û(𝑠) ← (1−𝛼) Û(𝑠) + 𝛼·𝑟
  Until 𝐾 episodes/trials have been run
  Return Û(𝑠)
Fei Fang 12/8/2018
57
Summary
Reinforcement Learning
Passive RL
  Model-based Passive RL
  Model-free Passive RL
    Direct Utility Estimation
    Temporal Difference Learning
Fei Fang 12/8/2018
58
Summary
Reinforcement Learning (RL)
Passive RL
  Model-based Passive RL: Estimate 𝑃 and 𝑅 through sampling
  Model-free Passive RL: Direct Utility Estimation, TD Learning
Fei Fang 12/8/2018
59
Next lecture: Active RL
Passive RL: Policy 𝜋 is given
However… the agent ultimately wants to learn how to act to gather high reward in the environment. Following a fixed deterministic policy gives the agent no experience with other actions (those not included in the policy)
Next lecture: Active RL, where the agent decides what action to take, with the goal of learning an optimal policy
Fei Fang 12/8/2018
60
Acknowledgment Some slides are borrowed from previous slides made by Tai Sing Lee and Zico Kolter Fei Fang 12/8/2018
61
Backup Slides Recap of Value Iteration and Policy Iteration Fei Fang 12/8/2018
62
Recap: Markov Decision Process
States 𝑆
Actions 𝐴 (sometimes different states have different available actions; denote these as 𝐴(𝑠))
Markovian transition model 𝑃(𝑠′|𝑠,𝑎)
Reward function 𝑅(𝑠), 𝑅(𝑠,𝑎), or 𝑅(𝑠,𝑎,𝑠′)
Discount factor 𝛾 ∈ [0,1]
Fei Fang 12/8/2018
63
Recap: Policy in MDP (which action to take at each state)
𝜋(𝑠): 𝑆 → 𝐴 if deterministic policy
𝜋(𝑠,𝑎): 𝑆 × 𝐴 → [0,1] if stochastic policy, satisfying $\sum_a \pi(s,a) = 1$
Fei Fang 12/8/2018
64
Recap
Bellman Equation for an MDP given policy 𝜋 (not necessarily the optimal policy):
$$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$$
Clearly 𝑈^𝜋(𝑠) exists when 𝛾 < 1; the equation above is just a necessary condition for 𝑈^𝜋(𝑠).
Fei Fang 12/8/2018
65
Recap: Bellman Optimality Equation
Recall $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$
If 𝜋* is an optimal policy and 𝑈*(𝑠) is the expected reward from state 𝑠 following 𝜋*, then 𝑈*(𝑠) should satisfy
$$U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$$
(immediate reward plus discounted future reward)
Why? When the policy is optimal, one always chooses the action with the highest expected utility:
$$\pi^*(s) = \arg\max_a \Big( R(s) + \gamma \sum_{s'} P(s'\mid s,a)\, U^*(s') \Big) = \arg\max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$$
So $U^*(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi^*(s))\, U^*(s') = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$
Fei Fang 12/8/2018
66
Recap: Finding the optimal policy through Value Iteration
Recall $U^\pi(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U^\pi(s')$ and $U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$
Bellman Update: $U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U_i(s')$
Theorem: Value iteration converges to 𝑈*
Fei Fang 12/8/2018
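For concreteness, a compact Python sketch of value iteration (illustrative only; it assumes the model is stored as nested dicts P[s][a][s2] for transition probabilities and R[s] for rewards, with actions[s] empty for terminal states):

```python
def value_iteration(states, actions, P, R, gamma, n_iters=1000, tol=1e-9):
    """Bellman update: U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(n_iters):
        U_new = {}
        for s in states:
            if not actions[s]:                       # terminal state
                U_new[s] = R[s]
                continue
            best = max(sum(p * U[s2] for s2, p in P[s][a].items())
                       for a in actions[s])
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new
    return U
```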
67
Recap
Bellman Optimality Equation: $U^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U^*(s')$ (immediate reward plus discounted future reward)
Bellman Update (Value Iteration): $U_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} P(s'\mid s,a)\, U_i(s')$
How should the Bellman Update change if the reward function is given as 𝑅(𝑠,𝑎) or 𝑅(𝑠,𝑎,𝑠′)?
Given 𝑅(𝑠,𝑎): $U_{i+1}(s) \leftarrow \max_a \big\{ R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\, U_i(s') \big\}$
Given 𝑅(𝑠,𝑎,𝑠′): $U_{i+1}(s) \leftarrow \max_a \big\{ \sum_{s'} P(s'\mid s,a)\, \big( R(s,a,s') + \gamma\, U_i(s') \big) \big\}$
Fei Fang 12/8/2018
68
Recap: Finding the optimal policy through Policy Iteration
Initialize policy 𝜋
Evaluate 𝜋 and compute or estimate 𝑈^𝜋
  (Exact) Solve the linear equations $U(s) = R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U(s')$
  (Estimate) Run a few iterations of the simplified Bellman update $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$ (modified policy iteration)
  (Optional) Update for a subset of states only (asynchronous policy iteration)
Update policy 𝜋 to be the greedy policy w.r.t. 𝑈^𝜋: $\pi(s) \leftarrow \arg\max_a \sum_{s'} P(s'\mid s,a)\, U^\pi(s')$
Theorem: Policy iteration converges to 𝜋*
Fei Fang 12/8/2018
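And a matching Python sketch of policy iteration with approximate evaluation (illustrative only; same assumed model encoding as the value-iteration sketch above):

```python
def policy_iteration(states, actions, P, R, gamma, eval_iters=50):
    """Alternate approximate policy evaluation with greedy improvement."""
    # Start from an arbitrary policy: first available action in each state.
    pi = {s: (actions[s][0] if actions[s] else None) for s in states}
    while True:
        # Policy evaluation: a few sweeps of the simplified Bellman update.
        U = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            U_new = {}
            for s in states:
                if pi[s] is None:                    # terminal state
                    U_new[s] = R[s]
                else:
                    U_new[s] = R[s] + gamma * sum(
                        p * U[s2] for s2, p in P[s][pi[s]].items())
            U = U_new
        # Policy improvement: act greedily with respect to U.
        new_pi = {s: (max(actions[s],
                          key=lambda a: sum(p * U[s2]
                                            for s2, p in P[s][a].items()))
                      if actions[s] else None)
                  for s in states}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```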
69
Recap
Simplified Bellman Update (Policy Iteration): $U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
How should the simplified Bellman Update change if the reward function is given as 𝑅(𝑠,𝑎) or 𝑅(𝑠,𝑎,𝑠′)?
Given 𝑅(𝑠,𝑎): $U_{i+1}(s) \leftarrow R(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, U_i(s')$
Given 𝑅(𝑠,𝑎,𝑠′): $U_{i+1}(s) \leftarrow \sum_{s'} P(s'\mid s,\pi(s))\, \big( R(s,\pi(s),s') + \gamma\, U_i(s') \big)$
Note that from the last update rule for 𝑅(𝑠,𝑎,𝑠′), you can easily derive the rule for 𝑅(𝑠) or 𝑅(𝑠,𝑎), as they are just special cases where 𝑅(𝑠,𝑎,𝑠′) is the same for different 𝑎 and/or 𝑠′
Fei Fang 12/8/2018