1
Artificial Intelligence: Representation and Problem Solving
Sequential Decision Making (4): Active Reinforcement Learning / 681
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Wean Hall 4126, 12/8/2018
2
Recap
MDP: (S, A, T, R), i.e., we know exactly how the world works
RL: (S, A, ?, ?), i.e., we don't know how the world works
Policy π(s): S → A (if deterministic)
Find the optimal policy: value iteration or policy iteration
Passive RL: evaluate a given policy π
Model-based approach: estimate T and R from sample trials
Model-free approach: direct utility estimation, TD learning
3
Outline
Active RL
Model-based Active RL with random actions
Q-Value
Model-free Active RL: SARSA, Q-learning
Exploration vs Exploitation: ε-Greedy, Boltzmann policy
4
Model-Based Active RL with Random Actions
Choose actions randomly
Estimate T and R from sample trials (average counts)
Use the estimated T and R to compute an estimate of the optimal values and the optimal policy
Will the computed values and policy converge to the true optimal values and policy in the limit of infinite data?
Sufficient condition: all states are reachable from any other state, so the agent can visit each state and take each action as many times as it wants
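To make the count-based estimation concrete, here is a minimal Python sketch (not from the slides) that builds T and R from logged transitions; the (s, a, r, s′) tuple format and the function name are illustrative assumptions.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T(s, a, s') and R(s) from observed (s, a, r, s_next) tuples:
    T is the empirical transition frequency, R the average reward seen in each state."""
    sa_counts = defaultdict(int)       # N(s, a)
    sas_counts = defaultdict(int)      # N(s, a, s')
    reward_sums = defaultdict(float)   # sum of rewards observed in s
    state_counts = defaultdict(int)    # N(s)
    for s, a, r, s_next in transitions:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a, s_next)] += 1
        reward_sums[s] += r
        state_counts[s] += 1
    T = {(s, a, s_next): n / sa_counts[(s, a)]
         for (s, a, s_next), n in sas_counts.items()}
    R = {s: reward_sums[s] / n for s, n in state_counts.items()}
    return T, R
```

With the estimated T and R in hand, the value iteration or policy iteration machinery from the MDP setting can be run on the estimated model to obtain the values and policy mentioned above.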
5
Outline
Active RL
Model-based Active RL with random actions
Q-Value
Model-free Active RL: SARSA, Q-learning
Exploration vs Exploitation: ε-Greedy, Boltzmann policy
6
Q-Value
Recall Q-value: similar to the value function, but defined on a state-action pair
Q^π(s,a): expected total reward from state s onward if taking action a in state s and following policy π afterward
Bellman equation given policy π: V^π(s) = R(s) + γ Σ_{s′} P(s′ | s, π(s)) V^π(s′)
Bellman optimality condition, i.e., for the optimal policy π*: V*(s) = R(s) + γ max_a Σ_{s′} P(s′ | s, a) V*(s′)
That is, Q^π(s,a) = R(s) + γ Σ_{s′} P(s′ | s, a) V^π(s′)
Obviously V^π(s) = Q^π(s, π(s))
So Q^π(s,a) = R(s) + γ Σ_{s′} P(s′ | s, a) Q^π(s′, π(s′))
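As a small illustration of the identity relating Q^π and V^π, the sketch below computes Q^π(s,a) from a value function and a known (or estimated) model; representing P as a dict keyed by (s, a) is an assumption of this sketch, not notation from the slides.

```python
def q_from_v(V, P, R, gamma, s, a):
    """Q(s,a) = R(s) + γ Σ_{s'} P(s' | s, a) V(s'); P[(s, a)] maps s' to its probability."""
    return R[s] + gamma * sum(p * V[s_next] for s_next, p in P[(s, a)].items())
```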
7
Optimal Q-Value
Recall V^π(s) = R(s) + γ Σ_{s′} P(s′ | s, π(s)) V^π(s′)
And Q^π(s,a) = R(s) + γ Σ_{s′} P(s′ | s, a) V^π(s′)
And V^π(s) = Q^π(s, π(s))
When using the optimal policy π*, we take the action that leads to the maximum total utility at each state
Therefore π*(s) = argmax_a Q*(s,a) (or any probability distribution over the actions with the highest Q*(s,a) if there is a tie)
We have V*(s) = Q*(s, π*(s)) = max_a Q*(s,a)
And Q*(s,a) = R(s) + γ Σ_{s′} P(s′ | s, a) V*(s′) = R(s) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q*(s′,a′)
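The Bellman optimality equation for Q* also suggests a model-based companion to the model-free methods that follow: Q-value iteration on a known or estimated model. A hedged sketch, where `states`, `actions(s)`, and the dict-based `P` and `R` are assumed inputs rather than anything defined on the slides:

```python
def q_value_iteration(states, actions, P, R, gamma, num_iters=100):
    """Iterate Q(s,a) ← R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a') toward a fixed point."""
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(num_iters):
        Q = {(s, a): R[s] + gamma * sum(
                 p * max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)
                 for s2, p in P[(s, a)].items())
             for s in states for a in actions(s)}
    return Q
```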
8
Outline
Active RL
Model-based Active RL with random actions
Q-Value
Model-free Active RL: SARSA, Q-learning
Exploration vs Exploitation: ε-Greedy, Boltzmann policy
9
SARSA
Recall TD learning: given π, estimate V^π(s) through the update V^π(s) ← (1−α) V^π(s) + α(r + γ V^π(s′))
SARSA (State-Action-Reward-State-Action):
Initialize policy π
Given π, estimate Q^π(s,a) through the update Q^π(s,a) ← (1−α) Q^π(s,a) + α(r + γ Q^π(s′, π(s′)))
Update the policy based on Q^π(s,a) (exploitation vs exploration); with full exploitation, π′(s) = argmax_a Q^π(s,a)
Repeat until π′ converges (Is it guaranteed?)
Similar to Policy Iteration
On-policy algorithm, i.e., it estimates the value of a policy while following that policy to choose actions
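A minimal sketch of one SARSA episode, assuming a simple episodic environment interface (reset, actions, step) and an exploration policy passed in as a callable; none of these names come from the slides.

```python
def sarsa_episode(env, Q, alpha, gamma, behavior_policy):
    """Run one SARSA episode and update the tabular Q (a dict keyed by (state, action)).
    Assumed interface: env.reset() -> s, env.actions(s) -> list, env.step(a) -> (s', r, done);
    behavior_policy(Q, s, actions) returns the action to take (e.g., ε-greedy)."""
    s = env.reset()
    a = behavior_policy(Q, s, env.actions(s))
    done = False
    while not done:
        s_next, r, done = env.step(a)
        if done:
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * r  # no successor action to bootstrap from
            break
        # on-policy: bootstrap from the action the behavior policy actually takes at s'
        a_next = behavior_policy(Q, s_next, env.actions(s_next))
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
        s, a = s_next, a_next
    return Q
```

Here Q can be a collections.defaultdict(float), and behavior_policy can be the ε-greedy rule sketched later in the Exploration vs Exploitation section.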
10
On-Policy vs Off-Policy Methods
Two types of RL approaches:
On-policy methods attempt to evaluate or improve the policy that is used to make decisions
Off-policy methods evaluate or improve a policy different from the one used to generate the data
11
Outline
Active RL
Model-based Active RL with random actions
Q-Value
Model-free Active RL: SARSA, Q-learning
Exploration vs Exploitation: ε-Greedy, Boltzmann policy
12
Q-Learning
Recall Q*(s,a) = R(s) + γ Σ_{s′} P(s′ | s, a) max_{a′} Q*(s′,a′)
Q-Learning: similar to Value Iteration
Directly estimate Q*(s,a) through the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′))
Given the estimated Q*(s,a), derive an estimate of the optimal policy: π*(s) = argmax_a Q*(s,a)
13
Q-Learning
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′))
Q-Learning:
Initialize Q*(s,a) arbitrarily and Q*(terminal state s, null) = 0
Repeat (for each episode):
  Initialize state s
  Repeat (for each step of the episode):
    Choose an action a available at s following some exploration policy
    Take action a, observe reward r and next state s′
    Update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′))
    s ← s′
  Until s is a terminal state
  Observe reward r of the terminal state
  Update Q*(s, null) = r
Until K episodes/trials are run
Return policy π*(s) = argmax_a Q*(s,a)
Off-policy algorithm: the policy being evaluated (estimation policy) is unrelated to the policy being followed (behavior policy)
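The pseudocode above maps to the following Python sketch; the environment interface (reset, actions, step, terminal_reward) and the behavior_policy callable are assumptions of the sketch, not part of the slides.

```python
from collections import defaultdict

def q_learning(env, num_episodes, alpha, gamma, behavior_policy):
    """Tabular Q-learning following the pseudocode on this slide.
    Assumed interface: env.reset() -> s, env.actions(s) -> list, env.step(a) -> (s', r, done),
    env.terminal_reward(s) -> reward of a terminal state; behavior_policy explores."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = behavior_policy(Q, s, env.actions(s))
            s_next, r, done = env.step(a)
            # off-policy: bootstrap from the best action at s', regardless of what
            # the behavior policy will actually do there
            next_actions = ['null'] if done else env.actions(s_next)
            target = r + gamma * max(Q[(s_next, a2)] for a2 in next_actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
        # record the terminal state's own reward, as in the last step of the pseudocode
        Q[(s, 'null')] = env.terminal_reward(s)
    return Q
```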
14
Q-Learning Example
[State-transition diagram: states S1 through S5 and terminal state S6: END, connected by actions a12, a21, a14, a41, a23, a32, a25, a52, a45, a54, a56, a36]
6 states S1, ..., S6; 12 actions aij (aij goes from Si to Sj)
Deterministic state transitions (but you don't know this beforehand)
R = 100 in S6, R = 0 otherwise (again, you don't know this)
Use γ = 0.5, α = 1
Random behavior policy
15
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
[Q-table entries Q(S1,a12), Q(S1,a14), Q(S2,a21), Q(S2,a25), Q(S2,a23), Q(S3,a32), Q(S3,a36), Q(S4,a41), Q(S4,a45), Q(S5,a52), Q(S5,a54), Q(S5,a56), Q(S6,null), all initially 0]
Start at S1, available actions: a12, a14; probability of choosing each of them: 0.5
Choose a12; get reward 0, get to state S2
Update the Q-value: Q*(S1,a12) ← (1−α) Q*(S1,a12) + α(r + γ max_{a′ ∈ {a21, a23, a25}} Q*(S2,a′)) = 0
16
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
At S2, available actions: a21, a23, a25; probability of choosing each of them: 1/3
Choose a23; get reward 0, get to state S3
Update the Q-value: Q*(S2,a23) ← (1−α) Q*(S2,a23) + α(r + γ max_{a′ ∈ {a32, a36}} Q*(S3,a′)) = 0
17
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
At S3, available actions: a32, a36; probability of choosing each of them: 0.5
Choose a36; get reward 0, get to state S6
Update the Q-value: Q*(S3,a36) ← (1−α) Q*(S3,a36) + α(r + γ max_{a′ ∈ {null}} Q*(S6,a′)) = 0
18
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
Terminal state: get reward 100, so Q*(S6,null) ← 100
[Q-table now: Q(S6,null) = 100, all other entries 0]
19
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
Start a new episode!
Start at S2, available actions: a21, a23, a25; probability of choosing each of them: 1/3
Choose a23; get reward 0, get to state S3
Update the Q-value: Q*(S2,a23) ← (1−α) Q*(S2,a23) + α(r + γ max_{a′ ∈ {a32, a36}} Q*(S3,a′)) = 0
20
Q-Learning Example
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′)); R = 100 in S6, γ = 0.5, α = 1
At S3, available actions: a32, a36; probability of choosing each of them: 0.5
Choose a36; get reward 0, get to state S6
Update the Q-value: Q*(S3,a36) ← (1−α) Q*(S3,a36) + α(r + γ max_{a′ ∈ {null}} Q*(S6,a′)) = 50
[Q-table now: Q(S3,a36) = 50, Q(S6,null) = 100, all other entries 0]
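A quick numeric check of the two hand-worked episodes, hard-coding the trajectories above (S1 → S2 → S3 → S6, then S2 → S3 → S6) with α = 1 and γ = 0.5:

```python
from collections import defaultdict

gamma, alpha = 0.5, 1.0
Q = defaultdict(float)
actions_at = {'S2': ['a21', 'a23', 'a25'], 'S3': ['a32', 'a36'], 'S6': ['null']}

episodes = [
    [('S1', 'a12', 0, 'S2'), ('S2', 'a23', 0, 'S3'), ('S3', 'a36', 0, 'S6')],
    [('S2', 'a23', 0, 'S3'), ('S3', 'a36', 0, 'S6')],
]
for episode in episodes:
    for s, a, r, s_next in episode:
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions_at[s_next])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    Q[('S6', 'null')] = 100  # terminal reward observed at the end of each episode

print(Q[('S3', 'a36')])  # 50.0, matching the table above
```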
21
Q-Learning: Impact of α
With α = 1 each new sample completely overwrites the old estimate, while a smaller α averages over many samples, so later estimates become more stable
Implication: let α decrease over time
22
Q-Learning Example
Grid world example
R(s) = 0 for non-terminal states, γ = 0.9, α(N(s)) = 1/N(s)
Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
Update the counter N(s) before computing α
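A small sketch of this update with the visit-count learning rate, mirroring the rule above that N(s) is incremented before α is computed; the table and function names are illustrative only.

```python
from collections import defaultdict

visit_count = defaultdict(int)  # N(s)

def q_update_counted(Q, s, a, r, s_next, next_actions, gamma=0.9):
    """One Q-learning update with learning rate α(N(s)) = 1/N(s)."""
    visit_count[s] += 1            # update the counter before computing α
    alpha = 1.0 / visit_count[s]
    target = r + gamma * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q
```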
23
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
Start with all Q-values being 0
Follow a uniform random strategy to select actions
Trial 1: no meaningful update until reaching a terminal state, since the reward is 0 for non-terminal states
Say the trial is (1,1) → (1,2) → (1,3) → (2,3) → (3,3)
Update Q((3,3), null) = 1
24
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
After trial 1: [grid of Q-values; only Q((3,3), null) = 1 is non-zero]
25
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
Trial 2: say the trial is (1,1) → (1,2) → (2,2) → (1,2) → (1,3) → (2,3) → (3,3)
No meaningful update except for (2,3)
Update Q((2,3), south) ← (1 − 1/2) · 0 + (1/2)(0 + 0.9 · 1) = 0.45
26
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
After trial 2: [grid of Q-values; Q((3,3), null) = 1 and Q((2,3), south) = 0.45 are the only non-zero entries]
27
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
Trial 3: start the trial (1,1) → (2,1) → (1,1) → (1,2) → (1,3)
Update Q((1,3), south) ← (1 − 1/3) · 0 + (1/3)(0 + 0.9 · 0.45) = 0.135
Continue the trial → (2,3) → (3,3)
Update Q((2,3), south) = 0.45 + (1/3)(0 + 0.9 · 1 − 0.45) = 0.6
28
Q-Learning Example
Recall the update Q*(s,a) ← (1 − α(N(s))) Q*(s,a) + α(N(s)) (r + γ max_{a′} Q*(s′,a′))
After trial 3: [grid of Q-values; Q((3,3), null) = 1, Q((2,3), south) = 0.6, and Q((1,3), south) = 0.135 are the only non-zero entries]
29
Q-Learning Properties
If acting randomly, Q-learning converges to the optimal state-action values, and therefore also finds the optimal policy
Off-policy learning: the agent can act in one way but learn the values of another policy (the optimal one!)
Acting randomly is sufficient, but not necessary, to learn the optimal values and policy
30
Quiz 1
Is the following algorithm guaranteed to learn the optimal policy? A: Yes B: No C: Not sure
Some Algorithm:
Initialize Q*(s,a) arbitrarily and Q*(terminal state s, null) = 0
Repeat (for each episode):
  Initialize state s
  Repeat (for each step of the episode):
    Choose action a = argmax_{a′} Q*(s,a′)
    Take action a, observe reward r and next state s′
    Update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a′} Q*(s′,a′))
    s ← s′
  Until s is a terminal state
  Observe reward r of the terminal state
  Update Q*(s, null) = r
Until K episodes/trials are run
Return policy π*(s) = argmax_a Q*(s,a)
31
Outline
Active RL
Model-based Active RL with random actions
Q-Value
Model-free Active RL: SARSA, Q-learning
Exploration vs Exploitation: ε-Greedy, Boltzmann policy
32
Exploration vs Exploitation
33
Simple Approach: ε-Greedy
With probability 1−ε: choose action a = argmax_{a′} Q*(s,a′)
With probability ε: select a random action
For Q-learning: guaranteed to compute the optimal policy π* based on Q*(s,a) given enough samples, for any ε > 0
However, the policy the agent is following is never the same as π* (because it selects a random action with probability ε)
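A minimal sketch of the ε-greedy selection rule, assuming the tabular Q is a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability ε pick a uniformly random action, otherwise the greedy
    action argmax_a Q(s,a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```

This can be plugged into the earlier Q-learning or SARSA sketches, e.g. behavior_policy = lambda Q, s, acts: epsilon_greedy(Q, s, acts, 0.1).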
34
Simple Approach: ε-Greedy
With probability 1−ε: choose action a = argmax_{a′} Q*(s,a′)
With probability ε: select a random action
For SARSA: with a fixed ε > 0, it may not converge to the optimal policy π* even given enough samples
35
Greedy in Limit of Infinite Exploration (GLIE)
ε-Greedy with ε decayed over time, e.g., ε = 1/N(s)
Advantage: eventually the agent will be following the optimal policy almost all the time
SARSA can then converge to the optimal policy given enough samples
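One way to realize the decayed schedule ε(s) = 1/N(s); keeping the counter in a module-level table is just an illustrative choice.

```python
from collections import defaultdict

state_visits = defaultdict(int)  # N(s)

def glie_epsilon(s):
    """GLIE-style schedule: ε(s) = 1/N(s), so exploration decays in frequently
    visited states while every action is still tried infinitely often in the limit."""
    state_visits[s] += 1
    return 1.0 / state_visits[s]
```

Used together with the ε-greedy rule above: a = epsilon_greedy(Q, s, actions, glie_epsilon(s)).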
36
Impact of ε
Random terminal state utility sampled from [0,1]
37
Impact of ε
38
Boltzmann Policy
Choose action a with probability p(a|s)
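The exact form of p(a|s) is not spelled out in this transcript; a common choice, assumed in the sketch below, is a softmax over Q-values with a temperature τ, i.e. p(a|s) ∝ exp(Q(s,a)/τ), where a large τ behaves almost uniformly at random and a small τ behaves almost greedily.

```python
import math
import random

def boltzmann_action(Q, s, actions, tau):
    """Sample an action with probability proportional to exp(Q(s,a)/τ);
    τ is an assumed temperature parameter."""
    prefs = [Q.get((s, a), 0.0) / tau for a in actions]
    m = max(prefs)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```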
39
Quiz 2
If we want the agent to eventually follow the optimal policy almost all the time when using the Boltzmann policy to sample actions, how should the temperature τ change as learning progresses? A: Increase B: Decrease C: No change
40
Summary
Reinforcement Learning (RL)
Active RL
Model-based Active RL: estimate T and R through sampling
Model-free Active RL: SARSA, Q-Learning (with some exploratory policy)
41
SARSA vs Q-Learning
SARSA (on-policy): Q(s,a) ← (1−α) Q(s,a) + α(r + γ Q(s′, a′)), where a′ is the action actually chosen at s′
Q-Learning (off-policy): Q(s,a) ← (1−α) Q(s,a) + α(r + γ max_{a′} Q(s′, a′))
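Written side by side as one-step update functions, the only difference is the bootstrap term (a sketch; the argument conventions are assumptions):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy target: bootstrap from the action a_next actually taken at s'."""
    return (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Off-policy target: bootstrap from the best action available at s'."""
    best_next = max(Q[(s_next, a2)] for a2 in next_actions)
    return (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```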
42
Acknowledgment
Some slides are borrowed from previous slides made by Tai Sing Lee and Zico Kolter, and some examples are borrowed from Meg Aycinena and Emma Brunskill
43
Other Resources
http://courses.csail.mit.edu/6.825/fall05/rl_lecture/rl_examples.pdf
s16/www/slides/rl.pdf
44
Backup Slides
45
Terminal States and Reward
What is a terminal state, and when does the agent get a reward?
You will see different formulations and different definitions of terminal states and rewards
In some formulations, a terminal state cannot have a reward
In some formulations, a terminal state has a reward
In some formulations, the agent gets the reward R(s) every time it takes an action from state s
In some formulations, the agent gets the reward R(s′) every time it takes an action from state s and ends up at state s′
46
Terminal States and Reward
In this lecture, we use the following formulation or interpretation:
Each state has a reward R(s) or R(s,a) or R(s,a,s′), including the terminal state. For a terminal state s, the only available action is "null", and the s′ can only be "null" (or "exited")
If at time t the agent is at state s, takes action a, observes that it ends up at state s′, and gets a reward of R(s) or R(s,a) or R(s,a,s′), then the time counter increments by 1, i.e., now it is time t+1 and the agent can take an action starting from state s′
Why does this matter? For example, in Q-learning you need to set the Q-value of a terminal state s to be the reward you observe when you take the "null" action
When you read other books or exercise questions, pay attention to what their formulation is
47
Terminal States and Reward
How do we reduce a game in our formulation to the alternative formulations?
We call the original game in this formulation game G, with reward function R, and we will create a new game G′ in the alternative formulation, with reward function R′
48
Terminal States and Reward
How do we reduce a game in our formulation to the alternative formulation in which a terminal state cannot have a reward?
Option 1: Create a new game G′ that has a new absorbing state; all terminal states of the original game are linked to this absorbing state through action "null", only the absorbing state is a terminal state, and it has reward 0
Why does this matter? For example, in Q-learning for game G′, you need to set the Q-value of the terminal state (the absorbing state) to 0 before doing any Q-value updates, and never update the Q-value of the terminal state
49
Terminal States and Reward
How do we reduce a game in our formulation to the alternative formulation in which a terminal state cannot have a reward?
Option 2: Create a new game G′ whose reward function has the form R′(s,a,s′); when the agent takes an action a from state s and reaches terminal state s′, the reward is R′(s,a,s′) = R(s) + R(s′)
50
Terminal States and Reward
How do we reduce a game in our formulation to the alternative formulation in which the agent gets the reward R(s′) every time it takes an action from state s and ends up at state s′?
Create a new game G′ that has a new starting state, linked to all possible starting states through action "null" with uniform random transition probability
Why does this matter? For example, in Q-learning for game G′, you need to set the Q-value of a terminal state to 0 before doing any Q-value updates, and never update the Q-value of the terminal state