1 Artificial Intelligence: Representation and Problem Solving / 681
Sequential Decision Making (4): Active Reinforcement Learning
Instructors: Fei Fang (This Lecture) and Dave Touretzky
Wean Hall 4126, 12/8/2018

2 Recap
MDP: (S, A, P, R): we know exactly how the world works
  Policy π(s): S → A if the policy is deterministic
  Find the optimal policy: value iteration or policy iteration
RL: (S, A, ?, ?): we don't know how the world works
  Passive RL: evaluate a given policy π
    Model-based approach: estimate P and R from sample trials
    Model-free approach: direct utility estimation, TD learning

3 Outline
Active RL
  Model-based Active RL with random actions
  Q-Value
  Model-free Active RL: SARSA, Q-learning
  Exploration vs Exploitation: ε-Greedy, Boltzmann policy

4 Model-Based Active RL with Random Actions
Choose actions randomly
Estimate P and R from sample trials (average the counts)
Use the estimated P and R to compute an estimate of the optimal values and the optimal policy
Will the computed values and policy converge to the true optimal values and policy in the limit of infinite data?
Sufficient condition: all states are reachable from any other state, so you are able to visit each state and take each action as many times as you want
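A minimal sketch of this model-based approach in Python, assuming a hypothetical environment interface (env.reset(), env.step(a) returning (next state, reward, done), env.actions(s)); the container layout and function names are mine, not from the slides:

    import random
    from collections import defaultdict

    def estimate_model(env, num_episodes=1000):
        # Count transitions and average rewards while acting randomly
        counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = visits
        reward_sum, reward_n = defaultdict(float), defaultdict(int)
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                a = random.choice(env.actions(s))        # random action selection
                s2, r, done = env.step(a)
                counts[(s, a)][s2] += 1
                reward_sum[s] += r                       # reward credited to the state acted from
                reward_n[s] += 1
                s = s2
        P = {sa: {s2: n / sum(d.values()) for s2, n in d.items()}
             for sa, d in counts.items()}                # empirical transition probabilities
        R = {s: reward_sum[s] / reward_n[s] for s in reward_n}   # empirical rewards
        return P, R

    def value_iteration(P, R, states, actions, gamma=0.9, iters=100):
        # Solve the estimated MDP; unvisited (s, a) pairs default to no successors
        U = {s: 0.0 for s in states}
        for _ in range(iters):
            U = {s: R.get(s, 0.0) + gamma * max(
                     (sum(p * U[s2] for s2, p in P.get((s, a), {}).items())
                      for a in actions(s)), default=0.0)
                 for s in states}
        return U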

5 Outline
Active RL
  Model-based Active RL with random actions
  Q-Value
  Model-free Active RL: SARSA, Q-learning
  Exploration vs Exploitation: ε-Greedy, Boltzmann policy

6 Q-Value
Recall the Q-value: similar to the value function, but defined on a state-action pair
Q^π(s,a): expected total reward from state s onward if taking action a in state s and following policy π afterward
Bellman equation given policy π:
  U^π(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) U^π(s')
Bellman optimality condition, i.e., for the optimal policy π*:
  U*(s) = R(s) + γ max_a Σ_{s'} P(s'|s, a) U*(s')
That is,
  Q^π(s,a) = R(s) + γ Σ_{s'} P(s'|s, a) U^π(s')
Clearly U^π(s) = Q^π(s, π(s)), so
  Q^π(s,a) = R(s) + γ Σ_{s'} P(s'|s, a) Q^π(s', π(s'))

7 Optimal Q-Value
Recall U^π(s) = R(s) + γ Σ_{s'} P(s'|s, π(s)) U^π(s'),
Q^π(s,a) = R(s) + γ Σ_{s'} P(s'|s, a) U^π(s'), and U^π(s) = Q^π(s, π(s))
When using the optimal policy π*, we take the action that leads to the maximum total utility at each state. Therefore
  π*(s) = argmax_a Q*(s,a)
(or any probability distribution over the actions with the highest Q*(s,a) if there is a tie)
We have U*(s) = Q*(s, π*(s)) = max_a Q*(s,a), and
  Q*(s,a) = R(s) + γ Σ_{s'} P(s'|s, a) U*(s')
          = R(s) + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a')
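When the model (P, R) is known, the optimality equation above can be iterated directly over state-action pairs. A minimal sketch, assuming P[(s, a)] is a dict {s': probability}, R[s] is a scalar, and a terminal state's only action is "null" with no successors (i.e., P[(s, "null")] = {}); all of this layout is my own, not from the slides:

    def q_value_iteration(states, actions, P, R, gamma=0.9, iters=100):
        # Iterate Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q = {(s, a): 0.0 for s in states for a in actions(s)}
        for _ in range(iters):
            Q = {(s, a): R[s] + gamma * sum(
                     p * max(Q[(s2, a2)] for a2 in actions(s2))
                     for s2, p in P[(s, a)].items())
                 for s in states for a in actions(s)}
        return Q

    def greedy_policy(Q, states, actions):
        # pi*(s) = argmax_a Q*(s, a)
        return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}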

8 Outline
Active RL
  Model-based Active RL with random actions
  Q-Value
  Model-free Active RL: SARSA, Q-learning
  Exploration vs Exploitation: ε-Greedy, Boltzmann policy

9 SARSA
Recall TD learning: given π, estimate U^π(s) through the update
  U(s) ← (1−α) U(s) + α(r + γ U(s'))
SARSA (State-Action-Reward-State-Action):
  Initialize the policy π
  Given π, estimate Q^π(s,a) through the update
    Q^π(s,a) ← (1−α) Q^π(s,a) + α(r + γ Q^π(s', π(s')))
  Update the policy based on Q^π(s,a) (exploitation vs exploration); with full exploitation,
    π'(s) = argmax_a Q^π(s,a)
  Repeat until π' converges (is this guaranteed?)
Similar to policy iteration
On-policy algorithm, i.e., it estimates the value of a policy while following that policy to choose actions
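A minimal SARSA sketch in Python, assuming the same hypothetical environment interface as before (env.reset(), env.step(a) returning (next state, reward, done), env.actions(s)) and ε-greedy exploration (covered later in this lecture). One simplification here is mine, not the slides': the terminal reward is folded into the last transition's r instead of being stored as Q(terminal, null):

    import random
    from collections import defaultdict

    def sarsa(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                       # Q[(s, a)], initialized to 0

        def policy(s):                               # epsilon-greedy w.r.t. the current Q
            if random.random() < epsilon:
                return random.choice(env.actions(s))
            return max(env.actions(s), key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            a = policy(s)
            while not done:
                s2, r, done = env.step(a)
                if done:
                    target = r                       # no successor action at a terminal state
                else:
                    a2 = policy(s2)                  # the action actually taken next: "SARSA"
                    target = r + gamma * Q[(s2, a2)]
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
                if not done:
                    s, a = s2, a2
        return Q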

10 On-Policy vs Off-Policy Methods
Two types of RL approaches:
  On-policy methods attempt to evaluate or improve the policy that is used to make decisions
  Off-policy methods evaluate or improve a policy different from the one used to generate the data

11 Outline
Active RL
  Model-based Active RL with random actions
  Q-Value
  Model-free Active RL: SARSA, Q-learning
  Exploration vs Exploitation: ε-Greedy, Boltzmann policy

12 Q-Learning
Recall Q*(s,a) = R(s) + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a')
Q-Learning: similar to value iteration
Directly estimate Q*(s,a) through the update
  Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a'))
Given the estimated Q*(s,a), derive the estimate of the optimal policy
  π*(s) = argmax_a Q*(s,a)

13 Q-Learning
Recall the update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a'))
Q-Learning
  Initialize Q*(s,a) arbitrarily and Q*(terminal state s, null) = 0
  Repeat (for each episode):
    Initialize state s
    Repeat (for each step of the episode):
      Choose an action a that is available at s following some exploration policy
      Take action a, observe reward r and next state s'
      Update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a'))
      s ← s'
    Until s is a terminal state
    Observe the reward r of the terminal state
    Update Q*(s, null) = r
  Until K episodes/trials are run
  Return the policy π*(s) = argmax_a Q*(s,a)
Off-policy algorithm: the policy being evaluated (estimation policy) can be unrelated to the policy being followed (behavior policy)
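A minimal tabular Q-learning sketch following the formulation above, where the terminal state's Q-value is set directly to its observed reward. The environment interface is the same hypothetical one as before, plus an assumed env.reward(s) accessor for a state's reward and the convention that env.actions(terminal) returns ["null"]:

    import random
    from collections import defaultdict

    def q_learning(env, episodes=5000, alpha=0.5, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                          # Q[(s, a)], arbitrary init (0 here)

        def behavior_policy(s):                         # any sufficiently exploratory policy works
            if random.random() < epsilon:
                return random.choice(env.actions(s))
            return max(env.actions(s), key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = behavior_policy(s)
                s2, r, done = env.step(a)               # r is the reward for acting from s
                best_next = max(Q[(s2, a2)] for a2 in env.actions(s2))
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
                s = s2
            Q[(s, "null")] = env.reward(s)              # slide 13: Q(terminal, null) = observed r
        return Q

    def greedy_policy(Q, env, states):
        # pi*(s) = argmax_a Q*(s, a)
        return {s: max(env.actions(s), key=lambda a: Q[(s, a)]) for s in states}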

14 Q-Learning Example
State diagram: S1, S2, S3, S4, S5, S6 (END), connected by actions a12, a14, a21, a23, a25, a32, a36, a41, a45, a52, a54, a56
6 states S1, ..., S6 and 12 actions a_ij, where a_ij moves from Si to Sj
Deterministic state transitions (but you don't know this beforehand)
R = 100 in S6, R = 0 otherwise (again, you don't know this)
Use γ = 0.5, α = 1
Random behavior policy
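The example environment written out as plain Python tables (the state and action names are from the slide; the variable names and layout are mine), for use in the replay sketch after slide 20:

    # Deterministic transitions: action a_ij moves from Si to Sj; S6 is terminal
    TRANSITIONS = {
        ("S1", "a12"): "S2", ("S1", "a14"): "S4",
        ("S2", "a21"): "S1", ("S2", "a23"): "S3", ("S2", "a25"): "S5",
        ("S3", "a32"): "S2", ("S3", "a36"): "S6",
        ("S4", "a41"): "S1", ("S4", "a45"): "S5",
        ("S5", "a52"): "S2", ("S5", "a54"): "S4", ("S5", "a56"): "S6",
        ("S6", "null"): None,                      # terminal: only the "null" action
    }
    REWARDS = {"S1": 0, "S2": 0, "S3": 0, "S4": 0, "S5": 0, "S6": 100}
    GAMMA, ALPHA = 0.5, 1.0

    def actions(s):
        return [a for (s0, a) in TRANSITIONS if s0 == s]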

15 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
Q-table entries: Q(S1,a12), Q(S1,a14), Q(S2,a21), Q(S2,a23), Q(S2,a25), Q(S3,a32), Q(S3,a36), Q(S4,a41), Q(S4,a45), Q(S5,a52), Q(S5,a54), Q(S5,a56), Q(S6,null); all start at 0
Start at S1; available actions: a12, a14, each chosen with probability 0.5
Choose a12; get reward 0; move to state S2
Update the state-action value:
  Q*(S1, a12) ← (1−α) Q*(S1, a12) + α(r + γ max_{a' ∈ {a21, a23, a25}} Q*(S2, a')) = 0

16 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
At S2, available actions: a21, a23, a25, each chosen with probability 1/3
Choose a23; get reward 0; move to state S3
Update: Q*(S2, a23) ← (1−α) Q*(S2, a23) + α(r + γ max_{a' ∈ {a32, a36}} Q*(S3, a')) = 0

17 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
At S3, available actions: a32, a36, each chosen with probability 0.5
Choose a36; get reward 0; move to state S6
Update: Q*(S3, a36) ← (1−α) Q*(S3, a36) + α(r + γ max_{a' ∈ {null}} Q*(S6, a')) = 0

18 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
S6 is a terminal state: observe reward 100, so Q*(S6, null) ← 100
Q-table now: all entries 0 except Q(S6, null) = 100

19 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
Start a new episode! Start at S2; available actions: a21, a23, a25, each chosen with probability 1/3
Choose a23; get reward 0; move to state S3
Update: Q*(S2, a23) ← (1−α) Q*(S2, a23) + α(r + γ max_{a' ∈ {a32, a36}} Q*(S3, a')) = 0

20 Q-Learning Example
Update rule: Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a')); R = 100 in S6, γ = 0.5, α = 1
At S3, available actions: a32, a36, each chosen with probability 0.5
Choose a36; get reward 0; move to state S6
Update: Q*(S3, a36) ← (1−α) Q*(S3, a36) + α(r + γ max_{a' ∈ {null}} Q*(S6, a')) = 50
Q-table now: Q(S3, a36) = 50, Q(S6, null) = 100, all other entries 0
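A short replay of the two episodes walked through on slides 15-20, using the TRANSITIONS/REWARDS tables sketched after slide 14, to check the values shown above (everything stays 0 in episode 1 except Q(S6, null) = 100, then Q(S3, a36) becomes 50 in episode 2):

    from collections import defaultdict

    Q = defaultdict(float)

    def update(s, a):
        s2 = TRANSITIONS[(s, a)]
        r = REWARDS[s]                                   # reward of the state acted from
        best_next = max(Q[(s2, a2)] for a2 in actions(s2))
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)

    # Episode 1: S1 -a12-> S2 -a23-> S3 -a36-> S6 (terminal)
    for s, a in [("S1", "a12"), ("S2", "a23"), ("S3", "a36")]:
        update(s, a)
    Q[("S6", "null")] = REWARDS["S6"]                    # terminal: Q(S6, null) = 100

    # Episode 2: S2 -a23-> S3 -a36-> S6 (terminal)
    for s, a in [("S2", "a23"), ("S3", "a36")]:
        update(s, a)

    print(Q[("S2", "a23")])   # 0.0: Q(S3, .) was still 0 when S2 was updated
    print(Q[("S3", "a36")])   # 50.0 = 0 + 0.5 * Q(S6, null)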

21 Q-Learning: Impact of α
Implication: let α decrease over time

22 Q-Learning Example
Grid world example
R(s) = 0 for non-terminal states, γ = 0.9, α(N(s)) = 1/N(s)
Update: Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
Update the counter N(s) before computing α
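A small helper implementing this count-based learning rate, α(N(s)) = 1/N(s), with the counter incremented before α is computed; the function and container names are mine:

    from collections import defaultdict

    N = defaultdict(int)                      # N[s] = number of visits to s

    def alpha(s):
        N[s] += 1                             # update the counter before computing alpha
        return 1.0 / N[s]

    def q_update(Q, s, a, r, s2, next_actions, gamma=0.9):
        a_n = alpha(s)
        best_next = max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
        Q[(s, a)] = (1 - a_n) * Q[(s, a)] + a_n * (r + gamma * best_next)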

23 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
Start with all Q-values being 0
Follow a uniform random strategy to select actions
Trial 1: no meaningful update until reaching a terminal state, since the reward is 0 for non-terminal states
Say the trial is (1,1) → (1,2) → (1,3) → (2,3) → (3,3)
Update Q((3,3), null) = 1

24 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
After trial 1: (Q-table shown on slide: all entries 0 except Q((3,3), null) = 1)

25 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
Trial 2: say the trial is (1,1) → (1,2) → (2,2) → (1,2) → (1,3) → (2,3) → (3,3)
No meaningful update except for (2,3)
Update Q((2,3), South) ← 0 + 1/2 (0 + 0.9(1) − 0) = 0.45 (second visit to (2,3), so α = 1/2)

26 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
After trial 2: (Q-table shown on slide: Q((2,3), South) = 0.45, Q((3,3), null) = 1, all other entries 0)

27 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
Trial 3: start the trial (1,1) → (2,1) → (1,1) → (1,2) → (1,3)
Update Q((1,3), South) ← 0 + 1/3 (0 + 0.9(0.45) − 0) = 0.135 (third visit to (1,3), so α = 1/3)
Continue the trial → (2,3) → (3,3)
Update Q((2,3), South) = 0.45 + 1/3 (0 + 0.9(1) − 0.45) = 0.6
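A quick arithmetic check of the three hand-computed updates above (γ = 0.9, α = 1/N(s), Q-values start at 0, Q((3,3), null) = 1):

    gamma = 0.9

    # Trial 2: (2,3) is visited for the 2nd time, so alpha = 1/2
    q_23 = 0 + (1 / 2) * (0 + gamma * 1 - 0)
    print(round(q_23, 3))          # 0.45

    # Trial 3: (1,3) is visited for the 3rd time, alpha = 1/3; max_a Q((2,3), a) = 0.45
    q_13 = 0 + (1 / 3) * (0 + gamma * 0.45 - 0)
    print(round(q_13, 3))          # 0.135

    # Trial 3 (continued): (2,3) is visited for the 3rd time, alpha = 1/3
    q_23 = 0.45 + (1 / 3) * (0 + gamma * 1 - 0.45)
    print(round(q_23, 3))          # 0.6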

28 Q-Learning Example
Recall the update Q*(s,a) ← (1−α(N(s))) Q*(s,a) + α(N(s))(r + γ max_{a'} Q*(s', a'))
After trial 3: (Q-table shown on slide: Q((1,3), South) = 0.135, Q((2,3), South) = 0.6, Q((3,3), null) = 1, all other entries 0)

29 Q-Learning Properties
If acting randomly, Q-learning converges to the optimal state-action values, and therefore also finds the optimal policy (given a suitably decaying learning rate and every state-action pair being tried infinitely often)
Off-policy learning
  Can act in one way
  But learn the values of another policy (the optimal one!)
Acting randomly is sufficient, but not necessary, to learn the optimal values and policy

30 Quiz 1
Is the following algorithm guaranteed to learn the optimal policy?
A: Yes  B: No  C: Not sure
Some Algorithm
  Initialize Q*(s,a) arbitrarily and Q*(terminal state s, none) = 0
  Repeat (for each episode):
    Initialize state s
    Repeat (for each step of the episode):
      Choose action a = argmax_{a'} Q*(s, a')
      Take action a, observe reward r and next state s'
      Update Q*(s,a) ← (1−α) Q*(s,a) + α(r + γ max_{a'} Q*(s', a'))
      s ← s'
    Until s is a terminal state
    Observe the reward r of the terminal state
    Update Q*(s, none) = r
  Until K episodes/trials are run
  Return the policy π*(s) = argmax_a Q*(s,a)

31 Outline
Active RL
  Model-based Active RL with random actions
  Q-Value
  Model-free Active RL: SARSA, Q-learning
  Exploration vs Exploitation: ε-Greedy, Boltzmann policy

32 Exploration vs Exploitation

33 Simple Approach: ε-Greedy
With probability 1−ε: choose action a = argmax_{a'} Q*(s, a')
With probability ε: select a random action
For Q-learning: guaranteed to compute the optimal policy π* based on Q*(s,a) given enough samples, for any ε > 0
However, the policy the agent is following is never the same as π* (because it selects a random action with probability ε)
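A minimal ε-greedy selection helper, assuming Q is a dict keyed by (s, a) and actions(s) lists the actions available at s (both containers are my own, as in the earlier sketches):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(actions(s))                      # explore
        return max(actions(s), key=lambda a: Q.get((s, a), 0.0))  # exploit: argmax_a Q(s, a)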

34 Simple Approach: ε-Greedy
With probability 1−ε: choose action a = argmax_{a'} Q*(s, a')
With probability ε: select a random action
For SARSA: with a fixed ε > 0, it may not converge to the optimal policy π* even given enough samples

35 Greedy in the Limit of Infinite Exploration (GLIE)
ε-Greedy with ε decayed over time, e.g., ε = 1/N(s)
Advantage: eventually the agent will be following the optimal policy almost all the time
SARSA can converge to the optimal policy given enough samples

36 Impact of ε
Random terminal-state utility sampled from [0,1] (plots shown on slide)

37 Impact of ε (plots shown on slide)

38 Boltzmann Policy
Choose action a with probability P(a|s) = exp(Q(s,a)/τ) / Σ_{a'} exp(Q(s,a')/τ), where τ > 0 is the temperature
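A minimal softmax/Boltzmann selection sketch with temperature τ: small τ is nearly greedy, large τ is nearly uniform. Q and actions(s) are the same containers as in the earlier sketches:

    import math
    import random

    def boltzmann(Q, s, actions, tau=1.0):
        acts = actions(s)
        qs = [Q.get((s, a), 0.0) for a in acts]
        m = max(qs)                                   # subtract the max for numerical stability
        weights = [math.exp((q - m) / tau) for q in qs]
        return random.choices(acts, weights=weights, k=1)[0]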

39 Quiz 2
If we want the agent to eventually follow the optimal policy almost all the time when using the Boltzmann policy to sample actions, how should the value of τ change as learning progresses?
A: Increase  B: Decrease  C: No change

40 Summary
Reinforcement Learning (RL): Active RL
  Model-based Active RL: estimate P and R through sampling
  Model-free Active RL: SARSA, Q-Learning (with some exploratory policy)

41 SARSA vs Q-Learning
SARSA (on-policy): the update target uses Q(s', a') for the action a' actually taken at s'
Q-Learning (off-policy): the update target uses max_{a'} Q(s', a')

42 Acknowledgment
Some slides are borrowed from previous slides made by Tai Sing Lee and Zico Kolter, and some examples are borrowed from Meg Aycinena and Emma Brunskill

43 Other Resources
http://courses.csail.mit.edu/6.825/fall05/rl_lecture/rl_examples.pdf
s16/www/slides/rl.pdf

44 Backup Slides

45 Terminal States and Reward
What is a terminal state, and when does the agent get a reward?
You will see different formulations and different definitions of terminal states and rewards:
  In some formulations, a terminal state cannot have a reward
  In some formulations, a terminal state has a reward
  In some formulations, the agent gets the reward R(s) every time it takes an action from state s
  In some formulations, the agent gets the reward R(s') every time it takes an action from state s and ends up at state s'

46 Terminal States and Reward
In this lecture, we use the following formulation or interpretation:
  Every state has a reward R(s) (or R(s,a), or R(s,a,s')), including terminal states. For a terminal state s, the only available action is "null", and s' can only be "null" (or "exited")
  If at time t the agent is at state s, takes action a, observes that it ends up at state s', and gets a reward of R(s) (or R(s,a), or R(s,a,s')), then the time counter increments by 1, i.e., it is now time t+1 and the agent can take an action starting from state s'
Why does this matter? For example, in Q-learning you need to set the Q-value of a terminal state s to the reward you observe when you take the "null" action
When you read other books or exercise questions, pay attention to what their formulation is

47 Terminal States and Reward
How do we reduce a game in our formulation to the alternative formulations?
We call the original game in this formulation G, with reward function R, and we will create a new game G' in the alternative formulation, with reward function R'

48 Terminal States and Reward
How do we reduce a game in our formulation to an alternative formulation in which a terminal state cannot have a reward?
Option 1: create a new game G' with a new absorbing state s_a. Every terminal state of the original game is linked to s_a through the action "null"; only s_a is a terminal state, and it has reward 0
Why does this matter? For example, in Q-learning for G', you need to set the Q-value of the terminal state (the absorbing state) to 0 before doing any Q-value updates, and never update the Q-value of the terminal state

49 Terminal States and Reward
How do we reduce a game in our formulation to an alternative formulation in which a terminal state cannot have a reward?
Option 2: create a new game G' whose reward function has the form R'(s,a,s'). When the agent takes an action a from state s and reaches a terminal state s', the reward is R'(s,a,s') = R(s) + R(s')

50 Terminal States and Reward
How do we reduce a game in our formulation to an alternative formulation in which the agent gets the reward R(s') every time it takes an action from state s and ends up at state s'?
Create a new game G' with a new starting state s_0, linked to all possible starting states through the action "null", with uniform random transition probability
Why does this matter? For example, in Q-learning for G', you need to set the Q-value of a terminal state s to 0 before doing any Q-value updates, and never update the Q-value of the terminal state

