Reinforcement Learning

1 Reinforcement Learning
Michael Roberts
With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)

2 What is RL?
Trial & error learning
Structure: without model, with model

3 RL vs. Supervised Learning
Evaluative vs. instructional feedback
Role of exploration
On-line performance

4 K-armed Bandit Problem
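[Figure: an agent choosing among actions, each with its own average reward. Sample rewards shown: 0, 0, 5, 10, 35 (average 10) and 5, 10, -15, -15, -10 (average -5); a further value of 100 is also shown.]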

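5 K-armed Bandit Cont.
Greedy exploration
ε-greedy
Softmax
Average reward: $Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
Incremental formula: $Q_{k+1} = Q_k + \alpha \, [\, r_{k+1} - Q_k \,]$, where $\alpha = 1/(k+1)$
Probability of choosing action $a$ (softmax, with temperature $\tau$): $\frac{e^{Q_t(a)/\tau}}{\sum_{b} e^{Q_t(b)/\tau}}$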
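The action-selection rules and the incremental average on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names, the Gaussian reward noise, and the default parameter values are all assumptions made for the example.

```python
import math
import random


def softmax_probs(q_values, tau=1.0):
    """Softmax (Gibbs) action probabilities; tau is the temperature."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise a greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Play a bandit with unit-variance Gaussian rewards (an assumption).

    Returns (value estimates, pull counts) after `steps` pulls.
    """
    rng = random.Random(seed)
    q = [0.0] * len(true_means)  # Q(a): running average reward per arm
    n = [0] * len(true_means)    # how many times each arm was pulled
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        # Incremental average: Q <- Q + alpha * (r - Q) with alpha = 1/(k+1),
        # where k is the number of earlier pulls of this arm (n[a] == k + 1).
        q[a] += (reward - q[a]) / n[a]
    return q, n
```

Calling `run_bandit([10.0, -5.0, 100.0])` plays three arms whose means match the averages in the bandit figure. Setting `epsilon=0.0` gives pure greedy selection, which can lock onto the first arm it tries.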

6 More General Problems
More than one state
Delayed rewards
Markov Decision Process (MDP):
Set of states
Set of actions
Reward function
State transition function
Table or function approximation
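For reference, the state transition function and reward function of a finite MDP can be written as follows. The notation is that of Sutton & Barto (1998); using it here is an assumption, since the slide lists the components without symbols.

```latex
\begin{align*}
\mathcal{P}^{a}_{ss'} &= \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\} \\
\mathcal{R}^{a}_{ss'} &= E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}
\end{align*}
```

With a small state set, both quantities can be stored in a table indexed by $(s, a, s')$. Function approximation replaces the table when the state set is too large to enumerate.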


8 Example: Recycling Robot

9 Recycling Robot: Transition Graph
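The transition graph of the recycling robot can be encoded as a lookup table. The sketch below follows the example as given in Sutton & Barto (1998): states `high` and `low`, actions `search`, `wait`, and `recharge`, and a reward of -3 when the battery runs flat. The numeric values chosen for `ALPHA`, `BETA`, `R_SEARCH`, and `R_WAIT` are illustrative assumptions, not values from the slides.

```python
# Illustrative parameter values (assumptions, not from the slides).
ALPHA = 0.9      # P(battery stays high | search with high battery)
BETA = 0.6       # P(battery stays low  | search with low battery)
R_SEARCH = 2.0   # expected reward while searching
R_WAIT = 1.0     # expected reward while waiting
R_RESCUE = -3.0  # reward when the battery runs flat and the robot is rescued

# TRANSITIONS[(state, action)] -> list of (probability, next_state, reward)
TRANSITIONS = {
    ("high", "search"): [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"): [(1.0, "high", R_WAIT)],
    ("low", "search"): [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
    ("low", "wait"): [(1.0, "low", R_WAIT)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}


def actions(state):
    """Actions available in a state (recharging is only offered when low)."""
    return sorted(a for (s, a) in TRANSITIONS if s == state)


def expected_reward(state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in TRANSITIONS[(state, action)])
```

Each edge of the transition graph corresponds to one `(probability, next_state, reward)` triple, so the table can be checked against the graph edge by edge.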

10 Dynamic Programming
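Dynamic programming methods in Sutton & Barto (1998) are built on the Bellman equation for the value of a policy $\pi$. It is reproduced here in that book's notation as a reference point; the choice of notation is an assumption about what the slide showed.

```latex
V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{ss'}
             \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```

Here $\pi(s, a)$ is the probability of taking action $a$ in state $s$, and $\gamma$ is the discount rate. Solving this system requires a model, that is, known $\mathcal{P}$ and $\mathcal{R}$.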

11 Backup Diagram
[Figure: backup diagram with branch probabilities .25, .25, .25, .4, .6, .7, .3, .5, .5 and rewards 10, 5, 200, 200, -10.]

12 Dynamic Programming: Optimal Policy

13 Backup for Optimal Policy

14 Performance Metrics
Eventual convergence to optimality
Speed of convergence to optimality
Regret (Kaelbling, Littman, & Moore, 1996)

15 Gridworld Example

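16 Initialize V arbitrarily, e.g. $V(s) = 0$, for all $s \in S^+$
Repeat
    $\Delta \leftarrow 0$
    For each $s \in S$:
        $v \leftarrow V(s)$
        $V(s) \leftarrow \max_a \sum_{s'} \mathcal{P}^{a}_{ss'} [\, \mathcal{R}^{a}_{ss'} + \gamma V(s') \,]$
        $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)
Output a deterministic policy, $\pi$, such that:
    $\pi(s) = \arg\max_a \sum_{s'} \mathcal{P}^{a}_{ss'} [\, \mathcal{R}^{a}_{ss'} + \gamma V(s') \,]$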
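The procedure on this slide matches value iteration as presented in Sutton & Barto (1998). A minimal Python sketch follows, run on a 4x4 gridworld with two terminal corners and a reward of -1 per move. The grid layout, the function names, and the `step(s, a)` interface are assumptions made for illustration and may differ from the gridworld shown in the lecture.

```python
def value_iteration(states, actions, step, gamma=1.0, theta=1e-6):
    """Tabular value iteration.

    actions(s) -> list of actions (empty for terminal states)
    step(s, a) -> list of (probability, next_state, reward)
    Returns (V, policy) with a greedy deterministic policy.
    """
    V = {s: 0.0 for s in states}  # initialize V arbitrarily (here: zeros)

    def backup(s, a):
        # Expected one-step return of action a in state s under current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in step(s, a))

    while True:
        delta = 0.0
        for s in states:
            acts = actions(s)
            if not acts:  # terminal state: value stays at 0
                continue
            v = V[s]
            V[s] = max(backup(s, a) for a in acts)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:  # theta: a small positive number
            break

    policy = {s: max(actions(s), key=lambda a: backup(s, a))
              for s in states if actions(s)}
    return V, policy


# Hypothetical 4x4 gridworld: terminal corners, reward -1 on every move.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(N) for c in range(N)]


def grid_actions(s):
    return [] if s in TERMINALS else list(MOVES)


def grid_step(s, a):
    dr, dc = MOVES[a]
    r, c = s[0] + dr, s[1] + dc
    if not (0 <= r < N and 0 <= c < N):
        r, c = s  # moving off the grid leaves the state unchanged
    return [(1.0, (r, c), -1.0)]


V, policy = value_iteration(STATES, grid_actions, grid_step)
```

In this grid each state's converged value is the negative of its distance to the nearest terminal corner, and the greedy policy moves along a shortest path.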

17 Temporal Difference Learning
RL without a model
Issue of temporal credit assignment
Bootstraps like DP
TD(0): $V(s_t) \leftarrow V(s_t) + \alpha \, [\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \,]$
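A TD(0) update needs only a sampled transition, not a model. The sketch below applies it to a five-state random walk in the style of the example in Sutton & Barto (1998). The environment layout, the initial values of 0.5, and the function names are assumptions for illustration.

```python
import random


def td0_update(V, s, reward, s_next, alpha, gamma):
    """One TD(0) backup: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    Updates V in place and returns the TD error.
    """
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


def random_walk_td0(episodes=100, alpha=0.1, gamma=1.0, seed=0):
    """Estimate state values of a random walk from sampled episodes.

    States 1..5 are non-terminal; 0 and 6 are terminal. Stepping into
    state 6 yields reward 1, every other step yields reward 0.
    """
    rng = random.Random(seed)
    n = 5
    V = [0.0] + [0.5] * n + [0.0]  # terminal values are fixed at 0
    for _ in range(episodes):
        s = 3  # every episode starts in the middle
        while s not in (0, n + 1):
            s_next = s + rng.choice((-1, 1))
            reward = 1.0 if s_next == n + 1 else 0.0
            td0_update(V, s, reward, s_next, alpha, gamma)
            s = s_next
    return V
```

The update bootstraps in the same sense as dynamic programming: the target `reward + gamma * V[s_next]` uses the current estimate of the next state's value instead of waiting for the final outcome of the episode.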

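18 TD Learning
Again, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha \, [\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \,]$
TD(λ): $V(s) \leftarrow V(s) + \alpha \, [\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \,] \, e_t(s)$, for all $s$,
where $e$ is called an eligibility trace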
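One common way to maintain the eligibility trace is the accumulating trace from Sutton & Barto (1998): the trace of the visited state is incremented by 1, and after each update every trace decays by a factor of γλ. The sketch below assumes that variant and a tabular value function stored in a list; the function name and argument order are hypothetical.

```python
def td_lambda_step(V, e, s, reward, s_next, alpha, gamma, lam):
    """One step of tabular TD(lambda) with accumulating traces.

    V and e are lists indexed by state and are updated in place.
    Returns the TD error for this step.
    """
    delta = reward + gamma * V[s_next] - V[s]  # same TD error as in TD(0)
    e[s] += 1.0                                # mark the state just visited
    for x in range(len(V)):
        V[x] += alpha * delta * e[x]           # credit recently visited states
        e[x] *= gamma * lam                    # decay every trace
    return delta


# Usage: a two-step episode 0 -> 1 -> 2, where state 2 is terminal.
V = [0.0, 0.0, 0.0]
e = [0.0, 0.0, 0.0]
td_lambda_step(V, e, 0, 0.0, 1, alpha=0.5, gamma=1.0, lam=0.5)
td_lambda_step(V, e, 1, 1.0, 2, alpha=0.5, gamma=1.0, lam=0.5)
```

In the usage example the reward arrives on the second step, yet state 0 also receives part of the credit through its decayed trace. With `lam=0.0` every trace decays to zero immediately after each update, so only the current state is updated and the step reduces to TD(0).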

19 Backup Diagram for TD(λ)

20 TD-Gammon (Tesauro)

21 Additional Work
POMDPs
Macros
Multi-agent RL
Multiple reward structures

