Reinforcement Learning
Michael Roberts
With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)
What is RL?
Trial-and-error learning
Can be structured without a model (model-free) or with a model (model-based)
RL vs. Supervised Learning
Evaluative vs. instructional feedback
Role of exploration
On-line performance
K-armed Bandit Problem
The agent repeatedly chooses among K actions; each choice returns a reward, and the value of an action is estimated by its average reward.
Example sample rewards:
  Action 1: 0, 0, 5, 10, 35   (average 10)
  Action 2: 5, 10, -15, -15, -10   (average -5)
[Figure: agent selecting among the K actions]
K-armed Bandit Cont.
Action-selection strategies: greedy, ε-greedy, softmax
Average reward, incremental formula: $Q_{k+1} = Q_k + \alpha\,(r_{k+1} - Q_k)$, where $\alpha = 1/(k+1)$
Softmax probability of choosing action a: $P(a) = \dfrac{e^{Q(a)/\tau}}{\sum_{b=1}^{K} e^{Q(b)/\tau}}$, where $\tau$ is a temperature parameter
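A minimal sketch of an ε-greedy agent on a K-armed bandit using the incremental average update above. The reward distributions, means, and parameter values are illustrative assumptions, not taken from the slides.

import random

def run_bandit(true_means, epsilon=0.1, steps=1000):
    """Epsilon-greedy action selection with incremental sample-average estimates."""
    k = len(true_means)
    Q = [0.0] * k          # estimated value of each action
    N = [0] * k            # number of times each action was taken
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit: greedy action
        r = random.gauss(true_means[a], 1.0)         # noisy reward (assumed Gaussian)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental average: alpha = 1/N(a)
        total_reward += r
    return Q, total_reward

# Illustrative 3-armed bandit (hypothetical true means).
estimates, total = run_bandit([10.0, -5.0, 100.0])
print(estimates, total)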
More General Problems
More than one state
Delayed rewards
Markov Decision Process (MDP):
  Set of states
  Set of actions
  Reward function
  State transition function
Represented as a table or with function approximation
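In the notation of Sutton & Barto (1998), the transition and reward components of an MDP can be written as:

$\mathcal{P}^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}, \qquad \mathcal{R}^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}$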
Example: Recycling Robot
Recycling Robot: Transition Graph
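The transition graph itself is a figure; as a sketch, the same information can be written as a table keyed by (state, action), following the standard Sutton & Barto recycling-robot formulation. The parameter values alpha, beta, r_search, and r_wait below are placeholders, not the numbers from the original graph.

# Recycling robot MDP as a transition table: (state, action) -> list of
# (probability, next_state, reward) triples.
alpha, beta = 0.9, 0.6          # prob. battery stays high / stays low while searching (illustrative)
r_search, r_wait = 2.0, 1.0     # expected rewards for searching / waiting (illustrative)

transitions = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # -3: rescue penalty
    ("low",  "wait"):     [(1.0, "low", r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}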
Dynamic Programming
Backup Diagram
[Figure: backup diagram with branch probabilities on the action and transition arcs and rewards at the successor states]
Dynamic Programming: Optimal Policy
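The backup that dynamic programming applies to find the optimal policy is the Bellman optimality equation; in Sutton & Barto's notation:

$V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}\left[\mathcal{R}^{a}_{ss'} + \gamma V^{*}(s')\right], \qquad \pi^{*}(s) = \arg\max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'}\left[\mathcal{R}^{a}_{ss'} + \gamma V^{*}(s')\right]$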
Backup for Optimal Policy
Performance Metrics
Eventual convergence to optimality
Speed of convergence to optimality
Regret
(Kaelbling, L., Littman, M., & Moore, A., 1996)
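Regret can be made precise as the expected reward lost relative to always taking the best action; one common definition (an assumption here, since the slide does not spell one out) is:

$\text{Regret}(T) = T\,\mu^{*} - \sum_{t=1}^{T} E[r_t]$, where $\mu^{*}$ is the expected reward of the best action.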
Gridworld Example
Value Iteration
Initialize V(s) arbitrarily, e.g. V(s) = 0, for all s ∈ S⁺
Repeat
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output a deterministic policy π such that:
  π(s) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V(s') ]
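A runnable sketch of the value-iteration loop above on a small gridworld. The grid size, reward of -1 per step, terminal corner, and discount are illustrative assumptions, not the exact example from the slide.

# Value iteration on a 4x4 gridworld: deterministic moves, -1 reward per step,
# single absorbing goal state in one corner (illustrative layout).
N = 4
GAMMA = 1.0
THETA = 1e-6
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
TERMINAL = (0, 0)

def step(state, action):
    """Deterministic transition: move if possible, otherwise stay; reward -1 per step."""
    if state == TERMINAL:
        return state, 0.0
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return (nr, nc), -1.0

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
while True:
    delta = 0.0
    for s in V:
        v = V[s]
        # Bellman optimality backup: V(s) <- max_a [R + gamma * V(s')]
        V[s] = max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in ACTIONS) if s != TERMINAL else 0.0
        delta = max(delta, abs(v - V[s]))
    if delta < THETA:
        break

# Greedy policy extracted from the converged value function.
policy = {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in V if s != TERMINAL}
print(V[(3, 3)], policy[(3, 3)])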
Temporal Difference Learning
RL without a model
Issue of temporal credit assignment
Bootstraps like DP
TD(0) update: $V(s_t) \leftarrow V(s_t) + \alpha\,[\, r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \,]$
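A minimal sketch of the tabular TD(0) update for policy evaluation. The 5-state random-walk task, step size, and episode count are illustrative assumptions.

import random

def td0_random_walk(episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) on a 5-state random walk (states 1..5, terminals 0 and 6).
    Reward +1 on reaching the right terminal, else 0 (illustrative task)."""
    V = {s: 0.0 for s in range(7)}        # value estimates; terminals stay 0
    for _ in range(episodes):
        s = 3                             # start in the middle state
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

print(td0_random_walk())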
TD Learning
Again, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$, with TD error $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
TD(λ): $V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s)$ for all $s$, with $e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s = s_t]$,
where e is called an eligibility trace
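A sketch of TD(λ) with accumulating eligibility traces, using the same illustrative random-walk task as above (the task and parameter values are assumptions, not from the slides).

import random

def td_lambda_random_walk(episodes=1000, alpha=0.1, gamma=1.0, lam=0.8):
    """Tabular TD(lambda) with accumulating eligibility traces on a 5-state random walk."""
    V = {s: 0.0 for s in range(7)}
    for _ in range(episodes):
        e = {s: 0.0 for s in range(7)}    # eligibility traces, reset each episode
        s = 3
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            delta = r + gamma * V[s_next] - V[s]   # TD error
            e[s] += 1.0                            # accumulate trace for the visited state
            for x in e:                            # update every state in proportion to its trace
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam                # decay traces by gamma*lambda
            s = s_next
    return V

print(td_lambda_random_walk())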
Backup Diagram for TD(λ)
TD-Gammon (Tesauro)
Additional Work
POMDPs
Macros
Multi-agent RL
Multiple reward structures