1
Reinforcement Learning
Michael Roberts. With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998).
2
What is RL?
Trial & error learning
Structure: without a model / with a model
3
RL vs. Supervised Learning
Evaluative vs. instructional feedback
Role of exploration
On-line performance
4
K-armed Bandit Problem
[Figure: the agent selects among k actions; each action's observed rewards determine its average, e.g. rewards 0, 0, 5, 10, 35 average to 10, and rewards 5, 10, −15, −15, −10 average to −5.]
5
K-armed Bandit Cont.
Exploration strategies: greedy, ε-greedy, softmax
Average reward, incremental formula: Q_{k+1}(a) = Q_k(a) + α [ r_{k+1} − Q_k(a) ], where α = 1 / (k + 1)
Softmax probability of choosing action a: P(a) = e^{Q(a)/τ} / Σ_b e^{Q(b)/τ}
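A minimal runnable sketch in Python of the ε-greedy rule with the incremental average update, plus a softmax chooser; the Gaussian reward model, the ε and τ values, and the function names are illustrative assumptions, not from the slides:

import math
import random

def run_bandit(k=10, steps=1000, epsilon=0.1):
    """Epsilon-greedy play on a k-armed bandit with Gaussian arm rewards."""
    true_means = [random.gauss(0, 1) for _ in range(k)]   # hidden reward means
    Q = [0.0] * k    # estimated average reward per action
    n = [0] * k      # pull count per action
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                   # explore uniformly
        else:
            a = max(range(k), key=lambda i: Q[i])     # exploit greedily
        r = random.gauss(true_means[a], 1)            # sample a reward
        n[a] += 1
        Q[a] += (r - Q[a]) / n[a]   # incremental average; the slide's α = 1/(k+1)
        total += r
    return total / steps

def softmax_action(Q, tau=0.5):
    """Choose action a with probability proportional to exp(Q(a) / tau)."""
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights)[0]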
6
More General Problems
More than one state
Delayed rewards
7
Markov Decision Process (MDP)
Set of states
Set of actions
Reward function
State transition function
Table or function approximation
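A minimal sketch of these components in Python, assuming the tabular (table-based) representation; the function-approximation case is not shown, and the type and field names are illustrative:

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    # state transition function as a table: P[s][a] = [(prob, next_state), ...]
    P: Dict[State, Dict[Action, List[Tuple[float, State]]]]
    # reward function: R(s, a, s') -> expected immediate reward
    R: Callable[[State, Action, State], float]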
8
Example: Recycling Robot
9
Recycling Robot: Transition Graph
10
Dynamic Programming
11
Backup Diagram
[Figure: backup diagram; branches carry transition probabilities (.25/.25/.25, .4/.6, .7/.3, .5/.5) and leaf rewards (10, 5, 200, 200, −10, 1000).]
12
Dynamic Programming: Optimal Policy
13
Backup for Optimal Policy
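The backup for the optimal policy is the Bellman optimality update (standard form, Sutton & Barto, 1998):

V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V*(s') ]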
14
Performance Metrics
Eventual convergence to optimality
Speed of convergence to optimality
Regret (Kaelbling, Littman, & Moore, 1996)
15
Gridworld Example
16
Initialize V(s) arbitrarily, e.g. V(s) = 0, for all s ∈ S
Repeat
  Δ ← 0
  For each s ∈ S:
    v ← V(s)
    V(s) ← max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
    Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output a deterministic policy π such that:
π(s) = argmax_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
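A runnable Python sketch of this procedure, reusing the tabular MDP encoding sketched earlier; gamma is the discount factor and theta the stopping threshold (the names are illustrative):

def value_iteration(mdp, gamma=0.9, theta=1e-6):
    """Value iteration over the tabular MDP sketched earlier."""
    V = {s: 0.0 for s in mdp.states}   # initialize V arbitrarily (here: 0)

    def backup(s, a):
        # expected one-step return of taking a in s, then following V
        return sum(p * (mdp.R(s, a, s2) + gamma * V[s2])
                   for p, s2 in mdp.P[s][a])

    while True:
        delta = 0.0
        for s in mdp.states:
            v = V[s]
            V[s] = max(backup(s, a) for a in mdp.P[s])   # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < theta:          # converged: no state moved more than theta
            break

    # greedy (deterministic) policy with respect to the converged values
    policy = {s: max(mdp.P[s], key=lambda a: backup(s, a)) for s in mdp.states}
    return V, policy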
17
Temporal Difference Learning
RL without a model
Raises the issue of temporal credit assignment
Bootstraps like DP
TD(0): V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
18
TD Learning
Again, TD(0): V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
TD(λ): V(x) ← V(x) + α [ r + γ V(s') − V(s) ] e(x), for every state x,
where e is called an eligibility trace; e(x) is incremented when x is visited and decays by γλ at each step
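A tabular Python sketch of TD(λ) with accumulating eligibility traces; with lam = 0 it reduces to the TD(0) rule above. The episode format, step size, and names are illustrative assumptions:

def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply TD(lambda) with accumulating traces to one recorded episode.

    V: dict mapping states to value estimates (updated in place).
    episode: list of (state, reward, next_state) transitions under some policy.
    """
    e = {s: 0.0 for s in V}   # eligibility trace per state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]   # TD error
        e[s] += 1.0                            # bump the trace for the visited state
        for x in V:
            V[x] += alpha * delta * e[x]       # credit flows back along the trace
            e[x] *= gamma * lam                # traces decay between steps
    return V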
19
Backup Diagram for TD(λ)
20
TD-Gammon (Tesauro)
21
Additional Work
POMDPs
Macros
Multi-agent RL
Multiple reward structures