1
Deep Reinforcement Learning
Aaron Schumacher Data Science DC
2
Aaron Schumacher
planspace.org has these slides
6
Plan
applications: what
theory
applications: how
onward
7
applications: what
8
Backgammon
9
Atari
10
Tetris
11
Go
12
theory
13
Yann LeCun’s cake
cake: unsupervised learning
icing: supervised learning
cherry: reinforcement learning
14
unsupervised learning
𝑠
15
state 𝑠: NUMBERS
16
unsupervised (?) learning
given 𝑠, learn 𝑠→cluster_id or learn 𝑠→𝑠
17
unsupervised learning
“deep” = with deep neural nets; 𝑠
18
supervised learning 𝑠→𝑎
19
action 𝑎: NUMBERS, e.g. 17.3, [2.0, 11.7, 5], “cat”/“dog”, 4.2V, “left”/“right”
20
supervised learning: given 𝑠,𝑎, learn 𝑠→𝑎
21
supervised learning: “deep” = with deep neural nets; 𝑠→𝑎
22
reinforcement learning
𝑟,𝑠→𝑎
23
reward 𝑟: a NUMBER, e.g. -3, 1, 7.4
24
optimal control / reinforcement learning
25
𝑟,𝑠→𝑎
26
⏰
27
𝑟′,𝑠′→𝑎′
28
⏰
30
reinforcement learning
“given” 𝑟,𝑠,𝑎, 𝑟′,𝑠′,𝑎′, … learn “good” 𝑠→𝑎
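A minimal sketch of this experience stream: an agent loops through 𝑠, 𝑎, 𝑟′, 𝑠′ against a made-up two-state environment. ToyEnv and the random policy are illustrative assumptions, not from the talk.

```python
import random

class ToyEnv:
    """Hypothetical two-state environment: reach state 1 to get reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves toward the goal, anything else stays put
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

env = ToyEnv()
s = env.reset()
done = False
while not done:
    a = random.choice([0, 1])          # a (not yet "good") policy: s -> a
    s_next, r, done = env.step(a)      # environment returns r', s'
    # a learner would use (s, a, r, s_next) here to improve the policy
    s = s_next
```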
31
Question Can you formulate supervised learning as reinforcement learning?
32
supervised as reinforcement?
reward 0 for incorrect, reward 1 for correct
beyond binary classification?
reward deterministic
next state random
33
Question Why is the Sarsa algorithm called Sarsa?
34
reinforcement learning
𝑟,𝑠→𝑎
35
reinforcement learning
“deep” = with deep neural nets; 𝑟,𝑠→𝑎
36
Markov property 𝑠 is enough
37
                         Make choices? no              Make choices? yes
Completely observable    Markov Chain                  MDP (Markov Decision Process)
Partially observable     HMM (Hidden Markov Model)     POMDP (Partially Observable MDP)
38
usual RL problem elaboration
39
policy 𝜋 𝜋:𝑠→𝑎
40
return 𝑟 into the future (episodic or ongoing)
41
start goal
42
reward 1
43
return 1
44
Question How do we incentivize efficiency? start goal
45
negative reward (-1) at each time step
46
discounted rewards: 𝛾^𝑡 𝑟_𝑡
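A small sketch of how discounted rewards add up into a return; the reward sequence and 𝛾 here are arbitrary, for illustration only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 1]))  # 0.9**2 * 1 = 0.81
```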
47
average rate of reward: mean 𝑟 per time step 𝑡
48
value functions: 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
49
Question Does a value function specify a policy (if you want to maximize return)? 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
50
trick question? 𝑣_𝜋, 𝑞_𝜋
51
Question Does a value function specify a policy (if you want to maximize return)? 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
52
environment dynamics: 𝑠,𝑎→𝑟′,𝑠′
53
Going “right” from “start,” where will you be next? (grid with start and goal)
54
model: 𝑠,𝑎→𝑟′,𝑠′
55
learning and acting
56
learn model of dynamics
from experience; difficulty varies
57
learn value function(s) with planning
assuming we have the environment dynamics or a good model already
58
planning (tree diagram: from state 𝑠_1, candidate actions 𝑎_1 and 𝑎_2)
59
planning (tree grows: each action leads to a reward and a next state)
60
planning (tree diagram) Start estimating 𝑞(𝑠,𝑎)!
61
planning (tree grows another level of actions, rewards, and states)
62
planning (tree diagram) Start estimating 𝑣(𝑠)!
63
planning (tree grows deeper)
64
planning (tree grows deeper still)
65
connections
Dijkstra’s algorithm
A* algorithm
minimax search
two-player games not a problem
Monte Carlo tree search
66
roll-outs (partial planning-tree diagram: only some paths of states, actions, and rewards are sampled)
67
roll-outs
68
everything can be stochastic
69
stochastic 𝑠′: “icy pond”
“icy pond” grid (start, goal; rewards -1 and -10): 20% chance you “slip” and go perpendicular to your intent
70
stochastic 𝑟′ multi-armed bandit
71
exploration vs. exploitation
𝜀-greedy policy: usually do what looks best; 𝜀 of the time, choose a random action
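A sketch of 𝜀-greedy action selection over a table of 𝑞(𝑠,𝑎) estimates; the dict-based q table, state names, and values are made up for illustration.

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """Usually pick the action with the highest q(s, a); epsilon of the time, pick at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

q = {("s1", "left"): 0.2, ("s1", "right"): 0.5}
print(epsilon_greedy(q, "s1", ["left", "right"]))  # usually "right"
```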
72
Question How does randomness affect 𝜋, 𝑣, and 𝑞? 𝜋:𝑠→𝑎, 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
73
Question How does randomness affect 𝜋, 𝑣, and 𝑞? 𝜋: ℙ(𝑎|𝑠), 𝑣:𝑠→𝔼[𝑟], 𝑞:𝑠,𝑎→𝔼[𝑟]
74
model-free: Monte Carlo returns
𝑣(𝑠) = ? keep track and average, like “planning” from experienced “roll-outs”
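A sketch of model-free Monte Carlo estimation: record the discounted return observed after each visit to a state and average. The episode format and the example episodes are assumptions made up for illustration.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    """Average the discounted return observed from each state visit (every-visit MC)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:            # episode: list of (state, reward received on leaving it)
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g      # return from this state onward
            totals[state] += g
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

episodes = [[("A", 0), ("B", 1)], [("A", 1)]]
print(mc_value_estimate(episodes))      # for these made-up episodes: A ~ 0.95, B = 1.0
```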
75
non-stationarity 𝑣(𝑠) changes over time!
76
moving average: new mean = old mean + 𝛼(new sample - old mean)
77
moving average: new mean = old mean + 𝛼(new sample - old mean)
𝛻(sample - mean)²
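The slide’s update as a one-liner; 𝛼 and the sample stream are arbitrary. Because the weight on old samples decays, the estimate can track a non-stationary signal.

```python
def update_mean(old_mean, new_sample, alpha=0.1):
    # new mean = old mean + alpha * (new sample - old mean)
    return old_mean + alpha * (new_sample - old_mean)

mean = 0.0
for sample in [1.0, 1.0, 0.0, 1.0]:   # tracks a possibly non-stationary signal
    mean = update_mean(mean, sample)
print(mean)
```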
78
Question What’s the difference between 𝑣(𝑠) and 𝑣(𝑠′)? (diagram: 𝑠, 𝑎 → 𝑟′, 𝑠′, …)
79
Question What’s the difference between 𝑣(𝑠) and 𝑣(𝑠′)? (diagram: 𝑠, 𝑎 → 𝑟′, 𝑠′, …)
80
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
81
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
82
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
dynamic programming
83
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
dynamic programming
curse of dimensionality
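A sketch of dynamic programming with Bellman backups on a tiny made-up MDP (the states, actions, and dynamics dict are all assumptions). This is the optimal-value variant: it adds a max over actions and a discount 𝛾 to the backup on the slide.

```python
# dynamics[(s, a)] = (r_next, s_next): a deterministic toy MDP, purely illustrative
dynamics = {
    ("start", "right"): (0.0, "mid"),
    ("start", "left"):  (0.0, "start"),
    ("mid", "right"):   (1.0, "goal"),
    ("mid", "left"):    (0.0, "start"),
}
states = ["start", "mid"]
actions = ["left", "right"]
gamma = 0.9

v = {"start": 0.0, "mid": 0.0, "goal": 0.0}    # terminal "goal" stays at 0
for _ in range(50):                            # repeated Bellman backups (value iteration)
    for s in states:
        v[s] = max(r + gamma * v[s_next]
                   for r, s_next in (dynamics[(s, a)] for a in actions))
print(v)  # converges to {'start': 0.9, 'mid': 1.0, 'goal': 0.0}
```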
84
temporal difference (TD)
𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
0 =? 𝑟′ + 𝑣(𝑠′) - 𝑣(𝑠)
“new sample”: 𝑟′ + 𝑣(𝑠′)
“old mean”: 𝑣(𝑠)
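A sketch of the TD update: move 𝑣(𝑠) toward the “new sample” 𝑟′ + 𝛾𝑣(𝑠′), using the moving-average form above. The discount 𝛾 is included here (as in the Q-learning slide), and the example transition is invented.

```python
def td_update(v, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """v(s) <- v(s) + alpha * (r' + gamma * v(s') - v(s))"""
    target = r_next + gamma * v.get(s_next, 0.0)   # "new sample"
    v[s] = v.get(s, 0.0) + alpha * (target - v.get(s, 0.0))
    return v

v = {}
td_update(v, "A", 1.0, "B")
print(v)  # {'A': 0.1}
```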
85
Q-learning: new 𝑞(𝑠,𝑎) = 𝑞(𝑠,𝑎) + 𝛼(𝑟′ + 𝛾 max_𝑎 𝑞(𝑠′,𝑎) - 𝑞(𝑠,𝑎))
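The same update as code, on a plain dict used as a tabular 𝑞; the transition and action names are made up.

```python
def q_update(q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """q(s,a) <- q(s,a) + alpha * (r' + gamma * max_a q(s',a) - q(s,a))"""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r_next + gamma * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q

q = {}
q_update(q, "s1", "right", 1.0, "s2", ["left", "right"])
print(q)  # {('s1', 'right'): 0.1}
```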
86
on-policy / off-policy
87
estimate 𝑣, 𝑞 with a deep neural network
88
back to the 𝜋: parameterize 𝜋 directly
update based on how well it works
REINFORCE: REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility
89
policy gradient: 𝛻 log 𝜋(𝑎|𝑠)
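A sketch of one REINFORCE-style step for a softmax policy over a few discrete actions. The preference table theta, the return G, and the sizes are all assumptions; for a softmax policy, 𝛻 log 𝜋(𝑎|𝑠) works out to one-hot(𝑎) minus 𝜋(·|𝑠), which is what the code uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, s, a, G, lr=0.01):
    """One update: theta[s] += lr * G * grad log pi(a|s)."""
    pi = softmax(theta[s])
    grad_log_pi = -pi          # grad log pi(a|s) = onehot(a) - pi(.|s) for softmax
    grad_log_pi[a] += 1.0
    theta[s] += lr * G * grad_log_pi
    return theta

theta = np.zeros((1, 2))       # one state, two actions (illustrative sizes)
theta = reinforce_step(theta, s=0, a=1, G=1.0)
print(softmax(theta[0]))       # probability of action 1 nudged up
```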
90
actor-critic: 𝜋 is the actor, 𝑣 is the critic
train 𝜋 by policy gradient to encourage actions that work out better than 𝑣 expected
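A sketch combining the two: the critic 𝑣 is updated by TD, and the TD error (how much better things turned out than 𝑣 expected) scales the actor’s policy-gradient step. The softmax/table setup mirrors the sketch above and is equally made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic_step(theta, v, s, a, r_next, s_next,
                      lr_pi=0.01, lr_v=0.1, gamma=0.9):
    # critic: TD error = how much better this turned out than v expected
    td_error = r_next + gamma * v[s_next] - v[s]
    v[s] += lr_v * td_error
    # actor: policy-gradient step scaled by the TD error
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += lr_pi * td_error * grad_log_pi
    return theta, v

theta = np.zeros((2, 2))          # two states, two actions (illustrative)
v = np.zeros(2)
actor_critic_step(theta, v, s=0, a=1, r_next=1.0, s_next=1)
```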
91
applications: how
92
Backgammon
93
TD-Gammon (1992)
𝑠 is custom features
𝑣 with shallow neural net
𝑟 is 1 for a win, 0 otherwise
TD(𝜆) eligibility traces
self-play
shallow forward search
94
Atari
95
DQN (2015)
𝑠 is four frames of video
𝑞 with deep convolutional neural net
𝑟 is 1 if the score increases, -1 if it decreases, 0 otherwise
Q-learning
usually-frozen target network
clipping update size, etc.
𝜀-greedy
experience replay
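A sketch of just the experience-replay piece: store (𝑠, 𝑎, 𝑟′, 𝑠′, done) transitions and sample random minibatches so updates are decorrelated. The capacity, batch size, and fake transitions are arbitrary; the Q-network and target-network updates themselves are omitted.

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample decorrelated minibatches for Q-learning updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r_next, s_next, done):
        self.buffer.append((s, a, r_next, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
for t in range(1000):                       # fake interaction, for illustration
    buffer.add(s=t, a=t % 4, r_next=0.0, s_next=t + 1, done=False)
batch = buffer.sample()
# in DQN, each sampled (s, a, r', s') feeds a Q-learning update against a
# usually-frozen target network, copied from the online net every so often
print(len(batch))
```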
96
Tetris
97
evolution (2006)
𝑠 is custom features
𝜋 has a simple parameterization
evaluate by final score
cross-entropy method
98
Go
99
AlphaGo (2016)
𝑠 is custom features over geometry
big and small variants
𝜋_SL with deep convolutional neural net, supervised training
𝜋_rollout with smaller convolutional neural net, supervised training
𝑟 is 1 for a win, -1 for a loss, 0 otherwise
𝜋_RL is 𝜋_SL refined with policy-gradient peer-play reinforcement learning
𝑣 with deep convolutional neural net, trained on 𝜋_RL games
asynchronous policy and value Monte Carlo tree search: expand the tree with 𝜋_SL, evaluate positions with a blend of 𝑣 and 𝜋_rollout rollouts
4 nets
100
AlphaGo Zero (2017)
𝑠 is simple features over time and geometry
𝜋, 𝑣 with deep residual convolutional neural net
𝑟 is 1 for a win, -1 for a loss, 0 otherwise
Monte Carlo tree search for self-play training and for play
1 net
101
onward
102
Neural Architecture Search (NAS)
policy gradient where the actions design a neural net; the reward is the designed net’s validation-set performance
103
language
106
robot control
107
self-driving cars: interest; results?
108
DotA 2
109
conclusion
111
reinforcement learning
𝑟,𝑠→𝑎
𝜋:𝑠→𝑎
𝑣:𝑠→𝑟
𝑞:𝑠,𝑎→𝑟
112
network architectures for deep RL
feature engineering can still matter
if using pixels, often simpler than state-of-the-art for supervised
don’t pool away location information if you need it
consider using multiple/auxiliary outputs
consider phrasing regression as classification
room for big advancements
113
the lure and limits of RL
seems like AI (?)
needs so much data
114
Question Should you use reinforcement learning?
115
Thank you!
116
further resources
planspace.org has these slides and links for all resources
A Brief Survey of Deep Reinforcement Learning (paper)
Karpathy’s Pong from Pixels (blog post)
Reinforcement Learning: An Introduction (textbook)
David Silver’s course (videos and slides)
Deep Reinforcement Learning Bootcamp (videos, slides, and labs)
OpenAI gym / baselines (software)
National Go Center (physical place)
Hack and Tell (fun meetup)