Deep Reinforcement Learning Aaron Schumacher Data Science DC 2017-11-14
Aaron Schumacher: planspace.org has these slides
Plan: applications: what / theory / applications: how / onward
applications: what
Backgammon
Atari
Tetris
Go
theory
Yann LeCun’s cake. cake: unsupervised learning; icing: supervised learning; cherry: reinforcement learning
unsupervised learning 𝑠
state 𝑠: numbers
unsupervised (?) learning: given 𝑠, learn 𝑠→cluster_id, or learn 𝑠→𝑠
unsupervised learning, deep: 𝑠, with deep neural nets
supervised learning 𝑠→𝑎
action 𝑎: numbers, e.g. 17.3, [2.0, 11.7, 5], “cat”/“dog”, 4.2V, “left”/“right”
supervised learning: given 𝑠,𝑎 pairs, learn 𝑠→𝑎
supervised learning, deep: 𝑠→𝑎 with deep neural nets
reinforcement learning 𝑟,𝑠→𝑎
reward 𝑟: a number, e.g. -3, 1, 7.4
optimal control / reinforcement learning
𝑟,𝑠→𝑎
⏰
𝑟′,𝑠′→𝑎′
⏰
reinforcement learning: “given” 𝑟,𝑠,𝑎, 𝑟′,𝑠′,𝑎′, … learn “good” 𝑠→𝑎
Question: Can you formulate supervised learning as reinforcement learning?
supervised as reinforcement? reward 0 for incorrect, reward 1 for correct; beyond binary classification? the reward is deterministic; the next state is random
Question: Why is the Sarsa algorithm called Sarsa?
reinforcement learning 𝑟,𝑠→𝑎
reinforcement learning, deep: 𝑟,𝑠→𝑎 with deep neural nets
Markov property: 𝑠 is enough
Markov models:            make choices? no              make choices? yes
completely observable     Markov Chain                  MDP (Markov Decision Process)
partially observable      HMM (Hidden Markov Model)     POMDP (Partially Observable MDP)
usual RL problem elaboration
policy 𝜋: 𝜋:𝑠→𝑎
return: 𝑟 summed into the future (episodic or ongoing)
start goal
reward 1
return 1
Question: How do we incentivize efficiency in getting from start to goal?
negative reward at each time step: -1
discounted rewards: return = ∑_𝑡 𝛾^𝑡 𝑟_{𝑡+1}
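As a concrete illustration of discounting, here is a minimal Python sketch that computes a discounted return from a list of rewards; the reward values and 𝛾 are made up for the example.

    # minimal sketch: discounted return G = sum_t gamma^t * r_{t+1}
    def discounted_return(rewards, gamma=0.99):
        g = 0.0
        for r in reversed(rewards):   # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
            g = r + gamma * g
        return g

    print(discounted_return([-1, -1, -1, 10]))  # small made-up episode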
average rate of reward: lim_{𝑡→∞} (1/𝑡) ∑_{𝑘=1}^{𝑡} 𝑟_𝑘
value functions: 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
Question: Does a value function specify a policy (if you want to maximize return)? 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
trick question? value functions are defined relative to a policy: 𝑣_𝜋, 𝑞_𝜋
Question: Does a value function specify a policy (if you want to maximize return)? 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
environment dynamics: 𝑠,𝑎→𝑟′,𝑠′
Going “right” from “start,” where will you be next? (start/goal gridworld)
model: 𝑠,𝑎→𝑟′,𝑠′
learning and acting
learn a model of the dynamics from experience (difficulty varies)
learn value function(s) with planning, assuming we have the environment dynamics or a good model already
planning: [tree diagram] grow a tree of actions 𝑎, rewards 𝑟, and states 𝑠 outward from 𝑠_1; as the tree grows, start estimating 𝑞(𝑠,𝑎), and then 𝑣(𝑠)
connections: Dijkstra’s algorithm; A* algorithm; minimax search (two-player games not a problem); Monte Carlo tree search
roll-outs: [tree diagram] rather than expanding every branch, sample individual action-reward-state paths deep into the tree
everything can be stochastic
stochastic 𝑠′: “icy pond” gridworld (start, goal; rewards -1 and -10 for the icy pond); 20% chance you “slip” and go perpendicular to your intent
stochastic 𝑟′: multi-armed bandit
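A toy Python sketch of the “slip” dynamics described above; the grid encoding, move names, and slip handling are assumptions for illustration (boundaries and the pond itself are not modeled).

    import random

    # intended moves on a grid: (row, col) deltas
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    PERPENDICULAR = {"up": ["left", "right"], "down": ["left", "right"],
                     "left": ["up", "down"], "right": ["up", "down"]}

    def slip_step(state, action, slip_prob=0.2):
        # with probability slip_prob, move perpendicular to the intended direction
        if random.random() < slip_prob:
            action = random.choice(PERPENDICULAR[action])
        dr, dc = MOVES[action]
        return (state[0] + dr, state[1] + dc)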
exploration vs. exploitation: 𝜀-greedy policy: usually do what looks best; 𝜀 of the time, choose a random action
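A minimal Python sketch of 𝜀-greedy action selection over a tabular 𝑞, assuming 𝑞 is a dict keyed by (state, action); the names are illustrative, not from the slides.

    import random

    def epsilon_greedy(q, state, actions, epsilon=0.1):
        # usually act greedily; epsilon of the time, explore with a random action
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q.get((state, a), 0.0))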
Question: How does randomness affect 𝜋, 𝑣, and 𝑞? 𝜋:𝑠→𝑎, 𝑣:𝑠→𝑟, 𝑞:𝑠,𝑎→𝑟
Question: How does randomness affect 𝜋, 𝑣, and 𝑞? 𝜋:ℙ(𝑎|𝑠), 𝑣:𝑠→𝔼[𝑟], 𝑞:𝑠,𝑎→𝔼[𝑟]
model-free: Monte Carlo returns: 𝑣(𝑠) = ? keep track and average, like “planning” from experienced “roll-outs”
non-stationarity: 𝑣(𝑠) changes over time!
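A short sketch of Monte Carlo value estimation under one assumed episode format (a list of (state, reward) pairs, where the reward is the one received on leaving that state); this is every-visit averaging, just one way to do it.

    from collections import defaultdict

    def monte_carlo_v(episodes, gamma=0.99):
        # episodes: list of [(state, reward), ...]; average sampled returns per state
        returns = defaultdict(list)
        for episode in episodes:
            g = 0.0
            for state, reward in reversed(episode):
                g = reward + gamma * g
                returns[state].append(g)       # every-visit Monte Carlo
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}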
moving average: new mean = old mean + 𝛼(new sample − old mean)
moving average: new mean = old mean + 𝛼(new sample − old mean); this follows the gradient 𝛻 ½(sample − mean)²
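The moving-average update as a one-line Python helper, for reference; 𝛼 here is an arbitrary example step size.

    def update_mean(old_mean, new_sample, alpha=0.1):
        # new mean = old mean + alpha * (new sample - old mean);
        # a gradient step on 0.5 * (new_sample - old_mean)**2
        return old_mean + alpha * (new_sample - old_mean)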
Question: What’s the difference between 𝑣(𝑠) and 𝑣(𝑠′)? (𝑠 → 𝑎 → 𝑟′, 𝑠′ → …)
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′)
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′); solve with dynamic programming
Bellman equation: 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′); dynamic programming hits the curse of dimensionality
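A minimal sketch of using the Bellman equation for dynamic programming on a tiny made-up deterministic MDP; the states, transitions, and rewards are invented for illustration, and 𝛾 is left at 1.

    # tiny made-up deterministic MDP: dynamics[state][action] = (reward, next_state)
    dynamics = {
        "start":  {"right": (-1, "middle")},
        "middle": {"right": (-1, "goal"), "left": (-1, "start")},
        "goal":   {},                      # terminal: no actions
    }

    def value_iteration(dynamics, gamma=1.0, sweeps=50):
        v = {s: 0.0 for s in dynamics}
        for _ in range(sweeps):
            for s, actions in dynamics.items():
                if actions:   # Bellman backup: v(s) = max_a [ r' + gamma * v(s') ]
                    v[s] = max(r + gamma * v(s2) for r, s2 in actions.values())
        return v

    print(value_iteration(dynamics))  # {'start': -2.0, 'middle': -1.0, 'goal': 0.0}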
temporal difference (TD): 𝑣(𝑠) = 𝑟′ + 𝑣(𝑠′), so 0 =? 𝑟′ + 𝑣(𝑠′) − 𝑣(𝑠); “new sample”: 𝑟′ + 𝑣(𝑠′), “old mean”: 𝑣(𝑠)
Q-learning: new 𝑞(𝑠,𝑎) = 𝑞(𝑠,𝑎) + 𝛼(𝑟′ + 𝛾 max_𝑎′ 𝑞(𝑠′,𝑎′) − 𝑞(𝑠,𝑎))
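The Q-learning update above, dropped into a small tabular training loop as a sketch; the environment interface (reset()/step() returning next state, reward, done) is an assumption loosely in the style of OpenAI gym, not a specific library API.

    from collections import defaultdict
    import random

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # tabular Q-learning; env is a hypothetical environment whose step(a)
        # returns (next_state, reward, done)
        q = defaultdict(float)
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < epsilon:                       # explore
                    a = random.choice(actions)
                else:                                               # exploit
                    a = max(actions, key=lambda a_: q[(s, a_)])
                s_next, r, done = env.step(a)
                target = r if done else r + gamma * max(q[(s_next, a_)] for a_ in actions)
                q[(s, a)] += alpha * (target - q[(s, a)])           # the update from the slide
                s = s_next
        return q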
on-policy / off-policy
estimate 𝑣, 𝑞 with a deep neural network
back to the 𝜋: parameterize 𝜋 directly; update based on how well it works. REINFORCE: REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility
policy gradient: 𝛻 log 𝜋(𝑎|𝑠)
actor-critic: 𝜋 is the actor, 𝑣 is the critic; train 𝜋 by policy gradient to encourage actions that work out better than 𝑣 expected
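A numpy sketch of REINFORCE for a linear-softmax policy over discrete actions: accumulate the return 𝐺 and step along 𝐺·𝛻 log 𝜋(𝑎|𝑠). The parameter shapes and episode format are assumptions for illustration.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
        # theta: (n_features, n_actions); episode: list of (features, action, reward)
        g = 0.0
        grad = np.zeros_like(theta)
        for x, a, r in reversed(episode):
            g = r + gamma * g                  # return from this step onward
            probs = softmax(theta.T @ x)       # pi(. | s) for all actions
            dlog = -np.outer(x, probs)         # grad log pi(a|s): -x * probs ...
            dlog[:, a] += x                    # ... plus x for the chosen action
            grad += g * dlog                   # weight by the return
        return theta + lr * grad               # gradient ascent step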
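One way to sketch a single actor-critic step: the TD error plays the role of “better than 𝑣 expected.” Here `policy_update` is a hypothetical callback that applies a policy-gradient step weighted by that error; everything is illustrative.

    def actor_critic_step(v, policy_update, s, a, r_next, s_next, alpha=0.1, gamma=0.99):
        # critic: TD error measures how much better (or worse) things went than v expected
        td_error = r_next + gamma * v.get(s_next, 0.0) - v.get(s, 0.0)
        v[s] = v.get(s, 0.0) + alpha * td_error       # move the critic toward the target
        policy_update(s, a, weight=td_error)          # actor: scale grad log pi(a|s) by td_error
        return td_error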
applications: how
Backgammon
TD-Gammon (1992): 𝑠 is custom features; 𝑣 with a shallow neural net; 𝑟 is 1 for a win, 0 otherwise; TD(𝜆) eligibility traces; self-play; shallow forward search
Atari
DQN (2015): 𝑠 is four frames of video; 𝑞 with a deep convolutional neural net; 𝑟 is 1 if the score increases, -1 if it decreases, 0 otherwise; Q-learning; usually-frozen target network; clipping update size, etc.; 𝜀-greedy; experience replay
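One DQN ingredient, experience replay, as a minimal sketch (the rest of the agent, target network and all, is omitted); the capacity and batch size are arbitrary example values.

    import random
    from collections import deque

    class ReplayBuffer:
        # store (s, a, r, s_next, done) transitions; sample random minibatches for training
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, transition):
            self.buffer.append(transition)

        def sample(self, batch_size=32):
            return random.sample(self.buffer, batch_size)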
Tetris
evolution (2006): 𝑠 is custom features; 𝜋 has a simple parameterization; evaluate by final score; cross-entropy method
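A sketch of the cross-entropy method for a policy with a simple parameterization: sample parameter vectors from a Gaussian, score them (e.g. by final Tetris score), keep the elite fraction, and refit. The `score` callable and the hyperparameters are assumptions.

    import numpy as np

    def cross_entropy_method(score, dim, iterations=50, pop=100, elite_frac=0.2):
        # score: callable mapping a parameter vector to a scalar (e.g. final game score)
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = int(pop * elite_frac)
        for _ in range(iterations):
            samples = np.random.randn(pop, dim) * std + mean
            scores = np.array([score(w) for w in samples])
            elite = samples[np.argsort(scores)[-n_elite:]]       # best-scoring parameters
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        return mean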
Go
AlphaGo (2016): 𝑠 is custom features over geometry, big and small variants; 𝜋_𝑆𝐿 with a deep convolutional neural net, supervised training; 𝜋_𝑟𝑜𝑙𝑙𝑜𝑢𝑡 with a smaller convolutional neural net, supervised training; 𝑟 is 1 for a win, -1 for a loss, 0 otherwise; 𝜋_𝑅𝐿 is 𝜋_𝑆𝐿 refined with policy gradient, peer-play reinforcement learning; 𝑣 with a deep convolutional neural net, trained based on 𝜋_𝑅𝐿 games; asynchronous policy and value Monte Carlo tree search: expand the tree with 𝜋_𝑆𝐿, evaluate positions with a blend of 𝑣 and 𝜋_𝑟𝑜𝑙𝑙𝑜𝑢𝑡 rollouts; 4 nets
AlphaGo Zero (2017): 𝑠 is simple features over time and geometry; 𝜋,𝑣 with one deep residual convolutional neural net; 𝑟 is 1 for a win, -1 for a loss, 0 otherwise; Monte Carlo tree search for self-play training and for play; 1 net
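A rough sketch of the kind of PUCT-style selection rule used inside this sort of Monte Carlo tree search, balancing value estimates, priors, and visit counts; the data structure and constant are assumptions, not the paper's exact formulation.

    import math

    def puct_select(children, c_puct=1.0):
        # children: list of dicts with visit count "n", total value "w", and prior "p"
        total_n = sum(child["n"] for child in children)
        def score(child):
            q = child["w"] / child["n"] if child["n"] else 0.0                       # mean value so far
            u = c_puct * child["p"] * math.sqrt(total_n + 1) / (1 + child["n"])      # exploration bonus
            return q + u
        return max(children, key=score)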
onward
Neural Architecture Search (NAS): policy gradient where the actions design a neural net; the reward is the designed net’s validation-set performance
language
robot control
self-driving cars: interest; results?
DotA 2
conclusion
reinforcement learning: 𝑟,𝑠→𝑎; 𝜋:𝑠→𝑎; 𝑣:𝑠→𝑟; 𝑞:𝑠,𝑎→𝑟
network architectures for deep RL: feature engineering can still matter; if using pixels, nets are often simpler than state-of-the-art supervised ones; don’t pool away location information if you need it; consider using multiple/auxiliary outputs; consider phrasing regression as classification; room for big advancements
the lure and limits of RL: seems like AI (?); needs so much data
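One item above, phrasing regression as classification, as a tiny sketch: discretize the continuous target into buckets and predict the bucket. The target range and bucket count are arbitrary example choices.

    import numpy as np

    def to_class(value, bins):
        # e.g. turn a continuous return/score target into one of len(bins)-1 classes
        return int(np.digitize(value, bins)) - 1

    bins = np.linspace(-10.0, 10.0, 21)      # 20 buckets over an assumed target range
    print(to_class(3.7, bins))               # index of the bucket containing 3.7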
Question: Should you use reinforcement learning?
Thank you!
further resources: planspace.org has these slides and links for all resources; A Brief Survey of Deep Reinforcement Learning (paper); Karpathy’s Pong from Pixels (blog post); Reinforcement Learning: An Introduction (textbook); David Silver’s course (videos and slides); Deep Reinforcement Learning Bootcamp (videos, slides, and labs); OpenAI gym / baselines (software); National Go Center (physical place); Hack and Tell (fun meetup)