1
Introduction of Reinforcement Learning
Scratching the surface
2
Deep Reinforcement Learning
3
Reference textbook: Reinforcement Learning: An Introduction (Sutton & Barto)
Lectures of David Silver (10 lectures, around 1.5 hours each); Deep Reinforcement Learning lectures of John Schulman (OpenAI)
4
Scenario of Reinforcement Learning
Diagram: the agent receives an observation (state) from the environment, takes an action that changes the environment, and the environment returns a reward ("Don't do that").
5
Scenario of Reinforcement Learning
Agent learns to take actions maximizing expected reward. Diagram: observation → action → the environment changes state → reward ("Thank you").
6
Machine Learning ≈ Looking for a Function
The actor/policy is a function: Action = π(Observation). The observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.
7
Learning to play Go: the observation is the board position, the action is the next move, and the reward comes from the environment.
8
Learning to play Go: the agent learns to take actions maximizing expected reward. Reward = 0 in most cases; if win, reward = 1; if loss, reward = -1.
9
Learning to play Go. Supervised: learning from a teacher (learning by watching game records), e.g. next move: "5-5", next move: "3-3".
Reinforcement learning: learning from experience — first move …… many moves …… Win! (Two agents play against each other.) AlphaGo is supervised learning + reinforcement learning.
10
Learning a chat-bot Machine obtains feedback from user
The chat-bot learns to maximize the expected reward. Example exchanges: "How are you?" → "Bye bye" and "Hello" → "Hi", receiving different rewards (e.g. -10 and 3). Consider that you have a non-differentiable objective function. Multiple dialogue turns are ignored here.
11
Learning a chat-bot: let two agents talk to each other (sometimes they generate good dialogues, sometimes bad ones). Example: "How old are you?" / "How old are you?" / "I am 16." / "See you." / "I thought you were 12." / "See you." / "What makes you think so?" / "See you."
12
Machine learns from the evaluation
Learning a chat-bot: by this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8). Use some pre-defined rules to evaluate the goodness of each dialogue; the machine learns from the evaluation. (See "Deep Reinforcement Learning for Dialogue Generation".)
13
Learning a chat-bot. Supervised: for "Hello", say "Hi"; for "Bye bye", say "Good bye".
Reinforcement: the agent carries on a whole dialogue ("Hello" ……) and only receives overall feedback at the end (e.g. "Bad").
14
More applications
Flying helicopter; driving robot; "Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI"; text generation.
15
Example: Playing Video Game
Widely studied platforms: Gym, Universe. The machine learns to play video games like a human player: what the machine observes is pixels, and it learns to take proper actions by itself.
16
Example: Playing Video Game
Space Invaders: the score is the reward, obtained by killing the aliens; the screen also shows the shields and the "fire" action. Termination: all the aliens are killed, or your spaceship is destroyed.
17
Example: Playing Video Game
Space Invaders: play yourself — how about the machine?
18
Example: Playing Video Game
Start with observation $s_1$ → take action $a_1$ ("right"), obtain reward $r_1 = 0$ → observation $s_2$ → take action $a_2$ ("fire"), obtain reward $r_2 = 5$ (kill an alien) → observation $s_3$. Usually there is some randomness in the environment.
19
Example: Playing Video Game
Start with observation $s_1$, then $s_2$, $s_3$, …; after many turns, taking action $a_T$ and obtaining reward $r_T$, the game is over (spaceship destroyed). This is an episode. Learn to maximize the expected cumulative reward per episode.
20
Properties of Reinforcement Learning
Reward delay: in Space Invaders, only "fire" obtains a reward, although the movements before "fire" are important. In playing Go, it may be better to sacrifice immediate reward to gain more long-term reward. The agent's actions also affect the subsequent data it receives, e.g. it needs to explore.
21
Outline
Model-free approach: Policy-based (learning an actor), Value-based (learning a critic), and Actor + Critic. Model-based approach. AlphaGo: policy-based + value-based + model-based.
22
Policy-based Approach
Learning an Actor
23
Three Steps for Deep Learning
Step 1: define a set of functions — a neural network as the actor. Step 2: goodness of the function. Step 3: pick the best function. Deep learning is so simple ……
24
Neural network as Actor
Input of the neural network: the machine's observation, represented as a vector or a matrix (e.g. pixels). Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action, e.g. left 0.7, right 0.2, fire 0.1; the actor takes the action with the highest score (or samples from this distribution). What is the benefit of using a network instead of a lookup table? Generalization.
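As a rough illustration (not code from the lecture), a minimal neural-network actor in PyTorch might look like the sketch below; the flattened-pixel input size, the hidden width, and the three-action output (left/right/fire) are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A tiny actor: pixels in, one output neuron per action, softmax on top."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Probability of taking each action, e.g. [0.7, 0.2, 0.1] for left/right/fire.
        return torch.softmax(self.net(obs), dim=-1)

actor = PolicyNetwork(obs_dim=84 * 84, n_actions=3)   # assumed observation size
probs = actor(torch.rand(1, 84 * 84))
action = torch.multinomial(probs, num_samples=1)       # sample an action (or take the argmax)
```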
25
Three Steps for Deep Learning
Step 1: define a set of functions — a neural network as the actor. Step 2: goodness of the function. Step 3: pick the best function. Deep learning is so simple ……
26
Goodness of Actor — review of supervised learning: given a set of parameters $\theta$, a training example (the digit "1") is fed into the network, which outputs $y_1, y_2, \dots, y_{10}$ through a softmax; the output should be as close as possible to the target, giving a per-example loss $l$. Total loss: $L = \sum_{n=1}^{N} l_n$.
27
Goodness of Actor. Given an actor $\pi_\theta(s)$ with network parameters $\theta$, use the actor $\pi_\theta(s)$ to play the video game: start with observation $s_1$, the machine decides to take $a_1$ and obtains reward $r_1$, sees observation $s_2$, takes $a_2$, obtains $r_2$, sees $s_3$, …, takes $a_T$, obtains $r_T$, and the episode ends. Total reward: $R_\theta = \sum_{t=1}^{T} r_t$. Even with the same actor, $R_\theta$ is different each time because of randomness in the actor and in the game, so we define $\bar{R}_\theta$ as the expected value of $R_\theta$; $\bar{R}_\theta$ evaluates the goodness of the actor $\pi_\theta(s)$.
28
Goodness of Actor. A trajectory is $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$, with probability
$$P(\tau|\theta) = p(s_1)\, p(a_1|s_1,\theta)\, p(r_1, s_2|s_1, a_1)\, p(a_2|s_2,\theta)\, p(r_2, s_3|s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t)$$
The terms $p(s_1)$ and $p(r_t, s_{t+1}|s_t, a_t)$ are not related to your actor; $p(a_t|s_t,\theta)$ is controlled by your actor $\pi_\theta(s_t)$, e.g. $p(a_t = \text{"fire"}|s_t,\theta) = 0.7$ (left 0.1, right 0.2, fire 0.7).
29
Goodness of Actor. An episode is considered as a trajectory $\tau = \{s_1, a_1, r_1, \dots, s_T, a_T, r_T\}$ with $R(\tau) = \sum_{t=1}^{T} r_t$. If you use an actor to play the game, each $\tau$ has a probability of being sampled, which depends on the actor parameters $\theta$: $P(\tau|\theta)$. So
$$\bar{R}_\theta = \sum_{\tau} R(\tau)\, P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$$
where the sum is over all possible trajectories; in practice, use $\pi_\theta$ to play the game $N$ times to obtain $\tau^1, \tau^2, \dots, \tau^N$, i.e. sample $\tau$ from $P(\tau|\theta)$ $N$ times. (Why don't we multiply by $P(\tau|\theta)$ anymore? Because sampling from $P(\tau|\theta)$ already visits each trajectory in proportion to its probability.)
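A minimal sketch of this approximation in code: play $N$ episodes with the current actor and average the total rewards. The `env`/`actor` interface in the docstring is an assumption for the sketch, not a specific library API.

```python
def estimate_expected_reward(env, actor, n_episodes: int = 100) -> float:
    """Assumed interface: env.reset() -> state, env.step(a) -> (state, reward, done),
    actor(state) -> action sampled from p(a | s, theta)."""
    total = 0.0
    for _ in range(n_episodes):
        state, done, episode_reward = env.reset(), False, 0.0
        while not done:
            action = actor(state)
            state, reward, done = env.step(action)
            episode_reward += reward          # accumulate R(tau) = sum_t r_t
        total += episode_reward
    return total / n_episodes                 # ~ (1/N) sum_n R(tau^n)
```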
30
Three Steps for Deep Learning
Step 1: define a set of functions — a neural network as the actor. Step 2: goodness of the function. Step 3: pick the best function. Deep learning is so simple ……
31
Gradient Ascent. Problem statement: $\theta^* = \arg\max_\theta \bar{R}_\theta$, where $\theta = \{w_1, w_2, \dots, b_1, \dots\}$. Gradient ascent: start with $\theta^0$, then $\theta^1 \leftarrow \theta^0 + \eta \nabla \bar{R}_{\theta^0}$, $\theta^2 \leftarrow \theta^1 + \eta \nabla \bar{R}_{\theta^1}$, ……, where
$$\nabla \bar{R}_\theta = \begin{bmatrix} \partial \bar{R}_\theta / \partial w_1 \\ \partial \bar{R}_\theta / \partial w_2 \\ \vdots \\ \partial \bar{R}_\theta / \partial b_1 \\ \vdots \end{bmatrix}$$
32
Policy Gradient. $\bar{R}_\theta = \sum_\tau R(\tau) P(\tau|\theta)$, so what is $\nabla \bar{R}_\theta$?
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau) \nabla P(\tau|\theta) = \sum_\tau R(\tau) P(\tau|\theta) \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_\tau R(\tau) P(\tau|\theta) \nabla \log P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta)$$
using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$, and using $\pi_\theta$ to play the game $N$ times to obtain $\tau^1, \tau^2, \dots, \tau^N$. Note that $R(\tau)$ does not have to be differentiable; it can even be a black box.
33
Policy Gradient. What is $\nabla \log P(\tau|\theta)$? With $\tau = \{s_1, a_1, r_1, \dots, s_T, a_T, r_T\}$ and
$$P(\tau|\theta) = p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t),$$
$$\log P(\tau|\theta) = \log p(s_1) + \sum_{t=1}^{T} \big[ \log p(a_t|s_t,\theta) + \log p(r_t, s_{t+1}|s_t, a_t) \big],$$
so, ignoring the terms not related to $\theta$,
$$\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t,\theta).$$
34
Policy Gradient. $\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$, where
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n,\theta).$$
If, in $\tau^n$, the machine takes $a_t^n$ when seeing $s_t^n$: when $R(\tau^n)$ is positive, tune $\theta$ to increase $p(a_t^n|s_t^n)$; when $R(\tau^n)$ is negative, tune $\theta$ to decrease $p(a_t^n|s_t^n)$. What if we replaced $R(\tau^n)$ with $r_t^n$ …… It is very important to weight by the cumulative reward $R(\tau^n)$ of the whole trajectory $\tau^n$ instead of the immediate reward $r_t^n$.
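A hedged PyTorch sketch of this update, assuming `actor` maps a batch of states to action probabilities and `trajectories` holds the $N$ sampled episodes; ascending $\nabla\bar{R}_\theta$ is written as descending the negated objective.

```python
import torch

def policy_gradient_step(actor, optimizer, trajectories):
    """trajectories: list of (states, actions, total_reward) per episode;
    states: float tensor (T, obs_dim), actions: long tensor (T,), total_reward: float R(tau^n)."""
    loss = 0.0
    for states, actions, total_reward in trajectories:
        probs = actor(states)                                          # (T, n_actions)
        log_probs = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
        loss = loss - total_reward * log_probs.sum()                   # -R(tau^n) * sum_t log p(a_t|s_t, theta)
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()                                                    # gradient of the negated objective
    optimizer.step()                                                   # theta <- theta + eta * grad R_bar
```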
35
Policy Gradient — data collection and model update loop. Given actor parameters $\theta$, collect data: play episodes $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \dots$ with total reward $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \dots$ with total reward $R(\tau^2)$; ……. Then update the model: $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$ with
$$\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n,\theta),$$
and collect data with the updated actor again.
36
Policy Gradient — considered as a classification problem. Given a state $s_1$, the network outputs $y_i$ for each action (left, right, fire); the sampled action acts as the target $\hat{y}_i$. Minimize the cross-entropy $-\sum_{i=1}^{3} \hat{y}_i \log y_i$, i.e. maximize $\log y_i = \log P(\text{"left"}|s)$, so the update is $\theta \leftarrow \theta + \eta \nabla \log P(\text{"left"}|s)$.
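A tiny sketch of this classification view with assumed tensor values: the sampled action ("left") plays the role of the class label, so the standard cross-entropy is exactly $-\log P(\text{"left"}\mid s,\theta)$.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 3)            # network scores for (left, right, fire) at state s
label = torch.tensor([0])             # the action actually taken, here "left"
loss = F.cross_entropy(logits, label) # = -log p("left" | s, theta)
# A gradient step that decreases this loss is the update theta <- theta + eta * grad log P("left" | s).
```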
37
Policy Gradient. $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$ with $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n,\theta)$. Given actor parameters $\theta$ and the collected data $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \dots$ with $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \dots$ with $R(\tau^2)$; each pair is a classification example: for $a_1^1 = \text{left}$, the NN target at $s_1^1$ puts 1 on "left"; for $a_1^2 = \text{fire}$, the NN target at $s_1^2$ puts 1 on "fire".
38
Policy Gradient. The same update, but each training example is weighted by $R(\tau^n)$: if $R(\tau^1) = 2$, the example $(s_1^1, a_1^1 = \text{left})$ effectively appears twice; if $R(\tau^2) = 1$, the example $(s_1^2, a_1^2 = \text{fire})$ appears once.
39
Add a Baseline. It is possible that $R(\tau^n)$ is always positive, so every sampled action has its probability pushed up; because the output is a probability distribution, the actions that are not sampled will see their probability decrease — driven by sampling, not by those actions being bad. Therefore subtract a baseline $b$:
$$\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}, \qquad \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big( R(\tau^n) - b \big) \nabla \log p(a_t^n|s_t^n,\theta)$$
so that only better-than-baseline trajectories increase the probability of their actions.
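A small numeric sketch (made-up return values) of the effect of the baseline: subtracting the average return turns uniformly positive $R(\tau^n)$ into positive and negative weights.

```python
import numpy as np

returns = np.array([10.0, 12.0, 3.0, 8.0])   # R(tau^1) ... R(tau^N), all positive
baseline = returns.mean()                    # one common choice for b
advantages = returns - baseline              # [ 1.75,  3.75, -5.25, -0.25]
# Each grad log p(a_t^n | s_t^n, theta) is then weighted by advantages[n] instead of returns[n].
```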
40
Value-based Approach: Learning a Critic (Double Q, Dueling Q)
41
Critic A critic does not determine the action.
Given an actor π, it evaluates how good the actor is. An actor can be found from a critic, e.g. Q-learning.
42
Critic: the state value function $V^\pi(s)$ — when using actor $\pi$, the cumulative reward expected to be obtained after seeing observation (state) $s$. $V^\pi(s)$ is a scalar output of the network. For a frame with lots of aliens still to be killed and the shields still intact, $V^\pi(s)$ is large; for a frame near the end of the game, $V^\pi(s)$ is smaller.
43
Critic: $V^{\text{the earlier Hikaru (阿光)}}(\text{large knight's move}) = \text{bad}$, but $V^{\text{the stronger Hikaru}}(\text{large knight's move}) = \text{good}$ — the value of the same move depends on which actor is being evaluated.
44
How to estimate $V^\pi(s)$ — Monte-Carlo based approach: the critic watches $\pi$ playing the game. After seeing $s_a$, the cumulative reward until the end of the episode is $G_a$, so train $V^\pi(s_a) \rightarrow G_a$; after seeing $s_b$, the cumulative reward until the end of the episode is $G_b$, so train $V^\pi(s_b) \rightarrow G_b$.
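A minimal tabular sketch of the Monte-Carlo estimate, assuming an episode is given as a list of (state, reward) pairs; a network-based critic would instead regress $V^\pi(s)$ toward the same targets $G$.

```python
from collections import defaultdict

def mc_update(V, episode, alpha=0.1):
    """V: dict state -> value estimate; episode: [(state, reward), ...] in time order."""
    G = 0.0
    for state, reward in reversed(episode):
        G += reward                          # cumulative reward from this state to the end
        V[state] += alpha * (G - V[state])   # move V(state) toward the observed return G
    return V

V = mc_update(defaultdict(float), [("s_a", 0.0), ("s_b", 1.0)])
```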
45
How to estimate $V^\pi(s)$ — Temporal-difference (TD) approach: from one transition $\cdots, s_t, a_t, r_t, s_{t+1}, \cdots$ we have $V^\pi(s_t) = r_t + V^\pi(s_{t+1})$, so train the network so that $V^\pi(s_t) - V^\pi(s_{t+1}) \rightarrow r_t$. The most obvious advantage of TD methods over Monte-Carlo methods is that they are naturally implemented in an online, fully incremental fashion: with Monte-Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. Surprisingly often this turns out to be a critical consideration: some applications have very long episodes, so that delaying all learning until an episode's end is too slow, and other applications are continuing tasks with no episodes at all.
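A minimal tabular sketch of the TD update as written on this slide (undiscounted, as on the slide; a real implementation would usually include a discount factor $\gamma$ in the target).

```python
def td_update(V, s_t, r_t, s_next, alpha=0.1):
    """V: dict state -> value estimate; one update per observed transition (s_t, a_t, r_t, s_{t+1})."""
    target = r_t + V[s_next]             # what V(s_t) should match on average
    V[s_t] += alpha * (target - V[s_t])  # fully online: no need to wait for the episode to end
    return V
```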
46
MC vs. TD. Monte-Carlo: $V^\pi(s_a) \rightarrow G_a$; the target $G_a$ sums many random rewards, so it has larger variance, but it is unbiased. TD: $V^\pi(s_t) \rightarrow r_t + V^\pi(s_{t+1})$; the target involves only one reward $r_t$, so it has smaller variance, but it may be biased because $V^\pi(s_{t+1})$ is itself an estimate.
47
MC vs. TD [Sutton, v2, Example 6.4]. The critic has observed the following 8 episodes (the actions are ignored here): $s_a, r=0, s_b, r=0$, END; $s_b, r=1$, END (six times); $s_b, r=0$, END. Then $V^\pi(s_b) = 3/4$. What is $V^\pi(s_a)$: 0 or 3/4? Monte-Carlo: $V^\pi(s_a) = 0$ (the only episode through $s_a$ has return 0). Temporal-difference: $V^\pi(s_a) = r + V^\pi(s_b) = 0 + 3/4 = 3/4$.
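The slide's numbers can be checked directly; the short computation below (the eight episodes written out explicitly) reproduces $V^\pi(s_a)=0$ for Monte-Carlo and $3/4$ for TD.

```python
episodes = [[("s_a", 0.0), ("s_b", 0.0)]] + [[("s_b", 1.0)]] * 6 + [[("s_b", 0.0)]]

# Monte-Carlo: average the returns of the episodes that visit s_a (only the first one, return 0).
returns_from_sa = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "s_a"]
V_sa_mc = sum(returns_from_sa) / len(returns_from_sa)   # 0.0

# TD: V(s_b) = average immediate outcome from s_b = 6/8, and V(s_a) = r + V(s_b) = 0 + 3/4.
rewards_at_sb = [r for ep in episodes for s, r in ep if s == "s_b"]
V_sb = sum(rewards_at_sb) / len(rewards_at_sb)          # 0.75
V_sa_td = 0.0 + V_sb                                    # 0.75
```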
48
Another Critic: the state-action value function $Q^\pi(s,a)$ — when using actor $\pi$, the cumulative reward expected to be obtained after seeing observation $s$ and taking action $a$. Two network forms: take $s$ and $a$ as input and output the scalar $Q^\pi(s,a)$; or, for discrete actions only, take $s$ as input and output $Q^\pi(s, a=\text{left})$, $Q^\pi(s, a=\text{right})$, $Q^\pi(s, a=\text{fire})$ all at once.
49
Q-Learning loop: $\pi$ interacts with the environment → learn $Q^\pi(s,a)$ (by TD or MC) → find a new actor $\pi'$ "better" than $\pi$ → set $\pi = \pi'$ (adding some noise for exploration) and repeat.
50
Q-Learning. Given $Q^\pi(s,a)$, find a new actor $\pi'$ "better" than $\pi$, where "better" means $V^{\pi'}(s) \geq V^{\pi}(s)$ for all states $s$. The new actor is $\pi'(s) = \arg\max_a Q^\pi(s,a)$; $\pi'$ has no extra parameters — it depends only on $Q$. This is not suitable for a continuous action $a$.
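A minimal tabular sketch of how the actor falls out of the critic, with the "add some noise" step of the loop realized as ε-greedy exploration (a common choice, assumed here).

```python
import random

def greedy_action(Q, state, actions):
    """Q: dict (state, action) -> value. The improved policy pi'(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)        # explore: occasionally try a random action
    return greedy_action(Q, state, actions)  # exploit: follow the critic
```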
51
Deep Reinforcement Learning
Actor-Critic Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.
52
Actor-Critic loop: $\pi$ interacts with the environment → learn $Q^\pi(s,a)$ and $V^\pi(s)$ (by TD or MC) → update the actor from $\pi$ to $\pi'$ based on $Q^\pi(s,a)$ and $V^\pi(s)$ → set $\pi = \pi'$ and repeat.
53
Actor-Critic tips: the parameters of the actor $\pi(s)$ and the critic $V^\pi(s)$ can be shared — one network reads the state $s$, and separate output heads produce the action scores (left, right, fire) for the actor and the scalar $V^\pi(s)$ for the critic.
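A hedged PyTorch sketch (assumed layer sizes) of this tip: one shared trunk with an actor head and a critic head.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())  # shared layers
        self.actor_head = nn.Linear(128, n_actions)                      # scores for left/right/fire
        self.critic_head = nn.Linear(128, 1)                             # scalar V^pi(s)

    def forward(self, obs: torch.Tensor):
        h = self.shared(obs)
        return torch.softmax(self.actor_head(h), dim=-1), self.critic_head(h)
```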
54
Asynchronous (A3C): each worker 1. copies the global parameters $\theta^1$ (other workers also update the global model in the meantime); 2. samples some data; 3. computes gradients $\Delta\theta$; 4. updates the global model: $\theta^2 \leftarrow \theta^1 + \eta\Delta\theta$. (Image source: 李思叡.)
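A schematic sketch of one worker's four steps; all arguments are assumed callables/containers, since the slide shows only the control flow, not an implementation (real A3C runs many such workers in parallel).

```python
def worker_step(global_params, sample_data, compute_gradients, apply_gradients, eta=1e-3):
    """Assumed interface: global_params is a dict of parameters; the three callables
    sample episodes, compute gradients, and add eta * delta into the global parameters."""
    local_params = dict(global_params)              # 1. copy the global parameters theta
    data = sample_data(local_params)                # 2. sample some data with the local copy
    delta = compute_gradients(local_params, data)   # 3. compute gradients delta-theta
    apply_gradients(global_params, delta, eta)      # 4. update the global model: theta <- theta + eta * delta
    # Other workers repeat the same steps and update the global model asynchronously.
```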
55
Demo of A3C Racing Car (DeepMind)
56
Demo of A3C Visual Doom AI Competition @ CIG 2016
57
Concluding Remarks
Model-free approach: Policy-based (learning an actor), Value-based (learning a critic), and Actor + Critic. Model-based approach. AlphaGo: policy-based + value-based + model-based.