Introduction of Reinforcement Learning

1 Introduction of Reinforcement Learning
Scratching the surface

2 Deep Reinforcement Learning

3 Reference
Textbook: Reinforcement Learning: An Introduction (Sutton and Barto). Lectures of David Silver (10 lectures, around 1:30 each), plus his Deep Reinforcement Learning lectures. Lectures of John Schulman (OpenAI).

4 Scenario of Reinforcement Learning
Diagram: the agent receives an observation (state) from the environment, takes an action that changes the environment, and receives a reward back (here the environment responds "Don't do that", a negative reward).

5 Scenario of Reinforcement Learning
Agent learns to take actions maximizing expected reward. (Same observation/action/reward loop as before; here the environment responds "Thank you", a positive reward.)

6 Machine Learning ≈ Looking for a Function
The actor (policy) is a function: Action = π(Observation). The observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.

7 Learning to play Go
Observation: the board position. Action: the next move. Reward: given by the environment.

8 Learning to play Go
Agent learns to take actions maximizing expected reward. Reward = 0 in most cases; if the agent wins, reward = 1; if it loses, reward = -1.

9 Learning to play Go
Supervised: learning from a teacher, e.g. studying recorded games ("given this board, the next move is 5-5"; "given that board, the next move is 3-3").
Reinforcement learning: learning from experience. From the first move through many moves to the end of a game, the only feedback is win or loss (two agents play against each other).
AlphaGo is supervised learning + reinforcement learning.

10 Learning a chat-bot
The machine obtains feedback from the user, and the chat-bot learns to maximize the expected reward. For example, for the input "Hello" / "How are you?", one reply (e.g. "Hi") may receive reward 3 while another (e.g. "Bye bye") receives -10. Note that the objective here is a reward, i.e. a non-differentiable objective function. Multiple dialogue turns are ignored here.

11 Learning a chat-bot
Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad). "How old are you?" "I am 16." "I thought you were 12." "What makes you think so?" versus "How old are you?" "See you." "See you." "See you."

12 Learning a chat-bot
By this approach, we can generate a lot of dialogues (Dialogue 1, Dialogue 2, ..., Dialogue 8). Use some pre-defined rules to evaluate the goodness of each dialogue; the machine learns from this evaluation. [Deep Reinforcement Learning for Dialogue Generation]

13 Learning a chat-bot
Supervised: learn from labelled pairs, e.g. when hearing "Hello", say "Hi"; when hearing "Bye bye", say "Good bye".
Reinforcement: the agent carries out a whole conversation, and only receives feedback at the end (e.g. "Bad").

14 More applications
Flying a helicopter. Driving. Robot control. Google cuts its giant electricity bill with DeepMind-powered AI. Text generation.

15 Example: Playing Video Game
Widely studied platforms: Gym and Universe. The machine learns to play video games as human players do: what the machine observes is pixels, and it learns to take the proper actions itself.
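To make the observation/action/reward loop concrete, here is a minimal sketch against the classic Gym interface. The environment name, the random placeholder policy, and the 4-tuple return of `env.step` (older Gym versions; Gymnasium differs slightly) are assumptions for illustration only.

```python
import gym

# A minimal interaction loop: the agent observes, acts, and receives a reward.
# Atari-style environments expose raw pixels as observations; CartPole-v1 is
# used here only because it needs no extra dependencies.
env = gym.make("CartPole-v1")

obs = env.reset()                      # start of an episode: observation s_1
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()           # placeholder policy: act at random
    obs, reward, done, info = env.step(action)   # environment returns r_t and s_{t+1}
    total_reward += reward                       # cumulative reward of this episode

print("episode return:", total_reward)
```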

16 Example: Playing Video Game
Space Invaders: the score is the reward, obtained by killing the aliens; the screen also shows the shields and the fire action. Termination: all the aliens are killed, or your spaceship is destroyed.

17 Example: Playing Video Game
Space Invaders: play it yourself, or watch how a machine plays it.

18 Example: Playing Video Game
Start with observation s_1. The machine takes action a_1 ("right") and obtains reward r_1 = 0, then sees observation s_2. It takes action a_2 ("fire", killing an alien), obtains reward r_2 = 5, and sees observation s_3. Usually there is some randomness in the environment.

19 Example: Playing Video Game
Start with observation s_1, then s_2, s_3, and so on. After many turns the machine takes action a_T, obtains reward r_T, and the game is over (the spaceship is destroyed). The whole sequence is an episode. Learn to maximize the expected cumulative reward per episode.

20 Properties of Reinforcement Learning
Reward delay: in Space Invaders, only "fire" obtains a reward, although the moves before "fire" are also important; in Go, it may be better to sacrifice immediate reward to gain more long-term reward.
The agent's actions affect the subsequent data it receives, e.g. exploration is needed.

21 Outline
Model-free approach: policy-based (learning an actor), value-based (learning a critic), and actor + critic.
Model-based approach.
AlphaGo: policy-based + value-based + model-based.

22 Policy-based Approach
Learning an Actor

23 Three Steps for Deep Learning
Step 1: define a set of functions (a neural network as the actor). Step 2: goodness of function. Step 3: pick the best function. Deep Learning is so simple……

24 Neural network as Actor
Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g. pixels).
Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action (e.g. left 0.7, right 0.2, fire 0.1); alternatively, simply take the action with the highest score.
What is the benefit of using a network instead of a lookup table? Generalization.
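As a rough sketch (not the slides' exact architecture), a PyTorch actor network of this shape might look as follows; the layer sizes and the three-action space (left / right / fire) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A neural-network actor: observation in, one output neuron per action.
class Actor(nn.Module):
    def __init__(self, obs_dim=84 * 84, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        logits = self.net(obs)
        # Softmax turns the scores into a probability of taking each action,
        # which is what lets the actor generalize beyond a lookup table.
        return torch.softmax(logits, dim=-1)

actor = Actor()
probs = actor(torch.rand(1, 84 * 84))            # probabilities over left/right/fire
action = torch.distributions.Categorical(probs).sample()
```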

25 Three Steps for Deep Learning
Step 1: define a set of functions (a neural network as the actor). Step 2: goodness of function. Step 3: pick the best function. Deep Learning is so simple……

26 Goodness of Actor
Review: supervised learning. Given a set of parameters θ, a training example (an image of "1") is fed through the network, whose softmax output y_1, ..., y_10 should be as close as possible to the target, giving a loss l for that example. The total loss is L = Σ_{n=1}^{N} l_n, and we look for the parameters that minimize it.

27 Goodness of Actor
Given an actor π_θ(s) with network parameter θ, use the actor π_θ(s) to play the video game: start with observation s_1, the machine decides to take a_1 and obtains reward r_1, sees observation s_2, decides to take a_2 and obtains reward r_2, sees observation s_3, ……, decides to take a_T and obtains reward r_T. The total reward is R_θ = Σ_{t=1}^{T} r_t. Even with the same actor, R_θ is different each time, because of randomness in the actor and in the game. We therefore define R̄_θ as the expected value of R_θ; R̄_θ evaluates the goodness of an actor π_θ(s).

28 Goodness of Actor
We define R̄_θ as the expected value of R_θ. A trajectory is τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T}, and

P(τ|θ) = p(s_1) p(a_1|s_1, θ) p(r_1, s_2|s_1, a_1) p(a_2|s_2, θ) p(r_2, s_3|s_2, a_2) ⋯
       = p(s_1) Π_{t=1}^{T} p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)

The terms p(s_1) and p(r_t, s_{t+1}|s_t, a_t) are not related to your actor; the terms p(a_t|s_t, θ) are controlled by your actor π_θ(s_t), e.g. p(a_t = "fire"|s_t, θ) = 0.7 when the actor outputs left 0.1, right 0.2, fire 0.7.

29 Goodness of Actor
An episode is considered as a trajectory τ = {s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T}, with R(τ) = Σ_{t=1}^{T} r_t. If you use an actor to play the game, each τ has a probability of being sampled, and that probability depends on the actor parameter θ: P(τ|θ). So

R̄_θ = Σ_τ R(τ) P(τ|θ) ≈ (1/N) Σ_{n=1}^{N} R(τ^n)

The sum is over all possible trajectories; in practice we use π_θ to play the game N times, obtaining τ^1, τ^2, ..., τ^N, i.e. sampling τ from P(τ|θ) N times. Why don't we multiply by P(τ|θ) anymore? Because trajectories with higher probability are sampled more often, so sampling already weights them by P(τ|θ).
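A minimal sketch of this sampling approximation, assuming a hypothetical `play_episode` function that runs the current actor π_θ once and returns the total reward R(τ) of that episode:

```python
import numpy as np

def estimate_expected_reward(play_episode, n_episodes=100):
    """Approximate R̄_θ = Σ_τ R(τ) P(τ|θ) by sampling.

    `play_episode` (an assumed helper) plays one episode with the current
    actor π_θ and returns its total reward R(τ).
    """
    returns = [play_episode() for _ in range(n_episodes)]
    return np.mean(returns)        # (1/N) Σ_n R(τ^n)
```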

30 Three Steps for Deep Learning
Step 1: define a set of functions (a neural network as the actor). Step 2: goodness of function. Step 3: pick the best function. Deep Learning is so simple……

31 Gradient Ascent
Problem statement: θ* = arg max_θ R̄_θ, where θ = {w_1, w_2, ..., b_1, ...}.
Gradient ascent: start with θ^0, then θ^1 ← θ^0 + η∇R̄_{θ^0}, θ^2 ← θ^1 + η∇R̄_{θ^1}, ……, where ∇R̄_θ = [∂R̄_θ/∂w_1, ∂R̄_θ/∂w_2, ..., ∂R̄_θ/∂b_1, ...]^T.

32 Policy Gradient
R̄_θ = Σ_τ R(τ) P(τ|θ), so what is ∇R̄_θ?

∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ)
      = Σ_τ R(τ) P(τ|θ) ∇P(τ|θ) / P(τ|θ)
      = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ)          (using d log f(x)/dx = (1/f(x)) df(x)/dx)
      ≈ (1/N) Σ_{n=1}^{N} R(τ^n) ∇log P(τ^n|θ)

where we use π_θ to play the game N times, obtaining τ^1, τ^2, ..., τ^N. Note that R(τ) does not have to be differentiable; it can even be a black box.
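A hedged sketch of how this estimator is typically implemented with automatic differentiation: a surrogate loss whose gradient equals R(τ) Σ_t ∇log p(a_t|s_t, θ) for one sampled episode. The function name and inputs are illustrative assumptions.

```python
import torch

def policy_gradient_loss(log_probs, episode_return):
    """Surrogate loss whose gradient is -R(τ) Σ_t ∇log p(a_t | s_t, θ).

    log_probs: list of log p(a_t | s_t, θ) tensors collected while the actor
    played one episode; episode_return: the scalar R(τ). R(τ) enters only as
    a weight, so it never needs to be differentiable. Minimizing this loss
    is the same as ascending R̄_θ.
    """
    return -episode_return * torch.stack(log_probs).sum()
```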

33 Policy Gradient
∇log P(τ|θ) = ? With τ = {s_1, a_1, r_1, ..., s_T, a_T, r_T} and

P(τ|θ) = p(s_1) Π_{t=1}^{T} p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)

we have

log P(τ|θ) = log p(s_1) + Σ_{t=1}^{T} [log p(a_t|s_t, θ) + log p(r_t, s_{t+1}|s_t, a_t)]

Ignoring the terms not related to θ:

∇log P(τ|θ) = Σ_{t=1}^{T} ∇log p(a_t|s_t, θ)

34 Policy Gradient
Update: θ^{new} ← θ^{old} + η∇R̄_{θ^{old}}, where

∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} R(τ^n) ∇log P(τ^n|θ)
      = (1/N) Σ_{n=1}^{N} R(τ^n) Σ_{t=1}^{T_n} ∇log p(a_t^n|s_t^n, θ)
      = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

If, in τ^n, the machine takes a_t^n when seeing s_t^n and R(τ^n) is positive, tune θ to increase p(a_t^n|s_t^n); if R(τ^n) is negative, tune θ to decrease p(a_t^n|s_t^n). What if we replace R(τ^n) with r_t^n? It is very important to use the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n.

35 Policy Gradient
Alternate between data collection and model update. Data collection: given actor parameter θ, play the game and record the pairs, each tagged with the return of its trajectory, e.g. τ^1: (s_1^1, a_1^1) and (s_2^1, a_2^1) with R(τ^1), ……; τ^2: (s_1^2, a_1^2) and (s_2^2, a_2^2) with R(τ^2), ……. Model update:

θ ← θ + η∇R̄_θ,   ∇R̄_θ = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

Because the gradient estimate is only valid for the actor that generated the data, the data has to be collected again after each update.
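Putting the pieces together, a compact REINFORCE-style sketch of this loop, under the same assumptions as the earlier sketches (classic Gym API, CartPole instead of Space Invaders, illustrative network sizes, learning rate, and N):

```python
import gym
import torch
import torch.nn as nn

# Alternate data collection and model update: collect N episodes with the
# current actor, take one gradient step on θ, then collect again.
env = gym.make("CartPole-v1")
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(actor.parameters(), lr=1e-2)

for iteration in range(200):
    loss, N = 0.0, 10
    for n in range(N):                               # play the game N times with π_θ
        obs, done, log_probs, R = env.reset(), False, [], 0.0
        while not done:
            dist = torch.distributions.Categorical(
                logits=actor(torch.as_tensor(obs, dtype=torch.float32)))
            a = dist.sample()
            log_probs.append(dist.log_prob(a))       # log p(a_t^n | s_t^n, θ)
            obs, r, done, _ = env.step(a.item())
            R += r                                   # R(τ^n)
        loss = loss - R * torch.stack(log_probs).sum()   # weight by R(τ^n)
    opt.zero_grad()
    (loss / N).backward()    # gradient of -(1/N) Σ_n Σ_t R(τ^n) log p(a_t^n|s_t^n, θ)
    opt.step()               # θ ← θ + η∇R̄_θ (ascent, done by minimizing the negation)
```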

36 Policy Gradient
This can be considered as a classification problem. Given s, the network outputs y_i over {left, right, fire}, and the action actually taken plays the role of the target ŷ_i. Ordinary classification minimizes the cross entropy -Σ_{i=1}^{3} ŷ_i log y_i; when the target is "left", this is the same as maximizing log y_{left} = log P("left"|s), i.e. θ ← θ + η∇log P("left"|s).

37 Policy Gradient
θ ← θ + η∇R̄_θ, with ∇R̄_θ = (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ). Given actor parameter θ, each sampled pair becomes a classification example: for τ^1, the pair (s_1^1, a_1^1 = left) means the NN should output "left" (target 1 for left, 0 for right and fire); for τ^2, the pair (s_1^2, a_1^2 = fire) means the NN should output "fire".

38 Policy Gradient
Same update, θ ← θ + η∇R̄_θ, but each training example is weighted by R(τ^n). For example, if R(τ^1) = 2 and R(τ^2) = 1, the pair (s_1^1, a_1^1 = left) is counted with weight 2 (as if it appeared twice in the data), while (s_1^2, a_1^2 = fire) is counted with weight 1.
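This weighting can be written directly as a weighted cross-entropy loss. The following is a sketch with assumed tensor shapes, not the slides' exact code:

```python
import torch
import torch.nn.functional as F

def weighted_classification_loss(logits, actions, returns):
    """Each (s_t^n, a_t^n) pair is an ordinary classification example,
    except that its cross-entropy term is weighted by R(τ^n).

    logits:  network outputs for a batch of observations, shape (B, n_actions)
    actions: the actions actually taken, shape (B,), integer class indices
    returns: R(τ^n) of the episode each pair came from, shape (B,)
    """
    ce = F.cross_entropy(logits, actions, reduction="none")   # -log p(a_t^n | s_t^n, θ)
    return (returns * ce).mean()
```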

39 Add a Baseline
It is possible that R(τ^n) is always positive. Then the update θ^{new} ← θ^{old} + η∇R̄_{θ^{old}} with

∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

increases the probability of every sampled action, differing only in how much. In the ideal case that is fine, but because the output is a probability distribution and we only have samples, the actions a, b, c that happen not to be sampled have their probabilities decreased simply because the sampled ones are pushed up. Subtracting a baseline b fixes this:

∇R̄_θ ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T_n} (R(τ^n) - b) ∇log p(a_t^n|s_t^n, θ)
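A minimal sketch of the baseline trick. Using the mean sampled return as b is one simple choice assumed here; the slides leave the choice of b open, and a learned value function is a common alternative.

```python
import torch

def baselined_weights(returns, baseline=None):
    """Subtract a baseline b from R(τ^n) so the weights are not all positive.

    returns: the sampled episode returns R(τ^1), ..., R(τ^N).
    """
    returns = torch.as_tensor(returns, dtype=torch.float32)
    b = returns.mean() if baseline is None else baseline
    # Actions from better-than-baseline episodes get pushed up,
    # the rest get pushed down.
    return returns - b
```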

40 Value-based Approach
Learning a Critic (e.g. Double Q-learning, Dueling Q-networks)

41 Critic A critic does not determine the action.
Given an actor π, it evaluates how good the actor is. An actor can be found from a critic, e.g. Q-learning.

42 Critic
State value function V^π(s): when using actor π, the cumulated reward expected to be obtained after seeing observation (state) s. V^π takes s as input and outputs a scalar V^π(s). For example, V^π(s) is large for a state with lots of aliens still to be killed and the shields still intact, and smaller for a less favourable state.

43 Critic
V^{the earlier Hikaru}(the large knight's move) = bad; V^{the stronger Hikaru}(the large knight's move) = good. The value of the same state depends on the actor being evaluated.

44 How to estimate V^π(s)
Monte-Carlo based approach: the critic watches π playing the game. After seeing s_a, the cumulated reward until the end of the episode is G_a, so V^π(s_a) is trained to be close to G_a; after seeing s_b, the cumulated reward until the end of the episode is G_b, so V^π(s_b) is trained to be close to G_b.
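A sketch of Monte-Carlo critic training as regression of V^π(s_a) toward the observed return G_a; the network size, optimizer, and input dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Monte-Carlo training of a critic V^π: after watching π finish episodes,
# regress V^π(s) toward the observed return G from each visited state.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def mc_update(states, returns):
    """states: observations from finished episodes, float tensor (B, 4).
    returns: the cumulated reward G observed from each state onward, (B,)."""
    pred = value_net(states).squeeze(-1)       # V^π(s)
    loss = ((pred - returns) ** 2).mean()      # push V^π(s) toward G
    opt.zero_grad()
    loss.backward()
    opt.step()
```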

45 How to estimate V^π(s)
Temporal-difference approach: from a single transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ we have V^π(s_t) = r_t + V^π(s_{t+1}), so train the network so that V^π(s_t) - V^π(s_{t+1}) is close to r_t. An advantage of TD methods over Monte-Carlo methods is that they are naturally implemented in an online, fully incremental fashion: with Monte-Carlo one must wait until the end of an episode, because only then is the return known, whereas with TD one need wait only one time step. Surprisingly often this turns out to be a critical consideration; some applications have very long episodes, so delaying all learning until an episode's end is too slow, and others are continuing tasks with no episodes at all.
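A corresponding TD(0) sketch that needs only a single transition. The helper name, the squared-error loss, and γ = 1 (matching the undiscounted sums on these slides) are assumptions; a discount factor is a common refinement.

```python
import torch

def td_update(value_net, opt, s_t, r_t, s_t1, gamma=1.0):
    """One TD(0) step: enforce V^π(s_t) ≈ r_t + γ V^π(s_{t+1}).

    Unlike Monte-Carlo, this uses only one transition (s_t, a_t, r_t, s_{t+1}),
    so learning can happen online, before the episode ends.
    """
    with torch.no_grad():
        target = r_t + gamma * value_net(s_t1)     # bootstrap from V^π(s_{t+1})
    loss = (value_net(s_t) - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```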

46 MC v.s. TD
Monte-Carlo: V^π(s_a) is trained toward G_a, which is unbiased but has larger variance, because G_a sums many random rewards. Temporal-difference: V^π(s_t) is trained toward r_t + V^π(s_{t+1}), which has smaller variance but may be biased, because V^π(s_{t+1}) is itself an estimate.

47 MC v.s. TD
[Sutton, v2, Example 6.4] The critic has seen the following 8 episodes (the actions are ignored here):
s_a, r = 0, s_b, r = 0, END
s_b, r = 1, END (this episode occurs six times)
s_b, r = 0, END
So V^π(s_b) = 3/4. What is V^π(s_a)? 0? 3/4?
Monte-Carlo: V^π(s_a) = 0, since the only episode containing s_a had total reward 0.
Temporal-difference: V^π(s_a) = V^π(s_b) + r = 3/4 + 0 = 3/4.

48 Another Critic
State-action value function Q^π(s, a): when using actor π, the cumulated reward expected to be obtained after seeing observation s and taking action a. One form takes both s and a as input and outputs a scalar Q^π(s, a); another form takes only s as input and outputs Q^π(s, a = left), Q^π(s, a = right), Q^π(s, a = fire), one value per action, which works for discrete actions only.
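A sketch of the discrete-action form of this critic, with one output neuron per action; the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# State-action value network for discrete actions: input is the observation s,
# and each output neuron is Q^π(s, a) for one action.
class QNetwork(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)        # shape (batch, n_actions): one Q value per action

q = QNetwork()
q_values = q(torch.rand(1, 4))    # Q(s, left), Q(s, right), Q(s, fire)
```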

49 Q-Learning
π interacts with the environment → learn Q^π(s, a) (by TD or MC) → find a new actor π′ "better" than π (add some noise for exploration) → set π = π′ and repeat.

50 Q-Learning
Given Q^π(s, a), find a new actor π′ "better" than π. "Better" means V^{π′}(s) ≥ V^π(s) for all states s. Take π′(s) = arg max_a Q^π(s, a). π′ has no extra parameters; it depends only on Q. This is not suitable for a continuous action a.
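A sketch of deriving π′ from Q, with ε-greedy randomness standing in for "add some noise" on the previous slide; the ε value and the helper name are assumptions.

```python
import torch

def improved_actor(q_net, s, epsilon=0.1):
    """π'(s) = argmax_a Q^π(s, a), with ε-greedy noise for exploration.

    π' has no parameters of its own; it is defined entirely by Q, which is
    why this construction only works for a discrete (enumerable) action set.
    """
    q_values = q_net(s)                                   # Q^π(s, a) for every action
    if torch.rand(1).item() < epsilon:                    # "add some noise": explore
        return torch.randint(q_values.shape[-1], (1,)).item()
    return q_values.argmax(dim=-1).item()                 # greedy action
```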

51 Deep Reinforcement Learning
Actor-Critic: the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.

52 Actor-Critic
π interacts with the environment → learn Q^π(s, a) and V^π(s) (by TD or MC) → update the actor from π to π′ based on Q^π(s, a) and V^π(s) → set π = π′ and repeat.

53 Actor-Critic Tips
The parameters of the actor π(s) and the critic V^π(s) can be shared: a common network processes s, then one branch outputs the left/right/fire probabilities and another branch outputs the scalar V^π(s).
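A sketch of this parameter-sharing tip: a shared trunk with a policy head and a value head. The dense trunk and the sizes are illustrative; for pixel input a convolutional trunk would be more typical.

```python
import torch
import torch.nn as nn

# Actor and critic share a trunk and differ only in their output heads.
class ActorCritic(nn.Module):
    def __init__(self, obs_dim=4, n_actions=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)   # left / right / fire scores
        self.value_head = nn.Linear(128, 1)            # scalar V^π(s)

    def forward(self, s):
        h = self.trunk(s)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)

probs, value = ActorCritic()(torch.rand(1, 4))
```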

54 Asynchronous (as in A3C)
1. Copy the global parameters θ^1.
2. Sample some data.
3. Compute gradients ∆θ.
4. Update the global model: θ^1 + η∆θ (other workers also update the global model in the meantime).
(Credit: 李思叡)

55 Demo of A3C Racing Car (DeepMind)

56 Demo of A3C Visual Doom AI Competition @ CIG 2016

57 Concluding Remarks
Model-free approach: policy-based (learning an actor), value-based (learning a critic), and actor + critic.
Model-based approach.
AlphaGo: policy-based + value-based + model-based.

