Introduction of Reinforcement Learning


Introduction of Reinforcement Learning: Scratching the Surface

Deep Reinforcement Learning https://deepmind.com/research/dqn/

References. Textbook: Reinforcement Learning: An Introduction, http://incompleteideas.net/sutton/book/the-book.html. Lectures of David Silver: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each); deep reinforcement learning lecture: http://videolectures.net/rldm2015_silver_reinforcement_learning/. Lectures of John Schulman (OpenAI): https://youtu.be/aUrX-rP_ss4

Scenario of Reinforcement Learning: the agent observes the environment (the observation is the state), takes an action that changes the environment, and receives a reward from the environment, here a negative one ("Don't do that").

Scenario of Reinforcement Learning: the agent learns to take actions maximizing expected reward. Again the agent observes the state, takes an action that changes the environment, and this time receives a positive reward ("Thank you").

Machine Learning ≈ Looking for a Function. The actor/policy is a function π: Action = π(Observation). The observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.

Learning to play Go: the observation is the board position from the environment, and the action is the next move.

Learning to play Go: the agent learns to take actions maximizing expected reward. The reward is 0 in most cases; if the agent wins, the reward is 1, and if it loses, the reward is -1.

Learning to play Go. Supervised learning is learning from a teacher, i.e., studying game records: for this board the next move is "5-5", for that board the next move is "3-3". Reinforcement learning is learning from experience: from the first move through many moves until a win, with two agents playing against each other. AlphaGo is supervised learning + reinforcement learning.

Learning a chat-bot: the machine obtains feedback from the user, and the chat-bot learns to maximize the expected reward. For example, after the user says "Hello", replying "How are you?" earns reward 3, while replying "Bye bye" earns reward -10. Note that this reward is a non-differentiable objective function (multiple dialogue turns are ignored here).

Learning a chat-bot: let two agents talk to each other; sometimes they generate good dialogue, sometimes bad. Good: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?" Bad: "How old are you?" / "See you." / "See you." / "See you."

Learning a chat-bot: by this approach we can generate a lot of dialogues (Dialogue 1, Dialogue 2, …, Dialogue 8). Some pre-defined rules evaluate the goodness of each dialogue, and the machine learns from that evaluation. (Deep Reinforcement Learning for Dialogue Generation, https://arxiv.org/pdf/1606.01541v3.pdf)

Learning a chat-bot. Supervised: for "Hello", say "Hi"; for "Bye bye", say "Good bye". Reinforcement: the agent talks with a person through a whole dialogue starting from "Hello", and at the end only learns that the dialogue was bad.

More applications: flying helicopter (https://www.youtube.com/watch?v=0JL04JJjocc), driving (https://www.youtube.com/watch?v=0xo1Ldx3L5Q), robot (https://www.youtube.com/watch?v=370cT-OAzzM), reducing Google's electricity bill ("Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI", http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai), and text generation (https://www.youtube.com/watch?v=pbQ4qe8EwLo).

Example: Playing Video Games. Widely studied platforms: Gym (https://gym.openai.com/) and Universe (https://openai.com/blog/universe/). The machine learns to play video games the way human players do: what the machine observes is pixels, and it learns to take the proper actions itself.

Example: Playing Video Games — Space Invaders. The score is the reward, earned by killing the aliens; the player controls a spaceship that can move and fire, protected by shields. Termination: all the aliens are killed, or your spaceship is destroyed.

Example: Playing Video Games — Space Invaders. Play it yourself: http://www.2600online.com/spaceinvaders.html. How a machine plays: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw

Example: Playing Video Games. Start with observation $s_1$; take action $a_1$ = "right" and obtain reward $r_1 = 0$; see observation $s_2$; take action $a_2$ = "fire" (kill an alien) and obtain reward $r_2 = 5$; see observation $s_3$; and so on. Usually there is some randomness in the environment.

Example: Playing Video Games. Starting from observation $s_1$ and continuing through $s_2$, $s_3$, …, after many turns the machine takes action $a_T$, obtains reward $r_T$, and the game is over (the spaceship is destroyed). This whole sequence is an episode. The machine learns to maximize the expected cumulative reward per episode.
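As a rough sketch of such an episode loop, here is a minimal example using the classic Gym API of the toolkit mentioned above (newer Gymnasium releases return (obs, info) from reset and five values from step; the environment name is illustrative, and the random policy stands in for the actor):

```python
import gym

env = gym.make("SpaceInvaders-v0")      # illustrative; any discrete-action Gym env works
obs = env.reset()                        # start with observation s_1
done, total_reward = False, 0.0
while not done:                          # one episode: play until game over
    action = env.action_space.sample()   # stand-in for the actor pi(observation)
    obs, reward, done, info = env.step(action)
    total_reward += reward               # cumulative reward R = sum_t r_t
print("Cumulative reward of this episode:", total_reward)
```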

Properties of Reinforcement Learning. Reward delay: in Space Invaders, only "fire" obtains a reward, although the moves before "fire" are important; in Go, it may be better to sacrifice immediate reward to gain more long-term reward. The agent's actions also affect the subsequent data it receives, which is why exploration matters.

Outline. Model-free approaches: policy-based (learning an actor), value-based (learning a critic), and actor + critic. Model-based approach. AlphaGo combines policy-based, value-based, and model-based methods.

Policy-based Approach: Learning an Actor

Three Steps for Deep Learning. Step 1: define a set of functions — here, a neural network as the actor. Step 2: goodness of a function. Step 3: pick the best function. Deep learning is so simple……

Neural network as actor. The input of the neural network is the machine's observation, represented as a vector or a matrix (e.g., the pixels of the screen). In the output layer, each action corresponds to one neuron, and the outputs give the probability of taking each action, e.g., left 0.7, right 0.2, fire 0.1 (one may also simply take the action with the highest score). What is the benefit of using a network instead of a lookup table? Generalization.
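As a concrete illustration of such an actor network, here is a minimal sketch in PyTorch (the PolicyNet name, layer sizes, and flattened-pixel input are assumptions for illustration, not from the slides):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps an observation (flattened pixels) to a probability for each action."""
    def __init__(self, obs_dim=84 * 84, n_actions=3):   # 3 actions: left, right, fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs):
        return torch.softmax(self.net(obs), dim=-1)      # e.g. [0.7, 0.2, 0.1]

# Usage: sample an action for one observation instead of always taking the arg-max.
policy = PolicyNet()
obs = torch.rand(1, 84 * 84)                 # placeholder observation
probs = policy(obs)
action = torch.multinomial(probs, 1)         # sample an action from the distribution
```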

Three Steps for Deep Learning (recap). Step 1: define a set of functions — a neural network as the actor. Step 2: goodness of a function (next). Step 3: pick the best function.

Goodness of Actor. Review of supervised learning: given a set of parameters $\theta$ and a training example, e.g., an image of the digit "1", the softmax outputs $y_1, y_2, \dots, y_{10}$ should be as close as possible to the target, which defines a per-example loss $l$; the total loss is $L = \sum_{n=1}^{N} l_n$.

Goodness of Actor. Given an actor $\pi_\theta(s)$ with network parameters $\theta$, use the actor to play the video game: start with observation $s_1$, the machine decides to take $a_1$ and obtains reward $r_1$, sees observation $s_2$, takes $a_2$ and obtains $r_2$, sees $s_3$, …, takes $a_T$, obtains $r_T$, and the episode ends. The total reward is $R_\theta = \sum_{t=1}^{T} r_t$. Even with the same actor, $R_\theta$ is different each time because of randomness in the actor and in the game, so we define $\bar{R}_\theta$, the expected value of $R_\theta$. $\bar{R}_\theta$ evaluates the goodness of the actor $\pi_\theta(s)$.

Goodness of Actor. We define $\bar{R}_\theta$ as the expected value of $R_\theta$. A trajectory is $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$, and its probability under the actor is
$$P(\tau|\theta) = p(s_1)\, p(a_1|s_1,\theta)\, p(r_1, s_2|s_1, a_1)\, p(a_2|s_2,\theta)\, p(r_2, s_3|s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t).$$
The terms $p(s_1)$ and $p(r_t, s_{t+1}|s_t, a_t)$ are not related to your actor, while $p(a_t|s_t,\theta)$ is controlled by your actor $\pi_\theta(s_t)$; for example, if the actor outputs left 0.1, right 0.2, fire 0.7, then $p(a_t = \text{"fire"}|s_t,\theta) = 0.7$.

Goodness of Actor. An episode is considered as a trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$ with total reward $R(\tau) = \sum_{t=1}^{T} r_t$. If you use an actor to play the game, each $\tau$ has a probability of being sampled, and that probability depends on the actor parameters $\theta$: $P(\tau|\theta)$. Summing over all possible trajectories,
$$\bar{R}_\theta = \sum_{\tau} R(\tau)\, P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n),$$
where the approximation uses $\pi_\theta$ to play the game $N$ times, sampling $\tau^1, \tau^2, \dots, \tau^N$ from $P(\tau|\theta)$. (Why don't we multiply by $P(\tau|\theta)$ anymore? Because sampling from $P(\tau|\theta)$ already accounts for it: likely trajectories appear more often among the samples.)

Three Steps for Deep Learning (recap). Step 1: define a set of functions — a neural network as the actor. Step 2: goodness of a function. Step 3: pick the best function (next).

Gradient Ascent. Problem statement: $\theta^* = \arg\max_\theta \bar{R}_\theta$, where $\theta = \{w_1, w_2, \dots, b_1, \dots\}$. Gradient ascent: start with $\theta^0$, then update $\theta^1 \leftarrow \theta^0 + \eta \nabla \bar{R}_{\theta^0}$, $\theta^2 \leftarrow \theta^1 + \eta \nabla \bar{R}_{\theta^1}$, …, where
$$\nabla \bar{R}_\theta = \begin{bmatrix} \partial \bar{R}_\theta / \partial w_1 \\ \partial \bar{R}_\theta / \partial w_2 \\ \vdots \\ \partial \bar{R}_\theta / \partial b_1 \\ \vdots \end{bmatrix}.$$

Policy Gradient. Since $\bar{R}_\theta = \sum_{\tau} R(\tau)\, P(\tau|\theta)$, the gradient is
$$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau)\, \nabla P(\tau|\theta) = \sum_{\tau} R(\tau)\, P(\tau|\theta)\, \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_{\tau} R(\tau)\, P(\tau|\theta)\, \nabla \log P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n|\theta),$$
using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$, where the last step uses $\pi_\theta$ to play the game $N$ times to obtain $\tau^1, \tau^2, \dots, \tau^N$. Note that $R(\tau)$ does not have to be differentiable; it can even be a black box.

Policy Gradient. What is $\nabla \log P(\tau|\theta)$? Since $P(\tau|\theta) = p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t)$ for $\tau = \{s_1, a_1, r_1, \dots, s_T, a_T, r_T\}$,
$$\log P(\tau|\theta) = \log p(s_1) + \sum_{t=1}^{T} \big[ \log p(a_t|s_t,\theta) + \log p(r_t, s_{t+1}|s_t, a_t) \big],$$
and, ignoring the terms not related to $\theta$,
$$\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t,\theta).$$

Policy Gradient. The update is $\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$, where
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log P(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n|s_t^n,\theta).$$
If the machine takes $a_t^n$ when seeing $s_t^n$ in $\tau^n$ and $R(\tau^n)$ is positive, we tune $\theta$ to increase $p(a_t^n|s_t^n)$; if $R(\tau^n)$ is negative, we tune $\theta$ to decrease $p(a_t^n|s_t^n)$. What if we replaced $R(\tau^n)$ with the immediate reward $r_t^n$? Then only the actions taken right before a reward would ever be reinforced, so it is very important to weight each step by the cumulative reward $R(\tau^n)$ of the whole trajectory $\tau^n$ instead of the immediate reward $r_t^n$.
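A minimal sketch of this update in PyTorch (reusing the hypothetical PolicyNet-style actor from above; the data layout and function name are assumptions):

```python
import torch

def policy_gradient_step(policy, optimizer, trajectories):
    """trajectories: list of (obs, actions, R) per episode, where obs has shape (T, obs_dim),
    actions has shape (T,) with integer action indices, and R is the total reward R(tau)."""
    loss = 0.0
    for obs, actions, R in trajectories:
        probs = policy(obs)                                           # (T, n_actions)
        log_p = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
        loss = loss - R * log_p.sum()        # maximizing R(tau) * sum_t log p(a_t|s_t, theta)
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()                          # autograd supplies grad log p(a_t|s_t, theta)
    optimizer.step()                         # ascent on R_bar via descent on -loss
```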

Policy Gradient in practice alternates between data collection and model update. Data collection: given the actor parameters $\theta$, play the game to collect trajectories $\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \dots$ with reward $R(\tau^1)$; $\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \dots$ with reward $R(\tau^2)$; and so on. Model update: $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$ with $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n|s_t^n,\theta)$, after which new data is collected with the updated actor.

Policy Gradient considered as a classification problem. Treat the state $s_1$ as the input and the taken action (left / right / fire) as the class label $\hat{y}$. Minimizing the cross-entropy $-\sum_{i=1}^{3} \hat{y}_i \log y_i$ is the same as maximizing $\log y_i = \log P(\text{"left"}|s)$ when the target action is "left", which gives the update $\theta \leftarrow \theta + \eta \nabla \log P(\text{"left"}|s)$.

In this view, each sampled pair $(s_t^n, a_t^n)$ from the collected trajectories becomes one classification training example: given actor parameters $\theta$, for $(s_1^1, a_1^1 = \text{left})$ the NN target over (left, right, fire) puts 1 on "left", and for $(s_1^2, a_1^2 = \text{fire})$ the target puts 1 on "fire". Training on these examples implements $\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$ with $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p(a_t^n|s_t^n,\theta)$.

The only difference from ordinary classification is that each training example is weighted by $R(\tau^n)$: if $R(\tau^1) = 2$, the pair $(s_1^1, a_1^1 = \text{left})$ is counted as if it appeared twice in the training data, while a pair from $\tau^2$ with $R(\tau^2) = 1$ is counted once.
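The same computation can be written directly as a weighted classification loss (a sketch under the same assumptions, with the (state, action) pairs from all trajectories flattened into one batch):

```python
import torch

def weighted_classification_step(policy, optimizer, obs, actions, weights):
    """obs: (M, obs_dim); actions: (M,) taken actions treated as class labels;
    weights: (M,) equal to R(tau^n) of the trajectory each pair came from."""
    probs = policy(obs)
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
    loss = -(weights * log_p).mean()     # cross-entropy on taken actions, weighted by R(tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```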

Add a Baseline. It is possible that $R(\tau^n)$ is always positive (e.g., a game score that is never negative). In the ideal case where all actions are considered, this is fine: every action's probability is pushed up, but by different amounts, and after normalization the probabilities of the less rewarded actions still decrease. With sampling, however, an action that is never sampled gets no push at all, so its probability decreases even if it is a good action. The fix is to subtract a baseline $b$, updating $\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$ with
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big( R(\tau^n) - b \big)\, \nabla \log p(a_t^n|s_t^n,\theta),$$
so that trajectories whose reward is below the baseline push their actions' probabilities down.
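One simple choice of baseline is the average total reward of the sampled trajectories (just one option; this sketch modifies the hypothetical policy_gradient_step above):

```python
import torch

def policy_gradient_step_with_baseline(policy, optimizer, trajectories):
    """trajectories: list of (obs, actions, R) per episode, as before."""
    rewards = [R for _, _, R in trajectories]
    b = sum(rewards) / len(rewards)                    # baseline: mean total reward
    loss = 0.0
    for obs, actions, R in trajectories:
        probs = policy(obs)
        log_p = torch.log(probs.gather(1, actions.unsqueeze(1))).squeeze(1)
        loss = loss - (R - b) * log_p.sum()            # weight each step by (R(tau) - b)
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```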

Value-based Approach: Learning a Critic (variants include Double Q and Dueling Q).

Critic. A critic does not determine the action. Given an actor π, it evaluates how good the actor is. An actor can be found from a critic, e.g., by Q-learning.

Critic: state value function $V^\pi(s)$ — when using actor $\pi$, the cumulative reward expected to be obtained after seeing observation (state) $s$. The critic $V^\pi$ maps a state $s$ to a scalar. For example, when lots of aliens remain to be killed and the shields still exist, $V^\pi(s)$ is large; for a later, less promising state, $V^\pi(s)$ is smaller.

Critic: the value of a state depends on the actor. Using an example from Hikaru no Go: $V^{\text{earlier Hikaru}}(\text{large knight's move}) = \text{bad}$, but $V^{\text{stronger Hikaru}}(\text{large knight's move}) = \text{good}$ — the same move is bad for the weaker player and good for the stronger one.

How to estimate $V^\pi(s)$: Monte-Carlo (MC) based approach. The critic watches $\pi$ playing the game. After seeing $s_a$, the cumulative reward until the end of the episode is $G_a$, so the critic is trained so that $V^\pi(s_a) \approx G_a$; after seeing $s_b$, the cumulative reward until the end of the episode is $G_b$, so $V^\pi(s_b) \approx G_b$.
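A tabular sketch of this Monte-Carlo estimate (every-visit and undiscounted to match the slides; the episode encoding is an assumption):

```python
from collections import defaultdict

def mc_value_estimate(episodes):
    """episodes: list of episodes, each a list of (state, reward) pairs in time order.
    Returns a dict mapping state -> Monte-Carlo estimate of V^pi(state)."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        for state, reward in reversed(episode):   # accumulate reward from this state to the end
            G += reward
            returns[state].append(G)              # one observed return G for this state
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```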

How to estimate $V^\pi(s)$: Temporal-difference (TD) approach. From a single transition $\cdots, s_t, a_t, r_t, s_{t+1}, \cdots$ we know that $V^\pi(s_t) = r_t + V^\pi(s_{t+1})$, so the critic is trained so that the difference of its outputs matches the reward: $V^\pi(s_t) - V^\pi(s_{t+1}) \approx r_t$. An advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an online, fully incremental fashion: with Monte Carlo one must wait until the end of an episode, because only then is the return known, whereas with TD one need wait only one time step. Surprisingly often this turns out to be a critical consideration: some applications have very long episodes, so that delaying all learning until an episode's end is too slow, and other applications are continuing tasks with no episodes at all.
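And a tabular sketch of the corresponding TD(0) update for a single transition (undiscounted, with an illustrative learning rate):

```python
def td_update(V, transition, alpha=0.1):
    """V: dict mapping state -> current estimate of V^pi(state);
    transition: (s_t, r_t, s_next) observed while the actor plays."""
    s, r, s_next = transition
    v_s = V.get(s, 0.0)
    target = r + V.get(s_next, 0.0)       # r_t + V^pi(s_{t+1})
    V[s] = v_s + alpha * (target - v_s)   # move V^pi(s_t) toward the TD target
    return V
```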

MC vs. TD. MC fits $V^\pi(s_a)$ to $G_a$, which is unbiased but has larger variance, since $G_a$ sums many random rewards. TD fits $V^\pi(s_t)$ to $r_t + V^\pi(s_{t+1})$, which has smaller variance (only the single reward $r_t$ is random) but may be biased, because $V^\pi(s_{t+1})$ is itself an estimate.

MC vs. TD [Sutton, v2, Example 6.4]. The critic has observed the following 8 episodes (actions are ignored here): $s_a, r=0, s_b, r=0$, END; $s_b, r=1$, END (six such episodes); $s_b, r=0$, END. Both methods give $V^\pi(s_b) = 3/4$. What is $V^\pi(s_a)$: 0 or 3/4? Monte-Carlo: $V^\pi(s_a) = 0$, because the only episode containing $s_a$ has cumulative reward 0. Temporal-difference: $V^\pi(s_a) = r + V^\pi(s_b) = 0 + 3/4 = 3/4$.
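For concreteness, a quick check of this example in code (the episode encoding is mine; the numbers reproduce the 0 vs. 3/4 answers above):

```python
# The eight episodes, each as a list of (state, reward) pairs.
episodes = [[("a", 0.0), ("b", 0.0)]] + [[("b", 1.0)]] * 6 + [[("b", 0.0)]]

# Monte-Carlo: average the return observed after each visit to a state.
returns = {"a": [], "b": []}
for episode in episodes:
    G = 0.0
    for state, reward in reversed(episode):   # return from this state to the end of the episode
        G += reward
        returns[state].append(G)
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}   # {"a": 0.0, "b": 0.75}

# Temporal-difference view: V(s_a) = r + V(s_b) = 0 + 3/4.
V_td_a = 0.0 + V_mc["b"]
print(V_mc["a"], V_td_a)   # 0.0 vs 0.75
```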

Another Critic: state-action value function $Q^\pi(s,a)$ — when using actor $\pi$, the cumulative reward expected to be obtained after seeing observation $s$ and taking action $a$. One network design takes both $s$ and $a$ as input and outputs a scalar $Q^\pi(s,a)$; another takes only $s$ as input and outputs $Q^\pi(s, a=\text{left})$, $Q^\pi(s, a=\text{right})$, and $Q^\pi(s, a=\text{fire})$, one value per action (this design is for discrete actions only).

Q-Learning loop: $\pi$ interacts with the environment; learn $Q^\pi(s,a)$ by TD or MC; find a new actor $\pi'$ that is "better" than $\pi$ (perhaps adding some noise for exploration); set $\pi = \pi'$ and repeat.

Q-Learning. Given $Q^\pi(s,a)$, find a new actor $\pi'$ that is "better" than $\pi$, where "better" means $V^{\pi'}(s) \ge V^\pi(s)$ for all states $s$. The new actor is $\pi'(s) = \arg\max_a Q^\pi(s,a)$; $\pi'$ has no extra parameters, since it depends only on $Q$. (This is not suitable for continuous actions $a$.)
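A minimal tabular sketch of this idea, using standard Q-learning with an ε-greedy actor for exploration (the environment interface follows the classic Gym API; the discount factor and other hyperparameters are illustrative assumptions, and states are assumed hashable):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Classic Gym API assumed: env.reset() -> state, env.step(a) -> (state, reward, done, info)."""
    Q = defaultdict(float)                                   # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                        # epsilon-greedy exploration ("noise")
                a = random.randrange(n_actions)
            else:                                            # otherwise act as pi'(s) = argmax_a Q(s, a)
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s_next, r, done, _ = env.step(a)
            best_next = max(Q[(s_next, act)] for act in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])        # TD-style update of Q
            s = s_next
    return Q
```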

Deep Reinforcement Learning: Actor-Critic. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods.

Actor-Critic loop: $\pi$ interacts with the environment; learn $Q^\pi(s,a)$ and $V^\pi(s)$ by TD or MC; update the actor from $\pi$ to $\pi'$ based on $Q^\pi(s,a)$ and $V^\pi(s)$; set $\pi = \pi'$ and repeat.

Actor-Critic tips: the parameters of the actor $\pi(s)$ and the critic $V^\pi(s)$ can be shared. A common design feeds the state $s$ through a shared network, from which one head outputs the action probabilities (left, right, fire) and another head outputs the scalar $V^\pi(s)$; a sketch of such a network follows.
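A minimal sketch of such a shared actor-critic network in PyTorch (names, sizes, and the flattened-pixel input are assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared torso with two heads: action probabilities (actor) and V^pi(s) (critic)."""
    def __init__(self, obs_dim=84 * 84, n_actions=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.policy_head = nn.Linear(128, n_actions)   # actor head: left / right / fire
        self.value_head = nn.Linear(128, 1)            # critic head: scalar V^pi(s)

    def forward(self, obs):
        h = self.shared(obs)                           # features shared by actor and critic
        probs = torch.softmax(self.policy_head(h), dim=-1)
        value = self.value_head(h).squeeze(-1)
        return probs, value
```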

Asynchronous (A3C-style training). Each worker repeatedly: 1. copies the global parameters $\theta^1$ (other workers also update the model); 2. samples some data; 3. computes gradients $\Delta\theta$; 4. updates the global model with $+\eta \Delta\theta$ — by which time the global parameters may already have moved on to $\theta^2$. (Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9; slide credit: 李思叡)
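A schematic sketch of this asynchronous update pattern, not a full A3C implementation (the parameter vector, gradient computation, and thread count are placeholders):

```python
import threading

global_theta = [0.0] * 10       # stand-in for the global model parameters
lock = threading.Lock()
eta = 0.01                      # learning rate (illustrative)

def compute_gradient(local_theta):
    """Placeholder for: sample data with the local copy, compute actor/critic gradients."""
    return [0.001 for _ in local_theta]

def worker(steps=100):
    for _ in range(steps):
        with lock:
            local_theta = list(global_theta)     # 1. copy the global parameters
        delta = compute_gradient(local_theta)    # 2-3. sample some data, compute gradients
        with lock:                               # 4. update the global model (it may have changed)
            for i, d in enumerate(delta):
                global_theta[i] += eta * d

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```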

Demo of A3C: Racing Car (DeepMind), https://www.youtube.com/watch?v=0xo1Ldx3L5Q

Demo of A3C: Visual Doom AI Competition @ CIG 2016, https://www.youtube.com/watch?v=94EPSjQH38Y

Concluding Remarks. Model-free approaches: policy-based (learning an actor), value-based (learning a critic), and actor + critic. Model-based approach. AlphaGo combines policy-based, value-based, and model-based methods.