"Playing Atari with deep reinforcement learning." Mnih, Volodymyr, et al. Presented by Moshe Abuhasira, Andrey Gurevich, Shachar Schnapp
So far… Supervised Learning. Data: (x, y), where x is the data and y is the label. Goal: learn a function to map x -> y. Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.
So far… Unsupervised Learning. Data: x (just data, no labels!). Goal: learn some underlying hidden structure of the data. Examples: clustering, dimensionality reduction, feature learning, density estimation (1-d, 2-d), etc.
Reinforcement Learning Problems involving an agent interacting with an environment, which provides numeric reward signals Goal: Learn how to take actions in order to maximize reward
Overview What is Reinforcement Learning? Markov Decision Processes Q-Learning
Reinforcement Learning Agent Environment
Reinforcement Learning Agent State st Environment
Reinforcement Learning Agent State st Action at Environment
Reinforcement Learning Agent State st Reward rt Action at Environment
Reinforcement Learning Agent State st Reward rt Action at Next state st+1 Environment
Cart-Pole Problem Objective: Balance a pole on top of a movable cart State: angle, angular speed, position, horizontal velocity Action: horizontal force applied on the cart Reward: 1 at each time step if the pole is upright
Robot Locomotion Objective: Make the robot move forward State: Angle and position of the joints Action: Torques applied on the joints Reward: 1 at each time step for staying upright + moving forward
Atari Games Objective: Complete the game with the highest score State: Raw pixel inputs of the game state Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step
Go Objective: Win the game! State: Position of all pieces Action: Where to put the next piece down Reward: 1 if win at the end of the game, 0 otherwise
How can we mathematically formalize the RL problem? Agent State st Reward rt Action at Next state st+1 Environment
Markov Decision Process Mathematical formulation of the RL problem Markov property: the current state completely characterises the state of the world Defined by the tuple (S, A, R, P, γ): S: set of possible states; A: set of possible actions; R: distribution of reward given a (state, action) pair; P: transition probability, i.e. distribution over the next state given a (state, action) pair; γ: discount factor
Markov Decision Process At time step t=0, the environment samples an initial state s0 ~ p(s0) Then, for t=0 until done: the agent selects action at; the environment samples reward rt ~ R(· | st, at); the environment samples next state st+1 ~ P(· | st, at); the agent receives reward rt and next state st+1 A policy π is a function from S to A that specifies what action to take in each state Objective: find the policy π* that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$
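A minimal Python sketch of this interaction loop and the discounted return it maximizes; `env` and `policy` are hypothetical stand-ins with the reset/step interface described above, and the discount value is an assumption.

```python
GAMMA = 0.99  # discount factor gamma (assumed value)

def rollout(env, policy, max_steps=1000):
    """Run one episode and return the discounted sum of rewards, sum_t gamma^t * r_t."""
    state = env.reset()                         # environment samples s0 ~ p(s0)
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # agent selects a_t = pi(s_t)
        state, reward, done = env.step(action)  # environment samples r_t and s_{t+1}
        ret += discount * reward                # accumulate gamma^t * r_t
        discount *= GAMMA
        if done:
            break
    return ret
```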
A simple MDP: Grid World States: cells of the grid; terminal states (greyed out) marked ★ Actions = {right, left, up, down} Set a negative “reward” for each transition (e.g. r = -1) Objective: reach one of the terminal states in the least number of actions
A simple MDP: Grid World [figure: Random Policy vs. Optimal Policy on the grid]
The optimal policy 𝝅* We want to find the optimal policy 𝝅* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, …)? Maximize the expected sum of rewards! Formally: $\pi^* = \arg\max_{\pi} \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$ with $s_0 \sim p(s_0),\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)$
Definitions: Value function and Q-value function Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, … How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: $V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$ How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy: $Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
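These expectations can be approximated by simple Monte Carlo rollouts. A sketch under the same hypothetical `env`/`policy` interface as above; the ability to reset the environment to a chosen start state is also an assumption.

```python
def estimate_value(env, policy, state, gamma=0.99, num_rollouts=100, max_steps=200):
    """Monte Carlo estimate of V^pi(state): average the discounted return over sampled trajectories.
    (Estimating Q^pi(state, action) is analogous: force the first action, then follow the policy.)"""
    total = 0.0
    for _ in range(num_rollouts):
        s = env.reset(start_state=state)        # assumed: env can be reset to a given state
        ret, discount = 0.0, 1.0
        for _ in range(max_steps):
            s, r, done = env.step(policy(s))    # follow the policy from state s
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += ret
    return total / num_rollouts                 # sample mean approximates the expectation
```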
Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: $Q^*(s, a) = \max_{\pi} \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$ Q* satisfies the following Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$ Intuition: if the optimal state-action values for the next time step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$ The optimal policy π* corresponds to taking the best action in any state as specified by Q*
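In code, the last point is just an argmax over Q*. A tiny sketch assuming a tabular `Q` dictionary mapping (state, action) pairs to values and a finite `actions` collection (both hypothetical names).

```python
def greedy_policy(Q, actions):
    """Return pi*(s) = argmax_a Q*(s, a) as a callable policy."""
    def pi_star(s):
        return max(actions, key=lambda a: Q[(s, a)])
    return pi_star
```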
Solving for the optimal policy Value iteration algorithm: use the Bellman equation as an iterative update: $Q_{i+1}(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$ Qi will converge to Q* as i -> infinity What’s the problem with this? We must compute Q(s, a) for every state-action pair; if there are many states s we cannot visit them all, so generalization is needed Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
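For small MDPs such as Grid World, the value iteration update can be run exactly over a table. A minimal sketch, assuming the MDP is given explicitly as `states`, `actions`, and `transitions[(s, a)]`, a list of (probability, reward, next_state) tuples (all hypothetical names; terminal states simply have no outgoing transitions).

```python
def q_value_iteration(states, actions, transitions, gamma=0.99, iters=100):
    """Tabular Q-value iteration: repeatedly apply the Bellman update."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        new_Q = {}
        for s in states:
            for a in actions:
                # Bellman update: Q_{i+1}(s, a) = E[ r + gamma * max_a' Q_i(s', a') ]
                new_Q[(s, a)] = sum(
                    p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for p, r, s2 in transitions.get((s, a), [])
                )
        Q = new_Q
    return Q
```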
Q-learning Q-learning: use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$, where θ are the function parameters (weights) If the function approximator is a deep neural network => deep Q-learning!
Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$ Forward pass Loss function: $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[(y_i - Q(s, a; \theta_i))^2\right]$ where $y_i = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target value the Q-function should be close to if it corresponds to the optimal Q* (and optimal policy π*) Backward pass Gradient update (with respect to the Q-function parameters θ): $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot),\, s'}\!\left[\left(y_i - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$
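A hedged PyTorch-style sketch of one such forward/backward pass on a minibatch of transitions (s, a, r, s', done). `q_net` and `optimizer` are assumed to be any network mapping states to a vector of Q-values and its optimizer; for simplicity the target here is computed with the same network rather than a frozen copy of the previous parameters θ_{i-1}.

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on the squared Bellman error for a batch of transitions."""
    states, actions, rewards, next_states, dones = batch   # tensors; actions: int64, dones: 0/1 floats
    # Forward pass: Q(s, a; theta) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y = r + gamma * max_a' Q(s', a'), with no gradient flowing through the target
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)                         # L(theta) = E[(y - Q(s, a; theta))^2]
    # Backward pass: gradient update with respect to the Q-function parameters theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```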
Exploration vs. exploitation: decision making We need to trade off exploration (gathering data about the rewards of different actions) and exploitation (making decisions based on the data already gathered) The greedy policy does not explore sufficiently Exploration: choose an action never used before in this state Exploitation: choose the action with the highest expected reward
Epsilon greedy Set 𝜺(𝒕) ∈ [0, 1] With prob. 𝜺(𝒕): explore by picking an action chosen uniformly at random With prob. 𝟏 − 𝜺(𝒕): exploit by picking the action with the highest expected reward
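A minimal sketch of epsilon-greedy action selection over the estimated Q-values of the current state (function and argument names are illustrative).

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: list of estimated Q(s, a), one entry per action; epsilon: exploration probability."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore: uniformly random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: highest estimated Q-value
```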
Playing Atari Games [Mnih et al. NIPS Workshop 2013; Nature 2015] Objective: Complete the game with the highest score State: Raw pixel inputs of the game state Action: Game controls, e.g. Left, Right, Up, Down, Fire Reward: Score increase/decrease at each time step
Q-network Architecture [Mnih et al. NIPS Workshop 2013; Nature 2015] Q(s, a; θ): a neural network with weights θ Input: current state st, an 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping) Familiar conv and FC layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values) The last FC layer has a 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4); the number of actions is between 4 and 18 depending on the Atari game A single feedforward pass computes the Q-values for all actions from the current state = efficient!
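A hedged PyTorch sketch of this architecture with the layer sizes listed above (84x84x4 input, 16 8x8 conv stride 4, 32 4x4 conv stride 2, FC-256, FC-num_actions); the ReLU activations and default initialization here are assumptions of the sketch, not a claim about the authors' exact implementation.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-4 (Q-values), one per action
        )

    def forward(self, x):  # x: (batch, 4, 84, 84)
        return self.head(self.features(x))
```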
Training the Q-network: Loss function [Mnih et al. NIPS Workshop 2013; Nature 2015] Bellman equation: $Q^*(s, a) = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$ Forward pass: iteratively try to make the Q-value close to the target, using the loss $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[(y_i - Q(s, a; \theta_i))^2\right]$ where $y_i = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ Backward pass: gradient update with respect to the Q-function parameters θ, as before
Training the Q-network: Experience Replay [Mnih et al. NIPS Workshop 2013; Nature 2015] Learning from batches of consecutive samples is problematic: samples are correlated, which leads to inefficient learning; and the current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side), which can lead to bad feedback loops and poor local minima The authors address these problems using experience replay: continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played, and train the Q-network on random minibatches of transitions from the replay memory instead of consecutive samples
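A minimal replay-memory sketch along these lines (capacity and batch size are illustrative defaults): transitions are stored as they are generated, and training minibatches are sampled uniformly at random instead of using consecutive samples.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)          # oldest transitions are discarded when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        idx = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in idx]          # uniform random minibatch of transitions

    def __len__(self):
        return len(self.buffer)
```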
Putting it together: DQN [Mnih et al. NIPS Workshop 2013; Nature 2015] Epsilon is annealed linearly from 1 to 0.1 over the first 1M frames and then fixed at 0.1
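A sketch of this linear epsilon schedule, plus an outline of the outer loop that ties the earlier sketches together; `to_state_tensor` and `to_batch` are assumed preprocessing/collation helpers, and `env`, `epsilon_greedy`, `q_learning_step`, and the replay `memory` follow the hypothetical interfaces used above rather than the authors' exact implementation.

```python
def epsilon_schedule(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first 1M frames, then hold it at 0.1."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * (frame / anneal_frames)

def dqn_loop(env, q_net, optimizer, memory, epsilon_greedy, q_learning_step,
             to_state_tensor, to_batch, num_frames=2_000_000, batch_size=32):
    """Outline of DQN training: act epsilon-greedily, store transitions, train on replayed minibatches."""
    state = env.reset()
    for frame in range(num_frames):
        eps = epsilon_schedule(frame)
        q_values = q_net(to_state_tensor(state)).squeeze(0).tolist()  # Q(s_t, a) for all actions
        action = epsilon_greedy(q_values, eps)
        next_state, reward, done = env.step(action)
        memory.push(state, action, reward, next_state, done)          # add transition to replay memory
        state = env.reset() if done else next_state
        if len(memory) >= batch_size:
            q_learning_step(q_net, optimizer, to_batch(memory.sample(batch_size)))
```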
Evaluation: [results figures A, B, C]