1 "Playing Atari with deep reinforcement learning."
Mnih, Volodymyr, et al. Presented by Moshe Abuhasira, Andrey Gurevich, Shachar Schnapp

2 So far… Supervised Learning
Data: (x, y), where x is the data and y is the label. Goal: learn a function to map x -> y. Examples: classification, regression, object detection, semantic segmentation, image captioning, etc. (Figure: an image of a cat labeled "Cat" as a classification example.)

3 So far… Unsupervised Learning
Data: x, just data, no labels! Goal: learn some underlying hidden structure of the data. Examples: clustering, dimensionality reduction, feature learning, density estimation, etc. (Figures: 1-d and 2-d density estimation examples.)

4 Reinforcement Learning
Problems involving an agent interacting with an environment, which provides numeric reward signals. Goal: learn how to take actions in order to maximize reward.

5 Overview What is Reinforcement Learning? Markov Decision Processes
Q-Learning

6 Reinforcement Learning
Agent Environment

7 Reinforcement Learning
Agent State st Environment

8 Reinforcement Learning
Agent State st Action at Environment

9 Reinforcement Learning
Agent State st Reward rt Action at Environment

10 Reinforcement Learning
Agent State st Reward rt Next state st+1 Action at Environment

11 Cart-Pole Problem Objective: Balance a pole on top of a movable cart
State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied on the cart. Reward: 1 at each time step if the pole is upright.
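As an aside (not from the original slides), here is a minimal sketch of interacting with this environment through the Gymnasium package's CartPole-v1, using a random policy; the four-number observation matches the state described above.

import gymnasium as gym

# Minimal sketch: run one CartPole episode with a random policy.
# Observation = [cart position, cart velocity, pole angle, pole angular velocity];
# reward is +1 per time step while the pole stays upright.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random horizontal push (left/right)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)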

12 Robot Locomotion Objective: Make the robot move forward
State: angle and position of the joints. Action: torques applied to the joints. Reward: 1 at each time step for being upright, plus a bonus for forward movement.

13 Atari Games Objective: Complete the game with the highest score
State: raw pixel inputs of the game state. Action: game controls, e.g. Left, Right, Up, Down. Reward: score increase/decrease at each time step.

14 Go Objective: Win the game! State: Position of all pieces
Action: where to put the next piece down. Reward: 1 if you win at the end of the game, 0 otherwise.

15 How can we mathematically formalize the RL problem?
Agent State st Reward rt Next state st+1 Action at Environment

16 Markov Decision Process
Mathematical formulation of the RL problem. Markov property: the current state completely characterises the state of the world. Defined by the tuple (S, A, R, P, γ): S: set of possible states; A: set of possible actions; R: distribution of reward given a (state, action) pair; P: transition probability, i.e. distribution over the next state given a (state, action) pair; γ: discount factor.

17 Markov Decision Process
At time step t=0, the environment samples an initial state s0 ~ p(s0). Then, for t=0 until done: the agent selects action at; the environment samples reward rt ~ R( . | st, at); the environment samples next state st+1 ~ P( . | st, at); the agent receives reward rt and next state st+1. A policy π is a function from S to A that specifies what action to take in each state. Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t rt.
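A minimal Python sketch of this loop (not part of the original slides); sample_initial_state, policy, sample_reward, and sample_next_state are hypothetical placeholders standing in for p(s0), the policy π, R( . | s, a), and P( . | s, a).

def run_episode(policy, sample_initial_state, sample_reward, sample_next_state,
                gamma=0.9, max_steps=100):
    s = sample_initial_state()            # s0 ~ p(s0)
    ret = 0.0                             # cumulative discounted reward
    for t in range(max_steps):
        a = policy(s)                     # agent selects action a_t
        r = sample_reward(s, a)           # r_t ~ R( . | s_t, a_t)
        s = sample_next_state(s, a)       # s_{t+1} ~ P( . | s_t, a_t)
        ret += (gamma ** t) * r           # accumulate sum_t gamma^t r_t
    return ret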

18 A simple MDP: Grid World
States: the cells of the grid. Actions: {right, left, up, down}. Set a negative "reward" for each transition (e.g. r = -1). Objective: reach one of the terminal states (greyed out) in the least number of actions.
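To make this concrete, here is a tiny hypothetical 1-D corridor version of such a grid world in tabular form (my own illustration, not from the slides); the (probability, next state, reward, done) convention is reused by the value-iteration sketch later on.

N_STATES, N_ACTIONS = 4, 2                       # states 0..3; actions: 0 = left, 1 = right
TERMINAL = 0                                     # greyed-out terminal state

def _step(s, a):
    if s == TERMINAL:                            # terminal state is absorbing, no further reward
        return s, 0.0, True
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s_next, -1.0, s_next == TERMINAL      # next state, reward r = -1, done flag

# P[s][a] = list of (probability, next_state, reward, done); deterministic here
P = {s: {a: [(1.0, *_step(s, a))] for a in range(N_ACTIONS)}
     for s in range(N_STATES)}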

19 A simple MDP: Grid World
(Figure: grid-world panels showing a random policy and the optimal policy.)

20 The optimal policy 𝝅* We want to find the optimal policy 𝝅* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability…)?

21 The optimal policy 𝝅* We want to find the optimal policy 𝝅* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability…)? Maximize the expected sum of rewards! Formally: π* = argmax_π E[ Σ_{t≥0} γ^t rt | π ], with s0 ~ p(s0), at ~ π( . | st), st+1 ~ P( . | st, at).

22 Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, … How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ].

23 Q-value function Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, … How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s: V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ]. How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy: Q^π(s, a) = E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ].

24 Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s, a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]. Q* satisfies the following Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Intuition: if the optimal state-action values for the next time step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a').

25 Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s, a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]. Q* satisfies the following Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Intuition: if the optimal state-action values for the next time step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a'). The optimal policy π* corresponds to taking the best action in any state as specified by Q*.

26 A simple MDP: Grid World
(Figure: grid-world panels showing a random policy and the optimal policy.)

27 (figure-only slide)

28 Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity.

29 Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity. What's the problem with this?

30 Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity. What's the problem with this? Must compute Q(s, a) for every state-action pair. If there are many states s we cannot visit them all, so generalization is needed.

31 Solving for the optimal policy
Value iteration algorithm: use the Bellman equation as an iterative update: Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity. What's the problem with this? Must compute Q(s, a) for every state-action pair. If there are many states s we cannot visit them all, so generalization is needed. Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
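A minimal Python sketch of this tabular value iteration (my own illustration), using the P[s][a] = [(probability, next_state, reward, done)] model format from the corridor grid-world sketched earlier:

import numpy as np

def q_value_iteration(P, n_states, n_actions, gamma=0.9, n_iters=100):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q_new = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman update: expected reward plus discounted value of the best next action
                Q_new[s, a] = sum(p * (r + (0.0 if done else gamma * Q[s_next].max()))
                                  for p, s_next, r, done in P[s][a])
        Q = Q_new
    return Q                                   # Q_i approaches Q* as the number of iterations grows

On the corridor example above, q_value_iteration(P, N_STATES, N_ACTIONS) yields Q-values whose greedy policy always moves left, i.e. toward the terminal state, as expected.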

32 Q-learning Q-learning: use a function approximator to estimate the action-value function. If the function approximator is a deep neural network => deep Q-learning!

33 Q-learning Q-learning: use a function approximator Q(s, a; θ) to estimate the action-value function, where θ are the function parameters (weights). If the function approximator is a deep neural network => deep Q-learning!

34 Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ].

35 Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Forward pass. Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ].

36 Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Forward pass. Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]. Backward pass. Gradient update (with respect to the Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ].

37 Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Forward pass. Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]. Backward pass. Gradient update (with respect to the Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]. The update pushes Q(s, a; θ_i) close to the target value y_i it should have if the Q-function corresponds to the optimal Q* (and the optimal policy π*).
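A minimal PyTorch sketch of one such forward/backward pass (an illustration with my own names, not the authors' code); q_net is assumed to be a network mapping a batch of states to per-action Q-values, and optimizer an optimizer over its parameters.

import torch
import torch.nn.functional as F

def q_learning_update(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    # Forward pass: target y = r + gamma * max_a' Q(s', a'); no gradient flows through the target
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = F.mse_loss(q_sa, y)                             # (y - Q(s, a; theta))^2
    # Backward pass: gradient step on theta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()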

38 Exploration vs exploitation:
Decision making: we need to trade off exploration (gathering data about the actions' rewards) and exploitation (making decisions based on the data already gathered). A purely greedy policy does not explore sufficiently. Exploration: choose an action never used before in this state. Exploitation: choose the action with the highest expected reward.

39 Epsilon greedy Set 𝜺(𝒕) ∈ [0,1].
With prob. 𝜺(𝒕): explore by picking an action uniformly at random. With prob. 𝟏 − 𝜺(𝒕): exploit by picking the action with the highest expected reward.
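A minimal sketch of epsilon-greedy action selection (illustration only); q_values is assumed to hold the estimated Q(s, a) for each action in the current state.

import random

def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:                                  # explore
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit: argmax_a Q(s, a)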

40 Playing Atari Games
[Mnih et al. NIPS Workshop 2013; Nature 2015] Playing Atari Games Objective: complete the game with the highest score. State: raw pixel inputs of the game state. Action: game controls: Left, Right, Up, Down, Fire, etc. Reward: score increase/decrease at each time step.

41 Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network architecture Q(s, a; θ): a neural network with weights θ. Input: current state st, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping). Layers, from input to output: 16 8x8 conv filters, stride 4; 32 4x4 conv filters, stride 2; FC-256; FC-4 (Q-values).

42 Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network architecture Q(s, a; θ): a neural network with weights θ. Familiar conv layers and an FC layer. Input: current state st, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping). Layers, from input to output: 16 8x8 conv filters, stride 4; 32 4x4 conv filters, stride 2; FC-256; FC-4 (Q-values).

43 Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network architecture Q(s, a; θ): a neural network with weights θ. Input: current state st, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping). Layers, from input to output: 16 8x8 conv filters, stride 4; 32 4x4 conv filters, stride 2; FC-256; FC-4 (Q-values). The last FC layer has a 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). A single feedforward pass computes the Q-values for all actions from the current state = efficient! The number of actions is between 4 and 18, depending on the Atari game.
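A PyTorch sketch of the architecture described above (an illustration, not the authors' original code); the flattened size 32*9*9 follows from the 84x84 input passing through the two conv layers.

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4 stacked 84x84 frames -> 16 x 20 x 20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # -> 32 x 9 x 9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-4: one Q-value per action
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.net(x)                               # Q-values for all actions in one pass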

44 Training the Q-network: Loss function
[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: loss function. Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]. Forward pass: iteratively try to make the Q-function value close to the target value. Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s, a; θ_i))^2 ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a ]. Backward pass: gradient update (with respect to the Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ].

45 Training the Q-network: Experience Replay
[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: experience replay. Learning from batches of consecutive samples is problematic: samples are correlated, which means inefficient learning; and the current Q-network parameters determine the next training samples (if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side), which can lead to bad feedback loops and poor local minima. The authors address these problems using experience replay: continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played, and train the Q-network on random minibatches of transitions from the replay memory instead of consecutive samples.
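A minimal sketch of such a replay memory (illustration only), storing transitions as (st, at, rt, st+1, done) tuples.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)              # oldest transitions are dropped automatically

    def __len__(self):
        return len(self.buffer)

    def push(self, transition):
        self.buffer.append(transition)                    # store (s, a, r, s_next, done)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)     # random minibatch breaks sample correlation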

46 Putting it together: DQN
[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: DQN. Epsilon is annealed linearly from 1 to 0.1 over the first 1M frames and then fixed at 0.1.
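A rough sketch of how these pieces fit together (illustration only: it omits the frame preprocessing, reward clipping, and the separate target network of the full method, and reuses the hypothetical QNetwork, ReplayMemory, epsilon_greedy, and q_learning_update helpers sketched above; env is assumed to follow the Gymnasium reset/step interface and to return preprocessed 4x84x84 frame stacks).

import numpy as np
import torch

def train_dqn(env, n_frames=1_000_000, batch_size=32, gamma=0.99):
    q_net = QNetwork(n_actions=env.action_space.n)
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
    memory = ReplayMemory()

    s, _ = env.reset()
    for frame in range(n_frames):
        eps = max(0.1, 1.0 - 0.9 * frame / 1_000_000)     # linear anneal 1.0 -> 0.1, then fixed
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(np.asarray(s), dtype=torch.float32).unsqueeze(0))[0]
        a = epsilon_greedy(q_values.tolist(), eps)

        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        memory.push((s, a, r, s_next, float(done)))       # add the transition to replay memory
        s = env.reset()[0] if done else s_next

        if len(memory) >= batch_size:                     # train on a random minibatch
            states, actions, rewards, next_states, dones = zip(*memory.sample(batch_size))
            q_learning_update(q_net, optimizer,
                              torch.as_tensor(np.stack(states), dtype=torch.float32),
                              torch.as_tensor(np.asarray(actions), dtype=torch.long),
                              torch.as_tensor(np.asarray(rewards), dtype=torch.float32),
                              torch.as_tensor(np.stack(next_states), dtype=torch.float32),
                              torch.as_tensor(np.asarray(dones), dtype=torch.float32),
                              gamma)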

47 Evaluation:

48 Evaluation: (figure-only slide with result panels A, B, and C)

