1
"Playing Atari with deep reinforcement learning."
Mnih, Volodymyr, et al. Presented by Moshe Abuhasira, Andrey Gurevich, Shachar Schnapp
2
So far… Supervised Learning
Data: (x, y), where x is data and y is a label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.
[Figure: image classification example labeled “Cat”; image is CC0 public domain]
3
So far… Unsupervised Learning
Data: x. Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
[Figures: 1-d and 2-d density estimation; 2-d density images left and right are CC0 public domain]
4
Reinforcement Learning
Problems involving an agent interacting with an environment, which provides numeric reward signals
Goal: Learn how to take actions in order to maximize reward
5
Overview
What is Reinforcement Learning?
Markov Decision Processes
Q-Learning
6
Reinforcement Learning
Agent Environment
7
Reinforcement Learning
Agent State st Environment
8
Reinforcement Learning
Agent State st Action at Environment
9
Reinforcement Learning
Agent State st Reward rt Action at Environment
10
Reinforcement Learning
Agent State st Reward rt Next state st+1 Action at Environment
11
Cart-Pole Problem Objective: Balance a pole on top of a movable cart
State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright
This image is CC0 public domain
12
Robot Locomotion Objective: Make the robot move forward
State: Angle and position of the joints
Action: Torques applied on joints
Reward: 1 at each time step for staying upright and moving forward
13
Atari Games Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
14
Go Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise
This image is CC0 public domain
15
How can we mathematically formalize the RL problem?
Agent State st Reward rt Next state st+1 Action at Environment
16
Markov Decision Process
Mathematical formulation of the RL problem
Markov property: the current state completely characterises the state of the world
Defined by (S, A, R, P, γ):
S: set of possible states
A: set of possible actions
R: distribution of reward given (state, action) pair
P: transition probability, i.e. distribution over next state given (state, action) pair
γ: discount factor
17
Markov Decision Process
At time step t=0, environment samples initial state s0 ~ p(s0)
Then, for t=0 until done:
Agent selects action at
Environment samples reward rt ~ R( . | st, at)
Environment samples next state st+1 ~ P( . | st, at)
Agent receives reward rt and next state st+1
A policy π is a function from S to A that specifies what action to take in each state
Objective: find the policy π* that maximizes the cumulative discounted reward Σt≥0 γ^t rt
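A minimal sketch of this interaction loop in Python (the env object with reset()/step() and the policy callable are assumed, Gym-style interfaces, not something defined in the slides):

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode of the MDP and accumulate the discounted return.

    Assumed interface: env.reset() -> s0, env.step(a) -> (next_state, reward, done).
    """
    state = env.reset()                     # environment samples s0 ~ p(s0)
    total_return, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(state)              # agent selects a_t
        next_state, reward, done = env.step(action)  # env samples r_t and s_{t+1}
        total_return += discount * reward   # add gamma^t * r_t
        discount *= gamma
        state = next_state
        if done:
            break
    return total_return
```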
18
A simple MDP: Grid World
States: cells of the grid
Actions = { right, left, up, down }
Set a negative “reward” for each transition (e.g. r = -1)
Objective: reach one of the terminal states (greyed out) in the least number of actions
19
A simple MDP: Grid World
[Figure: the grid world under a Random Policy vs. the Optimal Policy]
20
The optimal policy 𝝅* We want to find the optimal policy 𝝅* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability…)?
21
The optimal policy 𝝅* We want to find the optimal policy 𝝅* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability…)? Maximize the expected sum of rewards!
Formally: 𝝅* = argmax_𝝅 E[ Σt≥0 γ^t rt | 𝝅 ], with s0 ~ p(s0), at ~ 𝝅(·|st), st+1 ~ P(·|st, at)
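To make the expectation concrete, it can be estimated by averaging the discounted return over many sampled episodes. A small Monte Carlo sketch, reusing the hypothetical run_episode helper from the earlier snippet:

```python
def estimate_expected_return(env, policy, num_episodes=1000, gamma=0.99):
    """Monte Carlo estimate of E[ sum_t gamma^t * r_t | pi ]:
    average the discounted return of `policy` over many sampled episodes."""
    returns = [run_episode(env, policy, gamma) for _ in range(num_episodes)]
    return sum(returns) / len(returns)
```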
22
Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σt≥0 γ^t rt | s0 = s, π ]
23
Q-value function Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
V^π(s) = E[ Σt≥0 γ^t rt | s0 = s, π ]
How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q^π(s, a) = E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
24
Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
Q*(s, a) = max_π E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
Q* satisfies the following Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Intuition: if the optimal state-action values for the next time-step Q*(s’,a’) are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s’,a’)
25
Bellman equation The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
Q*(s, a) = max_π E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
Q* satisfies the following Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Intuition: if the optimal state-action values for the next time-step Q*(s’,a’) are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s’,a’)
The optimal policy 𝝅* corresponds to taking the best action in any state as specified by Q*
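As a tiny illustration of that last point, assuming Q* is available as a table indexed by [state, action] (a toy assumption, since in general it must be learned):

```python
import numpy as np

def greedy_policy(Q, state):
    """pi*(s) = argmax_a Q*(s, a): take the action with the highest optimal Q-value."""
    return int(np.argmax(Q[state]))
```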
26
A simple MDP: Grid World
[Figure: the grid world under a Random Policy vs. the Optimal Policy]
28
Solving for the optimal policy
Value iteration algorithm: Use the Bellman equation as an iterative update:
Qi+1(s, a) = E[ r + γ max_a' Qi(s', a') | s, a ]
Qi will converge to Q* as i -> infinity
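A tabular sketch of this update, assuming the transition model is available as a hypothetical structure P[s][a] given as a list of (prob, next_state, reward, done) tuples:

```python
import numpy as np

def q_value_iteration(P, num_states, num_actions, gamma=0.9, num_iters=100):
    """Tabular Q-value iteration: repeatedly apply the Bellman update
    Q_{i+1}(s, a) = E[ r + gamma * max_a' Q_i(s', a') | s, a ].

    P[s][a] is a list of (prob, next_state, reward, done) tuples.
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        Q_next = np.zeros_like(Q)
        for s in range(num_states):
            for a in range(num_actions):
                for prob, s2, r, done in P[s][a]:
                    # Terminal transitions contribute only the immediate reward.
                    target = r if done else r + gamma * Q[s2].max()
                    Q_next[s, a] += prob * target
        Q = Q_next
    return Q  # approximates Q*; the greedy policy picks argmax_a Q[s, a]
```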
29
Solving for the optimal policy
Value iteration algorithm: Use Bellman equation as an iterative update Qi will converge to Q* as i -> infinity What’s the problem with this?
30
Solving for the optimal policy
Value iteration algorithm: Use Bellman equation as an iterative update Qi will converge to Q* as i -> infinity What’s the problem with this? Must compute Q(s,a) for every state-action pair. If there are many states s, we can’t see them all – generalization is needed
31
Solving for the optimal policy
Value iteration algorithm: Use Bellman equation as an iterative update Qi will converge to Q* as i -> infinity What’s the problem with this? Must compute Q(s,a) for every state-action pair. If there are many states s, we can’t see them all – generalization is needed Solution: use a function approximator to estimate Q(s,a), e.g. a neural network!
32
Q-learning Q-learning: Use a function approximator to estimate the action-value function: Q(s, a; θ) ≈ Q*(s, a) If the function approximator is a deep neural network => deep Q-learning!
33
Q-learning Q-learning: Use a function approximator to estimate the action-value function: Q(s, a; θ) ≈ Q*(s, a), where θ are the function parameters (weights) If the function approximator is a deep neural network => deep Q-learning!
34
Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
35
Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Forward Pass
Loss function: Li(θi) = E_{s,a}[ (yi − Q(s, a; θi))^2 ], where yi = E_{s'}[ r + γ max_a' Q(s', a'; θi−1) | s, a ]
36
Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Forward Pass
Loss function: Li(θi) = E_{s,a}[ (yi − Q(s, a; θi))^2 ], where yi = E_{s'}[ r + γ max_a' Q(s', a'; θi−1) | s, a ]
Backward Pass
Gradient update (with respect to Q-function parameters θ):
∇θi Li(θi) = E_{s,a,s'}[ (r + γ max_a' Q(s', a'; θi−1) − Q(s, a; θi)) ∇θi Q(s, a; θi) ]
37
Q-learning Remember: we want to find a Q-function that satisfies the Bellman equation:
Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Forward Pass
Loss function: Li(θi) = E_{s,a}[ (yi − Q(s, a; θi))^2 ], where yi = E_{s'}[ r + γ max_a' Q(s', a'; θi−1) | s, a ]
Backward Pass
Gradient update (with respect to Q-function parameters θ):
∇θi Li(θi) = E_{s,a,s'}[ (r + γ max_a' Q(s', a'; θi−1) − Q(s, a; θi)) ∇θi Q(s, a; θi) ]
The loss pushes Q(s, a; θ) close to the target value (yi) it should have, if the Q-function corresponds to the optimal Q* (and the optimal policy 𝝅*)
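A sketch of one such forward/backward step with a neural-network approximator in PyTorch; q_net, target_net (holding the older parameters θi−1), optimizer, and the batch layout are placeholder assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def q_learning_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the squared Bellman error
    L(theta) = E[(y - Q(s, a; theta))^2], y = r + gamma * max_a' Q(s', a'; theta_old)."""
    states, actions, rewards, next_states, dones = batch  # dones: float 0/1 flags

    # Forward pass: Q(s, a; theta) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y uses the older/frozen parameters and is not backpropagated through.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    # Backward pass: gradient of the loss with respect to theta.
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```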
38
Exploration vs exploitation:
Decision making: we need to trade off exploration (gathering data about the actions’ rewards) and exploitation (making decisions based on the data already gathered).
A purely greedy policy does not explore sufficiently:
Exploration: choose an action never used before in this state.
Exploitation: choose the action with the highest expected reward.
39
Epsilon greedy Set 𝜺(𝒕) ∈ [0,1]
With prob. 𝜺(𝒕): Explore by picking an action chosen uniformly at random
With prob. 𝟏 − 𝜺(𝒕): Exploit by picking the action with the highest expected reward.
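A minimal epsilon-greedy sketch, assuming we already have per-action value estimates q_values for the current state (e.g. from a Q-table or Q-network):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=np.random):
    """With probability epsilon explore (uniform random action);
    otherwise exploit (action with the highest estimated value)."""
    num_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.randint(num_actions)   # explore
    return int(np.argmax(q_values))       # exploit
```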
40
Playing Atari Games
[Mnih et al. NIPS Workshop 2013; Nature 2015] Playing Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls: Left, Right, Up, Down, Fire, etc.
Reward: Score increase/decrease at each time step
41
Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture
Q(s, a; θ): neural network with weights θ
Input: current state st, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)
Network: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values)
42
Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture
Q(s, a; θ): neural network with weights θ
Familiar conv layers and FC layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values)
Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)
43
Q-network Architecture
[Mnih et al. NIPS Workshop 2013; Nature 2015] Q-network Architecture
Q(s, a; θ): neural network with weights θ
Network: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values)
The last FC layer has a 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4)
A single feedforward pass computes the Q-values for all actions from the current state = efficient!
Number of actions: between 4 and 18, depending on the Atari game
Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)
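A sketch of this architecture in PyTorch, matching the layer sizes listed above; the ReLU nonlinearities and lack of padding are assumptions about details the slide does not spell out:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DQN-style Q-network: 84x84x4 input -> 16 8x8 conv, stride 4 ->
    32 4x4 conv, stride 2 -> FC-256 -> FC with one Q-value per action."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # (4, 84, 84) -> (16, 20, 20)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # (16, 20, 20) -> (32, 9, 9)
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # Q(st, a) for every action
        )

    def forward(self, x):
        # x: batch of states with shape (N, 4, 84, 84)
        return self.head(self.features(x))
```

For example, QNetwork(num_actions=4)(torch.zeros(1, 4, 84, 84)) returns a (1, 4) tensor of Q-values, one per action, in a single forward pass.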
44
Training the Q-network: Loss function
[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Loss function
Bellman equation: Q*(s, a) = E_{s'}[ r + γ max_a' Q*(s', a') | s, a ]
Forward pass: iteratively try to make the Q-function value Q(s, a; θ) close to the target yi
Loss function: Li(θi) = E_{s,a}[ (yi − Q(s, a; θi))^2 ], where yi = E_{s'}[ r + γ max_a' Q(s', a'; θi−1) | s, a ]
Backward Pass
Gradient update (with respect to Q-function parameters θ):
∇θi Li(θi) = E_{s,a,s'}[ (r + γ max_a' Q(s', a'; θi−1) − Q(s, a; θi)) ∇θi Q(s, a; θi) ]
45
Training the Q-network: Experience Replay
[Mnih et al. NIPS Workshop 2013; Nature 2015] Training the Q-network: Experience Replay
Learning from batches of consecutive samples is problematic:
Samples are correlated, which means inefficient learning
The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side), which may result in bad local minima
The authors address these problems using experience replay:
Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
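A minimal replay-memory sketch (fixed capacity with uniform random sampling; the default capacity is an arbitrary placeholder):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (st, at, rt, st+1, done) transitions, sampled
    uniformly at random to break the correlation between consecutive samples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```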
46
Putting it together: DQN
[Mnih et al. NIPS Workshop 2013; Nature 2015] Putting it together: DQN
Epsilon is annealed linearly from 1 to 0.1 over the first 1M frames, and then fixed at 0.1
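A sketch of that linear annealing schedule, with the outer DQN loop indicated in comments; epsilon_greedy_action, ReplayMemory and q_learning_step refer to the hypothetical helpers sketched earlier, and env stands for an Atari environment:

```python
def epsilon_schedule(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end` over the first
    `anneal_frames` frames, then keep it fixed at `end`."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

# Outer loop (sketch):
#   for each frame t:
#       eps = epsilon_schedule(t)
#       a_t = epsilon_greedy_action(q_net(s_t), eps)
#       take a_t in env, observe r_t, s_{t+1}; memory.push(s_t, a_t, r_t, s_{t+1}, done)
#       batch = memory.sample(32)
#       q_learning_step(q_net, target_net, optimizer, batch)
```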
47
Evaluation:
48
Evaluation: [Figure: results panels A, B, C]