
Deep Reinforcement Learning


1 Deep Reinforcement Learning
Ph.D. student Wang Yu (王宇)

2 Deep Q-Network
Published in Nature. A CNN trained with a variant of Q-learning. Atari games serve as the testbed, with raw pixels as input; the agent is not provided with any game-specific information or hand-designed visual features.
V. Mnih, K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement learning, Nature, 518(7540):529–533, 2015.

3 Contribution
This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent capable of learning to excel at a diverse array of challenging tasks. It develops a novel artificial agent, termed a deep Q-network (DQN), that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.
V. Mnih, K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement learning, Nature, 518(7540):529–533, 2015.

4 What Hinton, Bengio & LeCun said about Deep RL
"We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look."
Yann LeCun, Yoshua Bengio & Geoffrey Hinton, Deep learning, Nature, 521:436–444, doi:10.1038/nature14539, 2015.

5 What is RL?
Credit: Deep RL tutorial by David Silver, Google DeepMind.
Policy: a mapping from states to actions, deterministic or stochastic.
RL's task: find the optimal policy, the one that maximizes the expected cumulative reward.
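In standard notation (a reconstruction; the slide's own formulas are not in the transcript), a policy and the RL objective can be written as:

$$a = \pi(s) \;\;\text{(deterministic)} \qquad \pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s] \;\;\text{(stochastic)}$$
$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t+1}\Big]$$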

6 Markov Decision Process
Ordered sequence: the interaction unfolds as an ordered sequence of states, actions and rewards.
Determinacy: every action the agent takes affects the state of the world it subsequently observes.
MDP: formally, a tuple of states, actions, transition dynamics, rewards and a discount factor, written out below.
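In the standard formulation (assumed here, not transcribed from the slide):

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma), \qquad \mathcal{P}^{a}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^{a}_{s} = \mathbb{E}[r_{t+1} \mid S_t = s, A_t = a]$$

with the Markov property $\mathbb{P}[S_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t]$.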

7 Return, Value Function and Bellman Equation
Return: the cumulative discounted reward from time t. Value function: the expected return from a state under a policy. Bellman equation: a recursive relation between a state's value and the values of its successor states. The standard forms are written out below.
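In the notation of the DQN paper:

$$R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\big[R_t \mid S_t = s\big]$$
$$V^{\pi}(s) = \mathbb{E}_{\pi}\big[r_{t+1} + \gamma\, V^{\pi}(S_{t+1}) \mid S_t = s\big] \quad \text{(Bellman equation)}$$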

8 Action-value function
Action-value function: the expected return after taking action a in state s and following the policy thereafter. Optimal action-value function: the maximum expected return achievable over all policies. Iterative update: repeatedly applying the Bellman optimality equation converges to the optimal action-value function. The formulas are written out below.
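Written out (notation as in the DQN paper):

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\big[R_t \mid S_t = s,\; A_t = a\big]$$
$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a\big]$$
$$Q_{i+1}(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q_{i}(s',a') \mid s, a\big], \qquad Q_i \to Q^{*} \text{ as } i \to \infty$$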

9 Why this basic approach is impractical
1. The action-value function is estimated separately for each sequence, without any generalization.
2. Computing the current estimate Q_{i+1} requires the previous estimate Q_i at the next state, so the targets depend on the very values being learned.
3. Curse of dimensionality.
Solution: a function approximator!

10 How to approximate
1. By a linear function approximator.
2. By a nonlinear function approximator such as a neural network (a Q-network).
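In symbols (standard parameterisations; an assumption, not transcribed from the slide):

$$Q(s, a; \theta) \approx Q^{*}(s, a)$$
$$\text{linear: } Q(s, a; \theta) = \theta^{\top} \phi(s, a) \qquad \text{nonlinear: } Q(s, a; \theta) \text{ computed by a neural network with weights } \theta$$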

11 The Arcade Learning Environment (ALE)
1. Visual input (210 x 160 RGB video at 60 Hz).
2. A diverse and interesting set of tasks that were designed to be difficult for human players.
3. Our goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible.

12 Schematic illustration

13 The structure of the CNN
Actual code from
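The released implementation was Lua/Torch; the sketch below re-expresses the architecture reported in the Nature paper (84x84x4 input, three convolutional layers, one fully connected layer, one linear output per action) in PyTorch. Class and variable names are illustrative, not the original code.

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one Q-value per action
        )

    def forward(self, x):  # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.head(self.features(x))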

14 Q-learning
Approximate target values are used in place of the true optimal values. A Q-network can be trained by minimising a sequence of loss functions L_i(θ_i) that changes at each iteration i. Differentiating the loss function with respect to the weights, we arrive at the gradient written out below.
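The target, loss and gradient as given in the DQN papers (up to a constant factor in the gradient):

$$y_i = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\big]$$
$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\big[(y_i - Q(s, a; \theta_i))^2\big]$$
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s'}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\big)\, \nabla_{\theta_i} Q(s, a; \theta_i)\Big]$$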

15 Key Points
1. Two networks with the same structure. The weights of the target network are updated (copied) from the online network periodically.
2. The algorithm is model-free: it does not explicitly estimate the reward and transition dynamics.
3. The algorithm is off-policy: it learns about the greedy policy while following an epsilon-greedy behaviour policy (a sketch follows this list).
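A minimal sketch of these two points in PyTorch, assuming a q_net / target_net pair like the illustrative DQN module above (function and variable names are assumptions, not the paper's code):

import random
import torch

def epsilon_greedy(q_net, state, epsilon, n_actions):
    if random.random() < epsilon:                       # explore with probability epsilon
        return random.randrange(n_actions)
    with torch.no_grad():                               # exploit: greedy w.r.t. current Q
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def sync_target(q_net, target_net):
    target_net.load_state_dict(q_net.state_dict())      # copy online weights every C steps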

16 Training algorithm for deep Q-networks (1/2)
1. Preprocessing:
(1) To encode a single frame, take the maximum value for each pixel colour over the frame being encoded and the previous frame.
(2) Extract the Y (luminance) channel and rescale from 210x160 to 84x84.
(3) Stack the 4 most recent frames (84x84x4) as the input to the Q-function.
(4) Clip (normalize) the game reward to +1, 0 and -1.
2. Experience replay:
(1) Greater data efficiency.
(2) Breaks the correlations between consecutive samples and therefore reduces the variance of the updates.
(3) Smooths out learning and avoids oscillations or divergence in the parameters. (A replay-buffer sketch follows this list.)
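A minimal sketch of a uniform experience-replay buffer (illustrative names, not the paper's code):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)            # oldest transitions are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # uniform sampling breaks correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones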

17 Training algorithm for deep Q-networks (2/2)
3. Use two networks with the same structure. The delay between the time an update to Q is made and the time the update affects the targets makes divergence or oscillations much more unlikely.
4. Error clipping. Clip the error term of the update to lie between -1 and 1. This further improves the stability of the algorithm (a sketch of the clipped loss follows).
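One common way to realise this error clipping is a Huber (smooth-L1) loss, whose gradient equals the TD error clipped to [-1, 1]. This is a sketch of that equivalent formulation, not the original Lua implementation:

import torch.nn.functional as F

def clipped_td_loss(q_values, targets):
    # Quadratic for |error| <= 1 and linear beyond, so the error term that
    # reaches the gradient is effectively clipped to the range [-1, 1].
    return F.smooth_l1_loss(q_values, targets)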

18 Algorithm
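A hedged sketch of deep Q-learning with experience replay (Algorithm 1 in the paper), reusing the illustrative DQN, ReplayBuffer, epsilon_greedy and sync_target pieces sketched above; the env interface and hyperparameter values are assumptions:

import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, num_steps=50_000, gamma=0.99,
              batch_size=32, sync_every=10_000, epsilon=0.1):
    q_net, target_net = DQN(n_actions), DQN(n_actions)
    sync_target(q_net, target_net)                       # start from identical weights
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
    buffer = ReplayBuffer()

    state = env.reset()                                  # assumed: a (4, 84, 84) float tensor
    for step in range(num_steps):
        action = epsilon_greedy(q_net, state, epsilon, n_actions)
        next_state, reward, done = env.step(action)      # assumed env interface
        buffer.push(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        if len(buffer.buffer) >= batch_size:
            s, a, r, s2, d = buffer.sample(batch_size)
            s, s2 = torch.stack(s), torch.stack(s2)
            a = torch.tensor(a)
            r = torch.tensor(r, dtype=torch.float32)
            d = torch.tensor(d, dtype=torch.float32)
            with torch.no_grad():                        # targets come from the frozen network
                y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(q, y)                # clipped-error loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step % sync_every == 0:
            sync_target(q_net, target_net)               # periodic target-network update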

19 Comparison

20 Visualization of learned value functions (1/2)

21 Visualization of learned value functions (2/2)

22 Comparison of cases with or without replay and target Q

23 Some improvements of DQN
1. Calculation of the target Q. Hado van Hasselt, Arthur Guez, David Silver, Deep Reinforcement Learning with Double Q-learning, arXiv, 22 Sep 2015.
2. Prioritized experience replay. Tom Schaul, John Quan, Ioannis Antonoglou, David Silver, Prioritized Experience Replay, arXiv, 18 Nov 2015.
3. Dueling network architectures. Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas, Dueling Network Architectures for Deep Reinforcement Learning, arXiv, 20 Nov 2015 (ICML 2016 best paper).

24 Double Q-learning
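The key formula behind these slides, from van Hasselt et al.: the online network selects the action, while the target network evaluates it.

$$Y_t^{\text{DQN}} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a;\, \theta^{-}_t)$$
$$Y_t^{\text{DoubleDQN}} = r_{t+1} + \gamma\, Q\big(s_{t+1},\, \arg\max_{a} Q(s_{t+1}, a; \theta_t);\; \theta^{-}_t\big)$$

Decoupling action selection from action evaluation reduces the over-estimation bias introduced by the max operator.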

25 Double Q-learning

26 End & Thanks


Download ppt "Deep Reinforcement Learning"

Similar presentations


Ads by Google