1
Reinforcement learning with unsupervised auxiliary tasks
Max Jaderberg, Volodymyr Mnih et al., 2017, ICLR (presented by Youngnam Kim)
2
Main idea
A deep neural network maps the history of observations {o_1, …, o_t}, actions {a_1, …, a_t}, and rewards {r_1, …, r_t} to a representation of the state s_t.
We generally believe that good representations lead to a good policy. So how can we make the agent learn good representations?
If there were an agent that could predict and control the environment perfectly, that agent could achieve any goal in the environment.
3
UNREAL (UNsupervised REinforcement and Auxiliary Learning)
They introduce auxiliary tasks that train the agent to predict and control the environment.
The outputs learned for these auxiliary tasks (auxiliary policies, value functions) are not used directly.
Instead, their effect reaches the main policy indirectly, through the representations learned in the process.
4
Main task
[Architecture: visual input x_t → CNN → FC → LSTM (hidden state h_{t−1} → h_t) → FC heads producing π(a|s; θ) and V^π(s; θ_v)]
They use the Asynchronous Advantage Actor-Critic (A3C) algorithm for the main task.
Rewards: apple (+1), lemon (−1), reaching the goal (+10), etc.
All experience is stored in a replay buffer for auxiliary learning.
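A minimal PyTorch sketch of such a CNN–LSTM actor-critic network; the layer sizes and the 84×84 input resolution are illustrative assumptions, not the exact values from the paper:

```python
import torch
import torch.nn as nn

class A3CAgent(nn.Module):
    """CNN -> LSTM -> separate policy and value heads (layer sizes are assumptions)."""
    def __init__(self, num_actions, in_channels=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        self.lstm = nn.LSTMCell(256, 256)
        self.policy_head = nn.Linear(256, num_actions)   # logits of pi(a|s; theta)
        self.value_head = nn.Linear(256, 1)              # V(s; theta_v)

    def forward(self, x_t, hidden=None):
        # x_t: (B, C, 84, 84) visual input; hidden: (h_{t-1}, c_{t-1}) or None
        z = self.conv(x_t)
        z = self.fc(z.flatten(start_dim=1))
        h_t, c_t = self.lstm(z, hidden)
        logits = self.policy_head(h_t)
        value = self.value_head(h_t).squeeze(-1)
        return logits, value, (h_t, c_t)
```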
5
A3C - main idea
Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih et al., 2016
[Diagram: agent1 … agent4 each interact with their own copy of the environment, sending a_t and receiving s_{t+1}, r_t]
Multiple agents interact with copies of the environment simultaneously and accumulate gradients.
The globally shared parameters are synchronized periodically.
6
A3C - Loss function
L_A3C = L_VR + L_π − E_{s∼π}[α H(π(s, ·; θ))]
L_VR = E_{s∼π}[(R_{t:t+n} + γ^n V(s_{t+n+1}; θ⁻) − V(s_t; θ))²]
L_π = −E_{s∼π}[R_{1:∞}]
R_{t:t+n} = Σ_{i=1..n} γ^{i−1} r_{t+i−1}
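A sketch of how these three terms could be computed for one n-step rollout; the tensor shapes and the choice to detach the bootstrap value and the advantage in the policy term are assumptions about a typical implementation:

```python
import torch
import torch.nn.functional as F

def a3c_loss(logits, values, actions, rewards, bootstrap_value,
             gamma=0.99, entropy_coef=0.01):
    """Returns L_VR + L_pi - alpha * H(pi) for one rollout of length n.
    logits: (n, A), values: (n,), actions: (n,), rewards: (n,)."""
    # n-step returns R_{t:t+n}, bootstrapped with the value of the state after the rollout
    returns, R = [], bootstrap_value.detach()
    for r in reversed(rewards.tolist()):
        R = r + gamma * R
        returns.append(R)
    returns = torch.stack(list(reversed(returns)))

    advantages = returns - values
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    value_loss = advantages.pow(2).mean()                  # L_VR
    policy_loss = -(chosen * advantages.detach()).mean()   # L_pi (policy gradient surrogate)
    entropy = -(probs * log_probs).sum(dim=-1).mean()      # H(pi(s, .; theta))
    return value_loss + policy_loss - entropy_coef * entropy
```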
7
A3C - pseudo code
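The pseudocode image did not survive extraction. A hedged sketch of one worker under the scheme described on the previous slides; the environment interface, `make_env`, and the shared-optimizer handling are assumptions:

```python
import copy
import torch

def a3c_worker(global_net, shared_optimizer, make_env, n_steps=20, gamma=0.99):
    """One of several asynchronous workers: interacts with its own copy of the
    environment, accumulates gradients locally, and applies them to the shared
    parameters.  `make_env` is assumed to return tensor observations (1, C, 84, 84)."""
    env = make_env()                                          # per-worker copy of the environment
    local_net = copy.deepcopy(global_net)
    obs, hidden, done = env.reset(), None, False
    while True:
        local_net.load_state_dict(global_net.state_dict())   # sync with the shared parameters
        logits_l, values_l, actions_l, rewards_l = [], [], [], []
        for _ in range(n_steps):
            logits, value, hidden = local_net(obs, hidden)
            action = torch.distributions.Categorical(logits=logits).sample()
            obs, reward, done, _ = env.step(action.item())
            logits_l.append(logits); values_l.append(value)
            actions_l.append(action); rewards_l.append(float(reward))
            if done:
                obs, hidden = env.reset(), None
                break
        with torch.no_grad():
            bootstrap = torch.zeros(()) if done else local_net(obs, hidden)[1].squeeze()
        loss = a3c_loss(torch.cat(logits_l), torch.cat(values_l),
                        torch.cat(actions_l), torch.tensor(rewards_l),
                        bootstrap, gamma)
        shared_optimizer.zero_grad()
        loss.backward()
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad                               # hand the accumulated gradients to the shared model
        shared_optimizer.step()
```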
8
A3C - experimental results
Data efficiency
9
A3C - experimental results
Training speed
10
Value function replay
Perform value-function regression one more time, on experience sampled from the replay buffer, for faster convergence:
θ_v ← θ_v + η ∇_{θ_v} V^π(s; θ_v) (R_{t:t+n} + γ^n V(s′; θ⁻) − V^π(s; θ_v))
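A sketch of this value-replay term, assuming the replay buffer already stores the n-step return and that `net` is the CNN–LSTM agent from the earlier sketch:

```python
import torch

def value_replay_loss(net, batch, gamma=0.99):
    """Extra value regression on replayed experience: fit V(s; theta_v) to
    R_{t:t+n} + gamma^n * V(s'; theta^-).  The `batch` layout is an assumption."""
    states, n_step_returns, next_states, n = batch
    with torch.no_grad():                                  # target is not differentiated (theta^- in the slide)
        target = n_step_returns + (gamma ** n) * net(next_states)[1]
    value = net(states)[1]                                 # V(s; theta_v) from the agent's value head
    return (target - value).pow(2).mean()
```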
11
Reward prediction
Sample experiences from the replay buffer so that P(r ≠ 0) = 0.5, i.e. zero and non-zero rewards are equally represented.
Reward prediction is a three-way classification: positive, zero, or negative.
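A sketch of the skewed sampling and the three-way classification; the replay-buffer attributes and the `rp_classifier` module are hypothetical names:

```python
import random
import torch
import torch.nn.functional as F

def sample_rp_sequence(replay_buffer):
    """Skewed sampling: a short observation history such that P(reward != 0) = 0.5.
    The two per-class pools on the buffer are assumed attributes."""
    pool = (replay_buffer.nonzero_reward_seqs if random.random() < 0.5
            else replay_buffer.zero_reward_seqs)
    return random.choice(pool)                 # -> (obs_history, next_reward)

def reward_prediction_loss(rp_classifier, obs_history, next_reward):
    """Cross-entropy over the three classes: positive / zero / negative."""
    target = 0 if next_reward > 0 else (1 if next_reward == 0 else 2)
    logits = rp_classifier(obs_history)        # (1, 3) logits from a small classifier head
    return F.cross_entropy(logits, torch.tensor([target]))
```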
12
Pixel control
Divide the input frame into an n × n grid of cells.
Pseudo-reward = |I(G_ij) − I(G′_ij)|, the absolute difference in average intensity between the current and next frame, where I(G_ij) is the average intensity of the (i, j)-th grid cell.
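A sketch of computing these pseudo-rewards, assuming frames are cropped to 80×80 and averaged into an n×n grid (n = 20 is an assumption):

```python
import torch
import torch.nn.functional as F

def pixel_control_rewards(frame, next_frame, n=20):
    """Pseudo-reward per grid cell: |I(G_ij) - I(G'_ij)|, the absolute change in
    average intensity between consecutive frames.  Assumes (C, 80, 80) frames."""
    def cell_intensity(f):
        gray = f.mean(dim=0, keepdim=True).unsqueeze(0)        # average over colour channels -> (1, 1, H, W)
        cell = f.shape[-1] // n                                # cell size (4 pixels for 80x80, n=20)
        return F.avg_pool2d(gray, kernel_size=cell).squeeze()  # mean intensity per cell -> (n, n)
    return (cell_intensity(frame) - cell_intensity(next_frame)).abs()
```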
13
Pixel control
[Diagram: an experience e_t = (s_t, a_t, r_t, s_{t+1}) is sampled from the replay buffer; the CNN–LSTM processes s_t and s_{t+1} to obtain I(G_ij) and I(G′_ij)]
For each grid cell, the agent learns to maximize the change in average intensity with Q-learning:
θ ← θ + η ∇_θ Q^aux(s, a; θ) (r^aux + γ max_{a′} Q^aux(s′, a′; θ⁻) − Q^aux(s, a; θ))
r^aux = |I(G_ij) − I(G′_ij)|
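A sketch of the corresponding auxiliary Q-learning loss over all grid cells, assuming the replayed batch carries n-step pseudo-returns and that `q_net` is the deconvolutional head described on the next slide:

```python
import torch

def pixel_control_loss(q_net, batch, gamma=0.9):
    """L_PC: Q-learning on the auxiliary pseudo-rewards, one Q-value per action
    and grid cell.  q_net(s) -> (B, N_act, n, n); the batch layout and the
    discount gamma=0.9 are assumptions."""
    states, actions, aux_returns, next_states, n_steps = batch   # aux_returns: (B, n, n)
    with torch.no_grad():
        # max_{a'} Q_aux(s', a'; theta^-), computed per grid cell
        bootstrap = q_net(next_states).max(dim=1).values
        target = aux_returns + (gamma ** n_steps) * bootstrap
    # Q_aux(s, a; theta) for the action that was actually taken (same action for every cell)
    idx = actions.view(-1, 1, 1, 1).expand(-1, 1, *target.shape[-2:])
    q_taken = q_net(states).gather(1, idx).squeeze(1)
    return (target - q_taken).pow(2).mean()
```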
14
Pixel control
N_act = number of actions
[Diagram: the LSTM output passes through an FC layer to a d × w × h spatial feature map; a deconvolutional dueling network then produces A^aux (N_act × n × n) and V^aux (1 × n × n), which are combined into the auxiliary Q-values]
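A sketch of that auxiliary dueling head; the intermediate 32 × 7 × 7 feature map and the deconvolution parameters are assumptions chosen so the output is a 20 × 20 grid:

```python
import torch
import torch.nn as nn

class PixelControlHead(nn.Module):
    """Deconvolutional dueling head producing Q_aux over an n x n grid of cells.
    The 32 x 7 x 7 intermediate map and deconv parameters are assumptions that
    yield a 20 x 20 output."""
    def __init__(self, num_actions, lstm_dim=256):
        super().__init__()
        self.fc = nn.Linear(lstm_dim, 32 * 7 * 7)
        self.deconv_value = nn.ConvTranspose2d(32, 1, kernel_size=8, stride=2)           # V_aux: 1 x n x n
        self.deconv_adv = nn.ConvTranspose2d(32, num_actions, kernel_size=8, stride=2)   # A_aux: N_act x n x n

    def forward(self, h_t):
        spatial = torch.relu(self.fc(h_t)).view(-1, 32, 7, 7)   # d x w x h spatial feature map
        value = self.deconv_value(spatial)                      # (B, 1, 20, 20)
        advantage = self.deconv_adv(spatial)                    # (B, N_act, 20, 20)
        # dueling combination: Q = V + (A - mean_a A)
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```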
15
Dueling network
Ziyu Wang, Tom Schaul et al., 2016, ICLR
The Q-network approximates the state-value function and the advantage function separately:
Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_{a′} A(s, a′; θ, α))
16
Dueling network Saliency map
17
Aggregated loss function
L_UNREAL(θ) = L_A3C + λ_VR L_VR + λ_PC Σ_c L_Q^(c) + λ_RP L_RP
L_A3C = L_VR + L_π − E_{s∼π}[α H(π(s, ·; θ))]
L_VR = E_{s∼π}[(R_{t:t+n} + γ^n V(s_{t+n+1}; θ⁻) − V(s_t; θ))²]
L_π = −E_{s∼π}[R_{1:∞}]
R_{t:t+n} = Σ_{i=1..n} γ^{i−1} r_{t+i−1}
L_Q^(c) = E_{s,a,r,s′}[(R_{t:t+n} + γ^n max_{a′} Q^(c)(s_{t+n+1}, a′; θ⁻) − Q^(c)(s_t, a; θ))²]
L_RP = −Σ_x p_x log q_x   (p_x = ground truth, q_x = prediction)
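A sketch of assembling the weighted sum from the individual loss sketches above; the λ values are placeholders, since the paper treats them as hyperparameters:

```python
def unreal_loss(on_policy_rollout, replay_batches,
                lambda_vr=1.0, lambda_pc=0.05, lambda_rp=1.0):
    """L_UNREAL = L_A3C + lambda_VR L_VR + lambda_PC sum_c L_Q^c + lambda_RP L_RP.
    The loss helpers are the sketches from the earlier slides; the lambdas are
    hyperparameters (values here are placeholders)."""
    loss = a3c_loss(*on_policy_rollout)                                     # L_A3C, on-policy
    loss = loss + lambda_vr * value_replay_loss(*replay_batches["vr"])      # L_VR, replayed experience
    loss = loss + lambda_pc * pixel_control_loss(*replay_batches["pc"])     # L_Q^c for pixel control
    loss = loss + lambda_rp * reward_prediction_loss(*replay_batches["rp"]) # L_RP, skewed samples
    return loss
```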
18
Experiments - Labyrinth
Four difficulty levels: 1) gathering fruit, 2) static maps, 3) dynamic maps, 4) opponents and slopes
19
Experiments
20
Experiments
21
Experiments
22
Conclusion
On the 3D visual Labyrinth domain, UNREAL reached 87% of expert human performance.
Without using the auxiliary policies directly, it achieved much better performance and data efficiency than the A3C baseline.