Reinforcement learning with unsupervised auxiliary tasks


1 Reinforcement learning with unsupervised auxiliary tasks
Max Jaderberg, Volodymyr Mnih et al., ICLR 2017. Presented by Youngnam Kim.

2 Main idea
A deep neural network maps the history of observations $\{o_1, \dots, o_t\}$, the history of actions $\{a_1, \dots, a_t\}$, and the history of rewards $\{r_1, \dots, r_t\}$ to a representation of the state $s_t$.
We generally believe that good representations result in a good policy. Then, how can we make the agent learn good representations?
Assuming there is an agent that can predict and control the environment perfectly, that agent can achieve any goal in the environment.

3 UNREAL (UNsupervised REinforcement and Auxiliary Learning)
They introduce auxiliary tasks so that the agent learns to predict and control the environment.
The quantities learned for these auxiliary tasks (auxiliary policies, value functions) are not used directly.
Instead, their effect reaches the main policy indirectly, through the representations learned in the process.

4 Main task
Architecture (diagram in the original slide): the visual input $o_t$ passes through a CNN to give features $x_t$; an LSTM takes the previous state $h_{t-1}$ and produces the representation $h_t$; two FC heads output the policy $\pi(a|s;\theta)$ and the value $V^{\pi}(s;\theta_v)$.
They use the Asynchronous Advantage Actor-Critic (A3C) algorithm for the main task.
Rewards: apple (+1), lemon (-1), reaching the goal (+10), etc.
All experience is stored in a replay buffer for auxiliary learning.
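The slide gives only the diagram; below is a minimal PyTorch sketch of such a CNN-LSTM actor-critic network (layer sizes and the 84x84 input are illustrative assumptions, not the exact UNREAL configuration).

```python
import torch
import torch.nn as nn


class A3CNetwork(nn.Module):
    """CNN -> LSTM -> policy/value heads, as in the slide's diagram."""

    def __init__(self, num_actions, in_channels=3):
        super().__init__()
        # Convolutional encoder: visual input o_t -> features x_t
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        # Recurrent core: (x_t, h_{t-1}) -> h_t, the learned state representation
        self.lstm = nn.LSTMCell(256, 256)
        # Separate heads for the policy pi(a|s; theta) and the value V(s; theta_v)
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, obs, hidden=None):
        x = self.conv(obs)                      # obs: (B, C, 84, 84)
        x = self.fc(x.view(x.size(0), -1))
        h, c = self.lstm(x, hidden)
        logits = self.policy_head(h)            # unnormalized action preferences
        value = self.value_head(h).squeeze(-1)  # scalar state-value estimate
        return logits, value, (h, c)
```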

5 A3C - main idea
Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih et al., 2016.
Diagram in the original slide: each of several agents (agent1, ..., agent4) interacts with its own copy of the environment, sending $a_t$ and receiving $s_{t+1}, r_t$.
Multiple agents interact with copies of the environment simultaneously and accumulate gradients.
Each agent synchronizes with the globally shared parameters periodically.

6 A3C - Loss function
$L_{A3C} = L_{VR} + L_{\pi} - \mathbb{E}_{s \sim \pi}\left[\alpha H(\pi(s, a; \theta))\right]$
$L_{VR} = \mathbb{E}_{s \sim \pi}\left[\left(R_{t:t+n} + \gamma^{n} V^{\pi}(s_{t+n+1}; \theta^{-}) - V^{\pi}(s_{t}; \theta)\right)^{2}\right]$
$L_{\pi} = -\mathbb{E}_{s \sim \pi}\left[R_{1:\infty}\right]$
$R_{t:t+n} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i-1}$
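As a concrete illustration of these terms, here is a hedged Python sketch that computes the n-step return, the value-regression loss $L_{VR}$, and the entropy-regularized policy-gradient loss for a single rollout (tensor shapes and the stop-gradient on the advantage are my assumptions).

```python
import torch
import torch.nn.functional as F


def a3c_loss(logits, values, actions, rewards, bootstrap_value,
             gamma=0.99, entropy_coef=0.01):
    """n-step A3C loss for one rollout.

    logits: (n, num_actions), values: (n,), actions: (n,), rewards: (n,);
    bootstrap_value is V(s_{t+n+1}; theta^-) used to bootstrap the return.
    """
    # n-step returns R_{t:t+n} + gamma^n * V(s_{t+n+1}), built backwards
    returns, R = [], bootstrap_value
    for r in reversed(rewards.tolist()):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))

    advantages = returns - values.detach()      # no gradient through the baseline
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen_log_probs = log_probs.gather(1, actions.view(-1, 1)).squeeze(1)

    value_loss = F.mse_loss(values, returns)                  # L_VR
    policy_loss = -(chosen_log_probs * advantages).mean()     # L_pi (policy gradient)
    entropy = -(probs * log_probs).sum(dim=-1).mean()         # H(pi), encourages exploration

    return value_loss + policy_loss - entropy_coef * entropy
```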

7 A3C - pseudo code
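The pseudocode figure on this slide is not reproduced in the transcript; in its place, a rough sketch of the per-worker A3C loop the figure describes, reusing a3c_loss from the sketch above (the environment API, make_env, and the shared-gradient handoff are assumptions, not the original pseudocode).

```python
import copy
import torch


def a3c_worker(global_net, shared_optimizer, make_env, t_max=20, gamma=0.99):
    """One asynchronous worker: act with a local copy of the parameters,
    accumulate gradients, and apply them to the globally shared model."""
    env = make_env()                       # each worker owns its own environment copy
    local_net = copy.deepcopy(global_net)  # thread-local parameter copy
    obs, hidden, done = env.reset(), None, False

    while True:
        local_net.load_state_dict(global_net.state_dict())  # synchronize theta' <- theta
        logits_buf, values, actions, rewards = [], [], [], []

        for _ in range(t_max):             # roll out up to t_max steps
            logits, value, hidden = local_net(obs, hidden)
            action = torch.distributions.Categorical(logits=logits).sample()
            obs, reward, done = env.step(action)
            logits_buf.append(logits); values.append(value)
            actions.append(action); rewards.append(reward)
            if done:
                obs, hidden = env.reset(), None
                break

        # Bootstrap with V(s') unless the episode terminated.
        bootstrap = 0.0 if done else local_net(obs, hidden)[1].item()
        loss = a3c_loss(torch.cat(logits_buf), torch.cat(values),
                        torch.cat(actions), torch.tensor(rewards), bootstrap)

        shared_optimizer.zero_grad()
        loss.backward()                    # gradients accumulate on the local copy...
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad             # ...and are handed to the shared optimizer
        shared_optimizer.step()
```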

8 A3C – experimental results
Data efficiency

9 A3C – experimental results
Training speed

10 Value function replay
Perform value-function regression one more time, on experience sampled from the replay buffer, for faster convergence:
$\theta_v \leftarrow \theta_v + \eta \, \nabla_{\theta_v} V^{\pi}(s; \theta_v) \left(R_{t:t+n} + \gamma^{n} V(s'; \theta^{-}) - V^{\pi}(s; \theta_v)\right)$
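A minimal sketch of such a value-replay step, assuming a replay batch that already carries the bootstrapped n-step returns and reusing the network sketch from slide 4 (the batch layout is an assumption).

```python
def value_replay_loss(net, batch):
    """Extra value regression on replayed experience (the value-replay term).

    batch["obs"]: replayed observations; batch["returns"]: n-step returns that
    already include the gamma^n * V(s'; theta^-) bootstrap from the slide's rule.
    """
    _, values, _ = net(batch["obs"])
    # Squared regression error; its gradient w.r.t. theta_v matches the slide's
    # update theta_v <- theta_v + eta * grad(V) * (target - V), up to a factor of 2.
    return ((batch["returns"] - values) ** 2).mean()
```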

11 Reward prediction
Sample experiences from the replay buffer such that $P(r_{\tau} = 0) = 0.5$, so that rewarding and non-rewarding histories are equally represented.
The auxiliary task is reward classification into three classes: positive, zero, negative.
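A rough sketch of the skewed sampling and the three-way classification loss (the buffer layout, the seq.rewards field, and the class indexing are illustrative assumptions).

```python
import random
import torch
import torch.nn.functional as F


def sample_reward_prediction_batch(buffer, batch_size):
    """Skewed sampling: half the sampled histories end in a zero reward and
    half in a non-zero reward, i.e. P(r_tau = 0) = 0.5 even if rewards are rare."""
    zero = [seq for seq in buffer if seq.rewards[-1] == 0]
    nonzero = [seq for seq in buffer if seq.rewards[-1] != 0]
    half = batch_size // 2
    return random.sample(zero, half) + random.sample(nonzero, batch_size - half)


def reward_prediction_loss(logits, rewards):
    """Three-way classification of the upcoming reward (class order is arbitrary)."""
    labels = torch.zeros_like(rewards, dtype=torch.long)    # 0 = zero reward
    labels[rewards > 0] = 1                                  # 1 = positive
    labels[rewards < 0] = 2                                  # 2 = negative
    # Cross-entropy; for one-hot ground truth this equals the slide's L_RP.
    return F.cross_entropy(logits, labels)
```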

12 Pixel control
Split the input frame into an $n \times n$ grid of cells.
Auxiliary reward for cell $(i, j)$: $|I(G_{ij}) - I(G'_{ij})|$, the absolute difference of average intensity between the current and next frame, where $I(G_{ij})$ is the average intensity of the $(i, j)$-th cell.
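A small sketch of how these per-cell rewards could be computed, assuming grayscale frames and non-overlapping cells (the cell size is an illustrative choice).

```python
import torch.nn.functional as F


def pixel_control_rewards(frame, next_frame, cell_size=4):
    """Per-cell auxiliary reward |I(G_ij) - I(G'_ij)| over an n x n grid.

    frame, next_frame: (H, W) grayscale intensity tensors; returns an
    (H // cell_size, W // cell_size) tensor of rewards.
    """
    cur = F.avg_pool2d(frame.unsqueeze(0).unsqueeze(0), cell_size)       # I(G_ij)
    nxt = F.avg_pool2d(next_frame.unsqueeze(0).unsqueeze(0), cell_size)  # I(G'_ij)
    return (cur - nxt).abs().squeeze()
```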

13 Pixel control
Diagram in the original slide: transitions $e_t = (o_t, a_t, r_t, o_{t+1})$ are sampled from the replay buffer; the CNN-LSTM encodes $o_t$ into $s$ and $o_{t+1}$ into $s'$, and the cell intensities $I(G_{ij})$ and $I(G'_{ij})$ are computed from the two frames.
For each grid cell, the auxiliary agent learns to maximally change the average intensity, using Q-learning:
$\theta \leftarrow \theta + \eta \, \nabla_{\theta} Q^{aux}(s, a; \theta) \left(r^{aux} + \gamma \max_{a'} Q^{aux}(s', a'; \theta^{-}) - Q^{aux}(s, a; \theta)\right)$
$r^{aux} = |I(G_{ij}) - I(G'_{ij})|$
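A hedged sketch of the corresponding one-step Q-learning loss over the spatial grid of auxiliary Q-values, assuming the head on the next slide outputs a (num_actions, n, n) tensor.

```python
def pixel_control_loss(q_aux, q_aux_next, action, rewards, gamma=0.99):
    """One-step Q-learning loss for pixel control, per grid cell.

    q_aux, q_aux_next: (num_actions, n, n) Q-values for s and s' (the latter
    playing the role of the target network); action: the replayed action;
    rewards: (n, n) per-cell rewards |I(G_ij) - I(G'_ij)|.
    """
    # Target r_aux + gamma * max_a' Q_aux(s', a'), with no gradient through it.
    target = rewards + gamma * q_aux_next.detach().max(dim=0).values
    td_error = target - q_aux[action]          # (n, n) TD errors, one per cell
    return (td_error ** 2).mean()
```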

14 Pixel control
Auxiliary Q-head (diagram in the original slide): an FC layer maps the LSTM state $s$ to a $d \times w \times h$ spatial feature map; a deconvolutional network with a dueling architecture then produces $A^{aux}$ of shape $N_{act} \times n \times n$ and $V^{aux}$ of shape $1 \times n \times n$, which are combined into $Q^{aux}$ ($N_{act}$ = number of actions).
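A sketch of what such a deconvolutional dueling head could look like (channel counts and the 7x7 intermediate map are illustrative assumptions; they determine the resulting grid size n).

```python
import torch
import torch.nn as nn


class PixelControlHead(nn.Module):
    """Deconvolutional dueling head: LSTM state -> Q_aux of shape (N_act, n, n)."""

    def __init__(self, num_actions, lstm_size=256):
        super().__init__()
        self.fc = nn.Linear(lstm_size, 32 * 7 * 7)     # d x w x h spatial feature map
        # Two deconvolution streams: spatial value V_aux and advantage A_aux.
        self.deconv_value = nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2)
        self.deconv_adv = nn.ConvTranspose2d(32, num_actions, kernel_size=4, stride=2)

    def forward(self, h):
        x = torch.relu(self.fc(h)).view(-1, 32, 7, 7)
        value = self.deconv_value(x)                   # (B, 1, n, n)
        adv = self.deconv_adv(x)                       # (B, N_act, n, n)
        # Dueling aggregation: Q_aux = V_aux + (A_aux - mean_a A_aux)
        return value + adv - adv.mean(dim=1, keepdim=True)
```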

15 Dueling network
Dueling Network Architectures for Deep Reinforcement Learning, Ziyu Wang, Tom Schaul et al., ICML 2016.
The Q-network approximates the state-value function and the advantage function in separate streams:
$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha)\right)$

16 Dueling network: saliency map (figure in the original slide)

17 Aggregated loss function
$L_{UNREAL}(\theta) = L_{A3C} + \lambda_{VR} L_{VR} + \lambda_{PC} \sum_{c} L_{Q}^{(c)} + \lambda_{RP} L_{RP}$
$L_{A3C} = L_{VR} + L_{\pi} - \mathbb{E}_{s \sim \pi}\left[\alpha H(\pi(s, a; \theta))\right]$
$L_{VR} = \mathbb{E}_{s \sim \pi}\left[\left(R_{t:t+n} + \gamma^{n} V^{\pi}(s_{t+n+1}; \theta^{-}) - V^{\pi}(s_{t}; \theta)\right)^{2}\right]$
$L_{\pi} = -\mathbb{E}_{s \sim \pi}\left[R_{1:\infty}\right]$
$R_{t:t+n} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i-1}$
$L_{Q}^{(c)} = \mathbb{E}_{s, a, r, s'}\left[\left(R_{t:t+n} + \gamma^{n} \max_{a'} Q^{(c)}(s_{t+n+1}, a'; \theta^{-}) - Q^{(c)}(s_{t}, a; \theta)\right)^{2}\right]$
$L_{RP} = -\sum_{x} p(x) \log \frac{q(x)}{p(x)}$, where $p(x)$ is the ground truth and $q(x)$ is the prediction.

18 Experiments - Labyrinth
Four levels of difficulty: 1) gathering fruit, 2) a static map, 3) a dynamic map, 4) opponents and slopes.

19 Experiments

20 Experiments

21 Experiments

22 Conclusion
On the 3D visual Labyrinth tasks, UNREAL achieved 87% of human performance.
Even though the auxiliary policies are not used directly, UNREAL achieved much better performance and data efficiency than the A3C baseline.

