Reinforcement learning with unsupervised auxiliary tasks


1 Reinforcement learning with unsupervised auxiliary tasks
Max Jaderberg, Volodymyr Mnih et al., ICLR 2017. Presented by Youngnam Kim.

2 Main idea
A deep neural network maps the history of observations $\{o_1, \dots, o_t\}$, the history of actions $\{a_1, \dots, a_t\}$, and the history of rewards $\{r_1, \dots, r_t\}$ to a representation of the state $s_t$.
We generally believe that good representations result in a good policy. Then, how can we make the agent learn good representations?
Assuming there is an agent that can predict and control the environment perfectly, that agent can achieve any goal in the environment.

3 UNREAL (UNsupervised REinforcement and Auxiliary Learning)
They introduce auxiliary tasks so that the agent learns to predict and control the environment.
The quantities learned for these auxiliary tasks (auxiliary policies, value functions) are not used directly.
Instead, their effect reaches the main policy indirectly, through the representations learned in the process.

4 Main task
Architecture (diagram in the original slide): the visual input $o_t$ passes through a CNN to give features $x_t$; an LSTM takes the previous state $h_{t-1}$ and produces the representation $h_t$; two FC heads output the policy $\pi(a|s;\theta)$ and the value $V^{\pi}(s;\theta_v)$.
They use the Asynchronous Advantage Actor-Critic (A3C) algorithm for the main task.
Rewards: apple (+1), lemon (-1), reaching the goal (+10), etc.
All experience is stored in a replay buffer for auxiliary learning.
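The slide gives only the diagram; below is a minimal PyTorch sketch of such a CNN-LSTM actor-critic network (layer sizes and the 84x84 input are illustrative assumptions, not the exact UNREAL configuration).

```python
import torch
import torch.nn as nn


class A3CNetwork(nn.Module):
    """CNN -> LSTM -> policy/value heads, as in the slide's diagram."""

    def __init__(self, num_actions, in_channels=3):
        super().__init__()
        # Convolutional encoder: visual input o_t -> features x_t
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        # Recurrent core: (x_t, h_{t-1}) -> h_t, the learned state representation
        self.lstm = nn.LSTMCell(256, 256)
        # Separate heads for the policy pi(a|s; theta) and the value V(s; theta_v)
        self.policy_head = nn.Linear(256, num_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, obs, hidden=None):
        x = self.conv(obs)                      # obs: (B, C, 84, 84)
        x = self.fc(x.view(x.size(0), -1))
        h, c = self.lstm(x, hidden)
        logits = self.policy_head(h)            # unnormalized action preferences
        value = self.value_head(h).squeeze(-1)  # scalar state-value estimate
        return logits, value, (h, c)
```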

5 A3C - main idea
Asynchronous Methods for Deep Reinforcement Learning, Volodymyr Mnih et al., 2016.
Diagram in the original slide: each of several agents (agent1, ..., agent4) interacts with its own copy of the environment, sending $a_t$ and receiving $s_{t+1}, r_t$.
Multiple agents interact with copies of the environment simultaneously and accumulate gradients.
Each agent synchronizes with the globally shared parameters periodically.

6 A3C - Loss function
$L_{A3C} = L_{VR} + L_{\pi} - \mathbb{E}_{s \sim \pi}\left[\alpha H(\pi(s, a; \theta))\right]$
$L_{VR} = \mathbb{E}_{s \sim \pi}\left[\left(R_{t:t+n} + \gamma^{n} V^{\pi}(s_{t+n+1}; \theta^{-}) - V^{\pi}(s_{t}; \theta)\right)^{2}\right]$
$L_{\pi} = -\mathbb{E}_{s \sim \pi}\left[R_{1:\infty}\right]$
$R_{t:t+n} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i-1}$
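As a concrete illustration of these terms, here is a hedged Python sketch that computes the n-step return, the value-regression loss $L_{VR}$, and the entropy-regularized policy-gradient loss for a single rollout (tensor shapes and the stop-gradient on the advantage are my assumptions).

```python
import torch
import torch.nn.functional as F


def a3c_loss(logits, values, actions, rewards, bootstrap_value,
             gamma=0.99, entropy_coef=0.01):
    """n-step A3C loss for one rollout.

    logits: (n, num_actions), values: (n,), actions: (n,), rewards: (n,);
    bootstrap_value is V(s_{t+n+1}; theta^-) used to bootstrap the return.
    """
    # n-step returns R_{t:t+n} + gamma^n * V(s_{t+n+1}), built backwards
    returns, R = [], bootstrap_value
    for r in reversed(rewards.tolist()):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))

    advantages = returns - values.detach()      # no gradient through the baseline
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen_log_probs = log_probs.gather(1, actions.view(-1, 1)).squeeze(1)

    value_loss = F.mse_loss(values, returns)                  # L_VR
    policy_loss = -(chosen_log_probs * advantages).mean()     # L_pi (policy gradient)
    entropy = -(probs * log_probs).sum(dim=-1).mean()         # H(pi), encourages exploration

    return value_loss + policy_loss - entropy_coef * entropy
```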

7 A3C - pseudo code
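The pseudocode figure on this slide is not reproduced in the transcript; in its place, a rough sketch of the per-worker A3C loop the figure describes, reusing a3c_loss from the sketch above (the environment API, make_env, and the shared-gradient handoff are assumptions, not the original pseudocode).

```python
import copy
import torch


def a3c_worker(global_net, shared_optimizer, make_env, t_max=20, gamma=0.99):
    """One asynchronous worker: act with a local copy of the parameters,
    accumulate gradients, and apply them to the globally shared model."""
    env = make_env()                       # each worker owns its own environment copy
    local_net = copy.deepcopy(global_net)  # thread-local parameter copy
    obs, hidden, done = env.reset(), None, False

    while True:
        local_net.load_state_dict(global_net.state_dict())  # synchronize theta' <- theta
        logits_buf, values, actions, rewards = [], [], [], []

        for _ in range(t_max):             # roll out up to t_max steps
            logits, value, hidden = local_net(obs, hidden)
            action = torch.distributions.Categorical(logits=logits).sample()
            obs, reward, done = env.step(action)
            logits_buf.append(logits); values.append(value)
            actions.append(action); rewards.append(reward)
            if done:
                obs, hidden = env.reset(), None
                break

        # Bootstrap with V(s') unless the episode terminated.
        bootstrap = 0.0 if done else local_net(obs, hidden)[1].item()
        loss = a3c_loss(torch.cat(logits_buf), torch.cat(values),
                        torch.cat(actions), torch.tensor(rewards), bootstrap)

        shared_optimizer.zero_grad()
        loss.backward()                    # gradients accumulate on the local copy...
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad             # ...and are handed to the shared optimizer
        shared_optimizer.step()
```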

8 A3C – experimental results
Data efficiency

9 A3C – experimental results
Training speed

10 Value function replay
Perform value-function regression one more time, on experience sampled from the replay buffer, for faster convergence:
$\theta_v \leftarrow \theta_v + \eta \, \nabla_{\theta_v} V^{\pi}(s; \theta_v) \left(R_{t:t+n} + \gamma^{n} V(s'; \theta^{-}) - V^{\pi}(s; \theta_v)\right)$
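A minimal sketch of such a value-replay step, assuming a replay batch that already carries the bootstrapped n-step returns and reusing the network sketch from slide 4 (the batch layout is an assumption).

```python
def value_replay_loss(net, batch):
    """Extra value regression on replayed experience (the value-replay term).

    batch["obs"]: replayed observations; batch["returns"]: n-step returns that
    already include the gamma^n * V(s'; theta^-) bootstrap from the slide's rule.
    """
    _, values, _ = net(batch["obs"])
    # Squared regression error; its gradient w.r.t. theta_v matches the slide's
    # update theta_v <- theta_v + eta * grad(V) * (target - V), up to a factor of 2.
    return ((batch["returns"] - values) ** 2).mean()
```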

11 Reward prediction
Sample experiences from the replay buffer such that $P(r_{\tau} = 0) = 0.5$, so that rewarding and non-rewarding histories are equally represented.
The auxiliary task is reward classification into three classes: positive, zero, negative.
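A rough sketch of the skewed sampling and the three-way classification loss (the buffer layout, the seq.rewards field, and the class indexing are illustrative assumptions).

```python
import random
import torch
import torch.nn.functional as F


def sample_reward_prediction_batch(buffer, batch_size):
    """Skewed sampling: half the sampled histories end in a zero reward and
    half in a non-zero reward, i.e. P(r_tau = 0) = 0.5 even if rewards are rare."""
    zero = [seq for seq in buffer if seq.rewards[-1] == 0]
    nonzero = [seq for seq in buffer if seq.rewards[-1] != 0]
    half = batch_size // 2
    return random.sample(zero, half) + random.sample(nonzero, batch_size - half)


def reward_prediction_loss(logits, rewards):
    """Three-way classification of the upcoming reward (class order is arbitrary)."""
    labels = torch.zeros_like(rewards, dtype=torch.long)    # 0 = zero reward
    labels[rewards > 0] = 1                                  # 1 = positive
    labels[rewards < 0] = 2                                  # 2 = negative
    # Cross-entropy; for one-hot ground truth this equals the slide's L_RP.
    return F.cross_entropy(logits, labels)
```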

12 Pixel control
Split the input frame into an $n \times n$ grid of cells.
Auxiliary reward for cell $(i, j)$: $|I(G_{ij}) - I(G'_{ij})|$, the absolute difference of average intensity between the current and next frame, where $I(G_{ij})$ is the average intensity of the $(i, j)$-th cell.
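A small sketch of how these per-cell rewards could be computed, assuming grayscale frames and non-overlapping cells (the cell size is an illustrative choice).

```python
import torch.nn.functional as F


def pixel_control_rewards(frame, next_frame, cell_size=4):
    """Per-cell auxiliary reward |I(G_ij) - I(G'_ij)| over an n x n grid.

    frame, next_frame: (H, W) grayscale intensity tensors; returns an
    (H // cell_size, W // cell_size) tensor of rewards.
    """
    cur = F.avg_pool2d(frame.unsqueeze(0).unsqueeze(0), cell_size)       # I(G_ij)
    nxt = F.avg_pool2d(next_frame.unsqueeze(0).unsqueeze(0), cell_size)  # I(G'_ij)
    return (cur - nxt).abs().squeeze()
```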

13 Pixel control
Diagram in the original slide: transitions $e_t = (o_t, a_t, r_t, o_{t+1})$ are sampled from the replay buffer; the CNN-LSTM encodes $o_t$ into $s$ and $o_{t+1}$ into $s'$, and the cell intensities $I(G_{ij})$ and $I(G'_{ij})$ are computed from the two frames.
For each grid cell, the auxiliary agent learns to maximally change the average intensity, using Q-learning:
$\theta \leftarrow \theta + \eta \, \nabla_{\theta} Q^{aux}(s, a; \theta) \left(r^{aux} + \gamma \max_{a'} Q^{aux}(s', a'; \theta^{-}) - Q^{aux}(s, a; \theta)\right)$
$r^{aux} = |I(G_{ij}) - I(G'_{ij})|$
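A hedged sketch of the corresponding one-step Q-learning loss over the spatial grid of auxiliary Q-values, assuming the head on the next slide outputs a (num_actions, n, n) tensor.

```python
def pixel_control_loss(q_aux, q_aux_next, action, rewards, gamma=0.99):
    """One-step Q-learning loss for pixel control, per grid cell.

    q_aux, q_aux_next: (num_actions, n, n) Q-values for s and s' (the latter
    playing the role of the target network); action: the replayed action;
    rewards: (n, n) per-cell rewards |I(G_ij) - I(G'_ij)|.
    """
    # Target r_aux + gamma * max_a' Q_aux(s', a'), with no gradient through it.
    target = rewards + gamma * q_aux_next.detach().max(dim=0).values
    td_error = target - q_aux[action]          # (n, n) TD errors, one per cell
    return (td_error ** 2).mean()
```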

14 Pixel control
Auxiliary Q-head (diagram in the original slide): an FC layer maps the LSTM state $s$ to a $d \times w \times h$ spatial feature map; a deconvolutional network with a dueling architecture then produces $A^{aux}$ of shape $N_{act} \times n \times n$ and $V^{aux}$ of shape $1 \times n \times n$, which are combined into $Q^{aux}$ ($N_{act}$ = number of actions).
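A sketch of what such a deconvolutional dueling head could look like (channel counts and the 7x7 intermediate map are illustrative assumptions; they determine the resulting grid size n).

```python
import torch
import torch.nn as nn


class PixelControlHead(nn.Module):
    """Deconvolutional dueling head: LSTM state -> Q_aux of shape (N_act, n, n)."""

    def __init__(self, num_actions, lstm_size=256):
        super().__init__()
        self.fc = nn.Linear(lstm_size, 32 * 7 * 7)     # d x w x h spatial feature map
        # Two deconvolution streams: spatial value V_aux and advantage A_aux.
        self.deconv_value = nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2)
        self.deconv_adv = nn.ConvTranspose2d(32, num_actions, kernel_size=4, stride=2)

    def forward(self, h):
        x = torch.relu(self.fc(h)).view(-1, 32, 7, 7)
        value = self.deconv_value(x)                   # (B, 1, n, n)
        adv = self.deconv_adv(x)                       # (B, N_act, n, n)
        # Dueling aggregation: Q_aux = V_aux + (A_aux - mean_a A_aux)
        return value + adv - adv.mean(dim=1, keepdim=True)
```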

15 Dueling network
Dueling Network Architectures for Deep Reinforcement Learning, Ziyu Wang, Tom Schaul et al., ICML 2016.
The Q-network approximates the state-value function and the advantage function in separate streams:
$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha)\right)$

16 Dueling network: saliency map (figure in the original slide)

17 Aggregated loss function
$L_{UNREAL}(\theta) = L_{A3C} + \lambda_{VR} L_{VR} + \lambda_{PC} \sum_{c} L_{Q}^{(c)} + \lambda_{RP} L_{RP}$
$L_{A3C} = L_{VR} + L_{\pi} - \mathbb{E}_{s \sim \pi}\left[\alpha H(\pi(s, a; \theta))\right]$
$L_{VR} = \mathbb{E}_{s \sim \pi}\left[\left(R_{t:t+n} + \gamma^{n} V^{\pi}(s_{t+n+1}; \theta^{-}) - V^{\pi}(s_{t}; \theta)\right)^{2}\right]$
$L_{\pi} = -\mathbb{E}_{s \sim \pi}\left[R_{1:\infty}\right]$
$R_{t:t+n} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i-1}$
$L_{Q}^{(c)} = \mathbb{E}_{s, a, r, s'}\left[\left(R_{t:t+n} + \gamma^{n} \max_{a'} Q^{(c)}(s_{t+n+1}, a'; \theta^{-}) - Q^{(c)}(s_{t}, a; \theta)\right)^{2}\right]$
$L_{RP} = -\sum_{x} p(x) \log \frac{q(x)}{p(x)}$, where $p(x)$ is the ground truth and $q(x)$ is the prediction.

18 Experiments - Labyrinth
Four levels of difficulty: 1) gathering fruit, 2) a static map, 3) a dynamic map, 4) opponents and slopes.

19 Experiments

20 Experiments

21 Experiments

22 Conclusion
On the 3D visual Labyrinth tasks, UNREAL achieved 87% of human performance.
Even though the auxiliary policies are not used directly, UNREAL achieved much better performance and data efficiency than the A3C baseline.

