Published by Liana Widjaja. Modified over 6 years ago.
1
Hybrid computing using a neural network with dynamic external memory
Alex Graves, Greg Wayne et al., 2016, Nature
Youngnam Kim
2
Outline
This paper proposes an improved version of the neural Turing machine (NTM), named the Differentiable Neural Computer (DNC).
The 3 main differences are:
- dynamic memory allocation
- improved location-based addressing: temporal memory linkage
- the agent can learn how much to write
The presentation covers:
- neural Turing machines, briefly
- the differences between NTMs and DNCs
- experimental results
3
Neural Turing machines (Alex Graves et al., 2014)
- imitate Turing machines with memory networks
- have an external memory $\mathbf{M}_t \in \mathbb{R}^{N \times d}$, where $N$ is the number of memory locations and $d$ is the memory vector dimension
- have read and write heads; every interaction with memory must be differentiable
- a controller learns what and where to read and write; RNNs are generally used
4
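Neural Turing machines – read and write
To stay differentiable, the heads use attention: they read and write everywhere, to different extents.
Reading:
$\sum_i w_t(i) = 1, \quad 0 \le w_t(i) \le 1$
$\mathbf{r}_t \leftarrow \sum_i w_t(i)\, \mathbf{M}_t(i)$
Writing:
$\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i) \odot [\mathbf{1} - w_t(i)\, \mathbf{e}_t] + w_t(i)\, \mathbf{a}_t$
where $\mathbf{e}_t \in \mathbb{R}^{d}$ is an erase vector and $\mathbf{a}_t \in \mathbb{R}^{d}$ is an add vector.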
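The soft read and write rules can be sketched in a few lines of plain Python. This is only an illustration under assumptions: memory is a list of rows, and the function names `read` and `write` are hypothetical, not taken from the paper.

```python
def read(memory, w):
    """Soft read: r = sum_i w[i] * M[i]."""
    d = len(memory[0])
    return [sum(w[i] * memory[i][k] for i in range(len(memory))) for k in range(d)]


def write(memory, w, erase, add):
    """Soft write: M[i] <- M[i] * (1 - w[i] * e) + w[i] * a, elementwise."""
    return [
        [m_ik * (1.0 - w[i] * erase[k]) + w[i] * add[k] for k, m_ik in enumerate(row)]
        for i, row in enumerate(memory)
    ]


memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [0.0, 1.0, 0.0]  # attention fully on location 1
r = read(memory, w)
new_memory = write(memory, w, erase=[1.0, 1.0], add=[9.0, 8.0])
```

With a one-hot weighting the operations reduce to an ordinary indexed read and overwrite; with a softer weighting every row is touched a little, which is what keeps the operation differentiable.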
5
Neural Turing machines – addressing
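Addressing: how to produce the weights for read and write operations
- Content-based addressing
- Location-based addressing
Content-based addressing: a controller produces a key vector $\mathbf{k}_t$ and a key strength $\beta_t \ge 1$; the content-based weight $\mathbf{w}_t^c$ is then
$w_t^c(i) = \frac{\exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}_t(i))\}}{\sum_j \exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}_t(j))\}}$
$S(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert\, \lVert \mathbf{v} \rVert}$
$S$ is a similarity function, generally cosine similarity.
DNCs use the same content-based addressing as NTMs.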
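As a rough sketch of content-based addressing, the following computes a softmax over scaled cosine similarities. The helper names and the small epsilon in the denominator are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity S(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # epsilon is an assumption, guards against zero rows


def content_weights(memory, key, beta):
    """Softmax over beta * S(key, M[i]) for every memory row."""
    scores = [beta * cosine(key, row) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_c = content_weights(memory, key=[1.0, 0.0], beta=5.0)
```

A larger `beta` concentrates the weighting on the best-matching row; a value near 1 leaves it spread out.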
6
Neural Turing machines – addressing
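Location-based addressing (different from DNCs)
In NTMs, the content weights $\mathbf{w}_t^c$ and the previous weights $\mathbf{w}_{t-1}$ are interpolated before the shift:
$\mathbf{w}_t^g \leftarrow g_t\, \mathbf{w}_t^c + (1 - g_t)\, \mathbf{w}_{t-1}$
where the interpolation gate $g_t$ is a scalar in the range $(0, 1)$.
After interpolation, $\mathbf{w}_t^g$ is shifted using the shift distribution $\mathbf{s}_t$ (indices are taken modulo $N$):
$w_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$
To avoid leakage and dispersion of the weightings, a sharpening parameter $\gamma_t \ge 1$ is used:
$w_t(i) \leftarrow \frac{w_t(i)^{\gamma_t}}{\sum_j w_t(j)^{\gamma_t}}$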
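The three steps (interpolate, shift, sharpen) can be sketched as follows. For simplicity this example assumes the shift distribution is stored as a length-N list indexed by offset; the function names are hypothetical.

```python
def interpolate(w_c, w_prev, g):
    """w_g = g * w_c + (1 - g) * w_prev."""
    return [g * c + (1.0 - g) * p for c, p in zip(w_c, w_prev)]


def shift(w_g, s):
    """Circular convolution: w[i] = sum_j w_g[j] * s[(i - j) mod N]."""
    n = len(w_g)
    return [sum(w_g[j] * s[(i - j) % n] for j in range(n)) for i in range(n)]


def sharpen(w, gamma):
    """Raise each weight to the power gamma and renormalise."""
    powered = [x ** gamma for x in w]
    total = sum(powered)
    return [p / total for p in powered]


w_prev = [0.0, 1.0, 0.0, 0.0]            # head was at location 1
w_c = [0.25, 0.25, 0.25, 0.25]           # uninformative content weights
w_g = interpolate(w_c, w_prev, g=0.0)    # gate closed: keep the previous weights
s = [0.0, 1.0, 0.0, 0.0]                 # all mass on offset +1
w_shifted = shift(w_g, s)                # head moves to location 2
w = sharpen(w_shifted, gamma=2.0)
```

When `s` is not one-hot the shifted weighting blurs over neighbouring locations, which is the dispersion that sharpening counteracts.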
7
Neural Turing machines – addressing
An example of shift weightings.
Disadvantage: we can only iterate over adjacent elements.
8
Differentiable Neural Computers – architecture
9
Differentiable Neural Computers – write operation
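Dynamic memory allocation
The agent learns to decide whether a location it has just read can be freed. To do this, an allocation weighting $\mathbf{a}_t \in [0,1]^N$ is computed from a usage vector $\mathbf{u}_t$. When $u_t[i]$ is close to 0, the $i$-th memory location is free.
$\boldsymbol{\psi}_t = \prod_{i=1}^{R} (\mathbf{1} - f_t^i\, \mathbf{w}_{t-1}^{r,i})$
$\mathbf{u}_t = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w - \mathbf{u}_{t-1} \odot \mathbf{w}_{t-1}^w) \odot \boldsymbol{\psi}_t$
$a_t[\phi_t[j]] = (1 - u_t[\phi_t[j]]) \prod_{i=1}^{j-1} u_t[\phi_t[i]]$
Annotations on the equations: overwriting; how free the location is (the factor $1 - u_t[\phi_t[j]]$); forcing the freer locations to be used first (the product term).
where
- $\boldsymbol{\psi}_t$ is a retention vector, $f_t^i$ is the free gate of read head $i$, and $\phi_t$ is the free list
- $\mathbf{w}_{t-1}^{r,i}$ is the $i$-th read weighting of the previous time step and $\mathbf{w}_{t-1}^w$ is the previous write weighting
- the free list $\phi_t$ is a list of location indices sorted in ascending order of usage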
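The retention, usage and allocation computations can be sketched as below. This is a minimal illustration with hypothetical function names; vectors are plain lists and there is a single read head in the example.

```python
def retention(free_gates, read_weights_prev):
    """psi = prod_i (1 - f_i * w_r_i), elementwise over locations."""
    psi = [1.0] * len(read_weights_prev[0])
    for f, w_r in zip(free_gates, read_weights_prev):
        psi = [p * (1.0 - f * w) for p, w in zip(psi, w_r)]
    return psi


def update_usage(u_prev, w_write_prev, psi):
    """u = (u_prev + w - u_prev * w) * psi, elementwise."""
    return [(u + w - u * w) * p for u, w, p in zip(u_prev, w_write_prev, psi)]


def allocation(u):
    """a[phi[j]] = (1 - u[phi[j]]) * prod_{i<j} u[phi[i]], phi sorted by usage."""
    free_list = sorted(range(len(u)), key=lambda i: u[i])  # ascending usage
    a = [0.0] * len(u)
    prod = 1.0
    for idx in free_list:
        a[idx] = (1.0 - u[idx]) * prod
        prod *= u[idx]
    return a


u_prev = [1.0, 0.5, 0.0]
w_write_prev = [0.0, 0.0, 0.0]          # nothing was written at the last step
read_prev = [[1.0, 0.0, 0.0]]           # one read head, it just read location 0
psi = retention([1.0], read_prev)       # free gate fully open
u = update_usage(u_prev, w_write_prev, psi)
a = allocation(u)
```

In the example, location 0 was fully used but is released by the free gate, so it becomes the first entry of the free list and receives the whole allocation weighting.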
10
Differentiable Neural Computers – write operation
The write weighting interpolates the content weighting $\mathbf{c}_t^w$ and the allocation weighting $\mathbf{a}_t$:
$\mathbf{w}_t^w = g_t^w\, [\, g_t^a\, \mathbf{a}_t + (1 - g_t^a)\, \mathbf{c}_t^w \,]$
where $g_t^w$ is a write gate and $g_t^a$ is an allocation gate.
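A small sketch of how the two gates act, assuming plain Python lists and a hypothetical function name:

```python
def write_weighting(a, c_w, g_w, g_a):
    """w_w = g_w * (g_a * a + (1 - g_a) * c_w), elementwise."""
    return [g_w * (g_a * ai + (1.0 - g_a) * ci) for ai, ci in zip(a, c_w)]


a = [1.0, 0.0, 0.0]    # allocation weighting: location 0 is free
c_w = [0.0, 0.0, 1.0]  # content weighting: location 2 matches the write key

w_alloc = write_weighting(a, c_w, g_w=1.0, g_a=1.0)    # write to the free slot
w_content = write_weighting(a, c_w, g_w=1.0, g_a=0.0)  # update the matching slot
w_none = write_weighting(a, c_w, g_w=0.0, g_a=0.5)     # write gate closed
```

The write gate scales the whole weighting, which is how the agent can learn how much to write, including not writing at all.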
11
Differentiable Neural Computers – write operation
Figure: copy of 10 sequences of length 5 with memory size 10
12
Differentiable Neural Computers – read operation
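Temporal memory linkage
After a write operation, we can store information about the order in which the data were written.
Let the linkage matrix be $\mathbf{L} \in [0,1]^{N \times N}$, where $L[i,j]$ represents the degree to which location $i$ was the location written to after location $j$.
$\mathbf{p}_t = \left(1 - \sum_i w_t^w[i]\right) \mathbf{p}_{t-1} + \mathbf{w}_t^w, \quad \mathbf{p}_0 = \mathbf{0}$
$L_t[i,j] = (1 - w_t^w[i] - w_t^w[j])\, L_{t-1}[i,j] + w_t^w[i]\, p_{t-1}[j]$
$L_0[i,j] = 0 \;\; \forall i,j, \quad L_t[i,i] = 0 \;\; \forall i$
Notes on the terms:
- $1 - \sum_i w_t^w[i]$ goes to 0 when the write is not null
- $p_{t-1}[j]$ is the degree to which the latest valid write operation attended to location $j$
- when $w_t^w[i]$ or $w_t^w[j]$ is close to 1, the factor $1 - w_t^w[i] - w_t^w[j]$ cuts the existing link from $j$ to $i$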
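To make the update concrete, here is a minimal sketch that writes to three locations in the order 0, 2, 1 and tracks the precedence vector and link matrix. The function names are hypothetical and the one-hot write weightings are chosen only to keep the result easy to read.

```python
def update_precedence(p_prev, w_w):
    """p_t = (1 - sum(w_w)) * p_prev + w_w."""
    total = sum(w_w)
    return [(1.0 - total) * p + w for p, w in zip(p_prev, w_w)]


def update_link(L_prev, w_w, p_prev):
    """L_t[i][j] = (1 - w_w[i] - w_w[j]) * L_prev[i][j] + w_w[i] * p_prev[j]."""
    n = len(w_w)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # the diagonal stays at zero
                L[i][j] = (1.0 - w_w[i] - w_w[j]) * L_prev[i][j] + w_w[i] * p_prev[j]
    return L


n = 3
L = [[0.0] * n for _ in range(n)]
p = [0.0] * n
for w_w in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]):  # write 0, 2, 1
    L = update_link(L, w_w, p)      # uses the precedence from the previous step
    p = update_precedence(p, w_w)
```

After the loop, `L[2][0]` and `L[1][2]` are 1: location 2 was written after location 0, and location 1 after location 2, regardless of where those locations sit in memory.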
13
Differentiable Neural Computers – read operation
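Temporal memory linkage
The agent can choose which direction to read. The forward weighting $\mathbf{f}_t^i$ and the backward weighting $\mathbf{b}_t^i$ of read head $i$ are
$\mathbf{f}_t^i = \mathbf{L}_t\, \mathbf{w}_{t-1}^{r,i}$
$\mathbf{b}_t^i = \mathbf{L}_t^{\top}\, \mathbf{w}_{t-1}^{r,i}$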
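A small worked example of the two products, assuming a hand-built link matrix for the write order 0, 2, 1 (the helper names `matvec` and `transpose` are just for this sketch):

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# L[i][j] = 1 means location i was written right after location j.
# Assumed write order for this example: 0, then 2, then 1.
L = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
w_r_prev = [0.0, 0.0, 1.0]  # the head previously read location 2

forward = matvec(L, w_r_prev)              # location written after 2
backward = matvec(transpose(L), w_r_prev)  # location written before 2
```

Here `forward` points at location 1 and `backward` at location 0, following the write order rather than the memory layout.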
14
Differentiable Neural Computers – read operation
Read mode
Each read head can choose which mode to read in, using $\boldsymbol{\pi}_t^i \in [0,1]^3$. The resulting read weighting of read head $i$ is
$\mathbf{w}_t^{r,i} = \pi_t^i[1]\, \mathbf{b}_t^i + \pi_t^i[2]\, \mathbf{c}_t^{r,i} + \pi_t^i[3]\, \mathbf{f}_t^i$
Then we can iterate over written sequences forward and backward regardless of their actual locations.
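The read-mode mixture is a weighted sum of three candidate weightings. A minimal sketch, with a hypothetical function name and with `pi` indexed from 0 rather than from 1 as on the slide:

```python
def read_weighting(pi, backward, content, forward):
    """w_r = pi[0] * backward + pi[1] * content + pi[2] * forward."""
    return [pi[0] * b + pi[1] * c + pi[2] * f
            for b, c, f in zip(backward, content, forward)]


backward = [1.0, 0.0, 0.0]
content = [0.0, 0.0, 1.0]
forward = [0.0, 1.0, 0.0]

w_forward = read_weighting([0.0, 0.0, 1.0], backward, content, forward)
w_mixed = read_weighting([0.5, 0.5, 0.0], backward, content, forward)
```

Putting all of the mode weight on the forward entry at every step lets a head replay a sequence in the order it was written.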
15
Differentiable Neural Computers – controller
DNCs use a deep LSTM (an LSTM with multiple layers) as the controller.
- $x_t$ is an input
- $r_{t-1}^i$ is the read vector of read head $i$ at the previous time step
- $v_t$ and $\xi_t$ are outputs
- $\xi_t$ is an interface vector
16
Differentiable Neural Computers – experiments
- bAbI: a question answering dataset consisting of 20 types of reasoning; 10,000 training examples, 1,000 test examples
- Graph task: trained on inference, shortest path and traversal on randomly generated graphs; tested on the London Underground and a family tree
- Mini-SHRDLU: moving blocks to satisfy given constraints; reinforcement learning
17
Differentiable Neural Computers – bAbI
Example: ‘mary journeyed to the kitchen. mary moved to the bathroom. john went back to the hallway. john picked up the milk there. Q: what is john carrying?’ The answer is milk.
- a lexicon of 159 unique words
- one-hot vector encoding is used
- the DNC is a classifier here
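One-hot encoding of the story can be sketched as follows. The vocabulary here is built from the example sentence only and the whitespace tokenisation is an assumption; the real lexicon has 159 words.

```python
story = ("mary journeyed to the kitchen . mary moved to the bathroom . "
         "john went back to the hallway . john picked up the milk there .")
tokens = story.split()

# Toy vocabulary from this one story (assumption for the example).
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


encoded = [one_hot(t) for t in tokens]
```

Since the answer is a single word from the lexicon, predicting it amounts to classification over the vocabulary.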
18
Differentiable Neural Computers – bAbI
19
Differentiable Neural Computers – Graph task
- 0-999 labels
- 1) regress the optimal policy
- 2) 10 time steps of planning
- 0-9, direct
- 10-410, relation (not input)
- check whether the DNC remembers a graph
20
Differentiable Neural Computers – Graph task
Logistic regressor:
- input: the write vector
- target: the input triple at that time
21
Differentiable Neural Computers – Graph task
22
Differentiable Neural Computers – Graph task
23
Differentiable Neural Computers – Graph task
24
Differentiable Neural Computers – extra experiments
- DNC trained with a memory size of 256 for traversal
- fraction of completed tasks over 100 traversal tasks
- (source node, edge, destination node)
25
Differentiable Neural Computers – mini SHRDLU
- reward: the number of satisfied constraints
- penalty: when taking an invalid action
- logistic regressor: the input is the contents average vector; the target is the first 5 actions by the agent
- input dimension *9
- 7 actions
26
Differentiable Neural Computers – mini SHRDLU
- Perfect = all constraints satisfied in the minimal number of moves
- Success = all constraints satisfied, in any number of moves
- Incomplete = failed to satisfy all constraints
27
Differentiable Neural Computers – conclusion
- Reasoning about and representing complex data structures is important.
- DNCs can capture the variability of tasks while maintaining domain regularity: the controller learns the domain regularity and writes the variability to memory.
- A future direction is to make the model work without adapting its parameters.