Hybrid computing using a neural network with dynamic external memory

Presentation transcript:

Hybrid computing using a neural network with dynamic external memory
Alex Graves, Greg Wayne et al., Nature, 2016
Presented by Youngnam Kim

Outline
This paper proposes an improved version of the neural Turing machine, called the Differentiable Neural Computer (DNC). The three main differences are:
- dynamic memory allocation
- improved location-based addressing via temporal memory linkage
- the agent learns how much to write
The presentation covers:
- neural Turing machines, briefly
- the differences between NTMs and DNCs
- experimental results

Neural Turing machines (Alex Graves et al., 2014) imitate Turing machines with a memory network that has:
- an external memory $\mathbf{M}_t \in \mathbb{R}^{N \times d}$, where $N$ is the number of memory locations and $d$ is the memory vector dimension
- read and write heads; every interaction with the memory must be differentiable
- a controller that learns what and where to read and write; generally an RNN is used

Neural Turing machines – read and write
To be differentiable, both operations use attention: the heads read and write everywhere, but to different extents.
Reading:
$\sum_i w_t(i) = 1, \quad 0 \le w_t(i) \le 1$
$\mathbf{r}_t \leftarrow \sum_i w_t(i)\, \mathbf{M}_t(i)$
Writing:
$\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i) \odot \left[ \mathbf{1} - w_t(i)\, \mathbf{e}_t \right] + w_t(i)\, \mathbf{a}_t$
where $\mathbf{e}_t \in \mathbb{R}^{d}$ is an erase vector and $\mathbf{a}_t \in \mathbb{R}^{d}$ is an add vector.
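A minimal NumPy sketch of these two operations (a toy illustration following the slide's notation, not the authors' code):

```python
import numpy as np

def ntm_read(M, w):
    """Weighted read: r = sum_i w(i) * M(i), with w a distribution over locations."""
    return w @ M                              # shape (d,)

def ntm_write(M_prev, w, e, a):
    """Erase then add at every location, scaled by the write weighting w."""
    M = M_prev * (1.0 - np.outer(w, e))       # erase: M(i) *= 1 - w(i) * e
    return M + np.outer(w, a)                 # add:   M(i) += w(i) * a

# toy example with N = 4 locations, d = 3 dimensions
M = np.zeros((4, 3))
w = np.array([0.7, 0.2, 0.1, 0.0])            # attention over locations
M = ntm_write(M, w, e=np.ones(3), a=np.array([1.0, 2.0, 3.0]))
r = ntm_read(M, w)
```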

Neural Turing machines – addressing
Addressing: how the weightings for the read and write operations are produced. There are two mechanisms: content-based addressing and location-based addressing.
Content-based addressing: the controller produces a key vector $\mathbf{k}_t$ and a key strength $\beta_t \ge 1$; the content-based weighting $\mathbf{w}_t^c$ is
$w_t^c(i) = \frac{\exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}_t(i))\}}{\sum_j \exp\{\beta_t\, S(\mathbf{k}_t, \mathbf{M}_t(j))\}}, \qquad S(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\, \|\mathbf{v}\|}$
where $S$ is a similarity function, generally cosine similarity. The DNC uses the same content-based addressing as the NTM.
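A sketch of content-based addressing in NumPy (a softmax over key-memory cosine similarity; the small epsilon is an added assumption for numerical safety):

```python
import numpy as np

def content_weighting(M, k, beta, eps=1e-8):
    """w_c(i) proportional to exp(beta * cosine_similarity(k, M(i)))."""
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    scores = beta * sim
    scores -= scores.max()                    # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```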

Neural Turing machines – addressing
Location-based addressing (this is where DNCs differ). In NTMs, the content weighting $\mathbf{w}_t^c$ is first interpolated with the previous weighting $\mathbf{w}_{t-1}$:
$\mathbf{w}_t^g \leftarrow g_t\, \mathbf{w}_t^c + (1 - g_t)\, \mathbf{w}_{t-1}$
where the interpolation gate $g_t$ is a scalar in the range $(0,1)$. After interpolation, the weighting is rotated by a shift distribution $\mathbf{s}_t$:
$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j)$
To avoid leakage and dispersion of the weighting, a sharpening parameter $\gamma_t \ge 1$ is applied:
$w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$
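The three steps (interpolate, circularly shift, sharpen) in one compact sketch; the modular indexing in the shift is an assumption consistent with circular convolution:

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    """NTM location-based addressing: interpolation, circular shift, sharpening."""
    w_g = g * w_c + (1.0 - g) * w_prev                     # interpolation gate g in (0, 1)
    N = len(w_g)
    w_shift = np.array([sum(w_g[j] * s[(i - j) % N] for j in range(N))
                        for i in range(N)])                # circular convolution with s
    w_sharp = w_shift ** gamma                             # sharpening, gamma >= 1
    return w_sharp / w_sharp.sum()
```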

Neural Turing machines – addressing
An example of shift weightings. Disadvantage: with shifts alone we can only iterate over adjacent memory locations.

Differentiable Neural Computers – architecture

Differentiable Neural Computers – write operation
Dynamic memory allocation: the agent learns to decide whether the locations it has read from should be freed (via the free gates). From the memory usage, an allocation weighting $\mathbf{a}_t \in [0,1]^N$ is computed; when the usage $u_t(i)$ is close to 0, the $i$-th memory location is free.
$\boldsymbol{\psi}_t = \prod_{i=1}^{R} \left( \mathbf{1} - f_t^i\, \mathbf{w}_{t-1}^{r,i} \right)$
$\mathbf{u}_t = \left( \mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{w} - \mathbf{u}_{t-1} \odot \mathbf{w}_{t-1}^{w} \right) \odot \boldsymbol{\psi}_t$
$\mathbf{a}_t[\boldsymbol{\phi}_t[j]] = \left( 1 - \mathbf{u}_t[\boldsymbol{\phi}_t[j]] \right) \prod_{i=1}^{j-1} \mathbf{u}_t[\boldsymbol{\phi}_t[i]]$
where $\boldsymbol{\psi}_t$ is the retention vector, $f_t^i$ is the free gate of read head $i$, $\mathbf{w}_{t-1}^{r,i}$ is the read weighting of read head $i$ at the previous time step, $\mathbf{w}_{t-1}^{w}$ is the previous write weighting, and the free list $\boldsymbol{\phi}_t$ is the list of memory indices sorted in ascending order of usage. Usage grows where writing (overwriting) occurred, the factor $1 - u_t[\phi_t[j]]$ measures how free a location is, and the running product forces the allocation to prefer the freest locations.
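A sketch of this allocation mechanism, assuming usage, weightings, and free gates are NumPy arrays shaped as in the slide ($R$ read heads, $N$ locations):

```python
import numpy as np

def allocation_weighting(u_prev, w_w_prev, w_r_prev, f):
    """u_prev: (N,) usage, w_w_prev: (N,) write weighting,
    w_r_prev: (R, N) read weightings, f: (R,) free gates."""
    psi = np.prod(1.0 - f[:, None] * w_r_prev, axis=0)         # retention vector
    u = (u_prev + w_w_prev - u_prev * w_w_prev) * psi          # usage update
    phi = np.argsort(u)                                        # free list: ascending usage
    a = np.zeros_like(u)
    running = 1.0
    for j in phi:                                              # freest locations get most allocation
        a[j] = (1.0 - u[j]) * running
        running *= u[j]
    return a, u
```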

Differentiable Neural Computers – write operation
The write weighting interpolates the content weighting $\mathbf{c}_t^w$ and the allocation weighting $\mathbf{a}_t$:
$\mathbf{w}_t^w = g_t^w \left[ g_t^a\, \mathbf{a}_t + (1 - g_t^a)\, \mathbf{c}_t^w \right]$
where $g_t^w$ is the write gate and $g_t^a$ is the allocation gate.
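This step is a one-liner; a sketch with the same symbols:

```python
def write_weighting(c_w, a, g_w, g_a):
    """Blend allocation and content weightings, then scale by the write gate."""
    return g_w * (g_a * a + (1.0 - g_a) * c_w)
```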

Differentiable Neural Computers – write operation
Figure: copying 10 sequences of length 5 with a memory of size 10.

Differentiable Neural Computers – read operation
Temporal memory linkage: after the write operation, the DNC stores information about the order in which data were written. The temporal link matrix $\mathbf{L}_t \in [0,1]^{N \times N}$, where $L_t[i,j]$ represents the degree to which location $i$ was written to just after location $j$, is updated through the precedence weighting $\mathbf{p}_t$:
$\mathbf{p}_t = \left( 1 - \sum_i w_t^w(i) \right) \mathbf{p}_{t-1} + \mathbf{w}_t^w, \qquad \mathbf{p}_0 = \mathbf{0}$
$L_t[i,j] = \left( 1 - w_t^w(i) - w_t^w(j) \right) L_{t-1}[i,j] + w_t^w(i)\, p_{t-1}(j)$
$L_0[i,j] = 0 \;\; \forall i,j, \qquad L_t[i,i] = 0 \;\; \forall i$
The factor $1 - \sum_i w_t^w(i)$ goes to 0 when the write is not null, $p_{t-1}(j)$ is the degree to which the latest valid write attended to location $j$, and when the write weighting is close to 1 the old links from $j$ to $i$ are cut and replaced.
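A sketch of the link-matrix update (a vectorized form of the elementwise equation above; names follow the slide):

```python
import numpy as np

def update_linkage(L_prev, p_prev, w_w):
    """Update the temporal link matrix L and the precedence weighting p."""
    p = (1.0 - w_w.sum()) * p_prev + w_w
    # L[i, j] <- (1 - w_w[i] - w_w[j]) * L_prev[i, j] + w_w[i] * p_prev[j]
    L = (1.0 - w_w[:, None] - w_w[None, :]) * L_prev + np.outer(w_w, p_prev)
    np.fill_diagonal(L, 0.0)                  # no self-links: L[i, i] = 0
    return L, p
```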

Differentiable Neural Computers – read operation
Temporal memory linkage: the agent can choose in which direction to read. The forward weighting $\mathbf{f}_t^i$ and backward weighting $\mathbf{b}_t^i$ of read head $i$ are
$\mathbf{f}_t^i = \mathbf{L}_t\, \mathbf{w}_{t-1}^{r,i}, \qquad \mathbf{b}_t^i = \mathbf{L}_t^{T}\, \mathbf{w}_{t-1}^{r,i}$
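In code these are two matrix-vector products:

```python
def directional_weightings(L, w_r_prev):
    """Forward: locations written just after what this head last read; backward: just before."""
    return L @ w_r_prev, L.T @ w_r_prev
```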

Differentiable Neural Computers – read operation
Read modes: each read head chooses how to read using a mode vector $\boldsymbol{\pi}_t^i \in [0,1]^3$. The resulting read weighting of read head $i$ is
$\mathbf{w}_t^{r,i} = \pi_t^i[1]\, \mathbf{b}_t^i + \pi_t^i[2]\, \mathbf{c}_t^{r,i} + \pi_t^i[3]\, \mathbf{f}_t^i$
The DNC can therefore iterate over written sequences forwards and backwards regardless of their actual memory locations.
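A sketch of mixing the three modes (backward, content, forward) with the 3-element mode vector:

```python
def read_weighting(pi, b, c_r, f):
    """pi: (3,) read-mode weights; b, c_r, f: (N,) backward/content/forward weightings."""
    return pi[0] * b + pi[1] * c_r + pi[2] * f
```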

Differentiable Neural Computers – controller
The DNC uses a deep (multi-layer) LSTM as its controller. At each step, $x_t$ is the input and $r_{t-1}^i$ is the read vector of read head $i$ from the previous time step; the controller emits two outputs, the output vector $v_t$ and the interface vector $\xi_t$, which parameterizes the read and write heads.
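A rough sketch of how the controller's hidden state could be split into these two outputs. The interface size follows the paper's layout ($RW$ read keys, $R$ read strengths, write key, write strength, erase and write vectors, $R$ free gates, allocation and write gates, $3R$ read modes); the hidden size, output size, and weight matrices here are hypothetical:

```python
import numpy as np

R, W = 4, 16                                   # read heads, memory width (example values)
interface_size = R * W + 3 * W + 5 * R + 3     # per the DNC interface layout
hidden, out_size = 128, 159                    # hypothetical LSTM size; 159 = bAbI lexicon

W_v  = np.random.randn(hidden, out_size) * 0.01
W_xi = np.random.randn(hidden, interface_size) * 0.01

def controller_outputs(h_t):
    """Split the LSTM hidden state into the network output and the interface vector."""
    v_t  = h_t @ W_v                           # task output (e.g. word logits for bAbI)
    xi_t = h_t @ W_xi                          # later parsed into keys, gates, strengths, modes
    return v_t, xi_t
```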

Differentiable Neural Computers – experiments
- bAbI question answering: a dataset of 20 types of reasoning tasks, with 10,000 training examples and 1,000 test examples.
- Graph tasks: training on inference, shortest-path, and traversal tasks over randomly generated graphs; testing on the London Underground map and a family tree.
- Mini-SHRDLU: moving blocks to satisfy given constraints, trained with reinforcement learning.

Differentiable Neural Computers – bAbI
Example: 'mary journeyed to the kitchen. mary moved to the bathroom. john went back to the hallway. john picked up the milk there. Q: what is john carrying?' The answer is 'milk'. The lexicon contains 159 unique words and one-hot vector encoding is used; the DNC acts as a classifier here.
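A toy sketch of the one-hot encoding (the word list here is illustrative, not the full 159-word lexicon):

```python
import numpy as np

lexicon = ["mary", "journeyed", "to", "the", "kitchen", ".",
           "what", "is", "john", "carrying", "?", "milk"]   # toy subset of the lexicon
word_to_idx = {w: i for i, w in enumerate(lexicon)}

def one_hot(word):
    v = np.zeros(len(lexicon))
    v[word_to_idx[word]] = 1.0
    return v

# The DNC consumes one one-hot vector per word and, at answer positions,
# is trained as a classifier (softmax over the lexicon).
```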

Differentiable Neural Computers – bAbI

Differentiable Neural Computers – Graph task 0-999 labels 1) regress the optimal policy 2) 10-time steps of planning 0-9, direct 10-410, relation(not input) check the DNC remember a graph

Differentiable Neural Computers – Graph task
Decoding analysis: a logistic regressor is trained with the write vector as input and the input triple at that time step as target.

Differentiable Neural Computers – Graph task

Differentiable Neural Computers – Graph task

Differentiable Neural Computers – Graph task

Differentiable Neural Computers – extra experiments
A DNC trained with a memory of 256 locations on the traversal task; the plot shows the fraction of completed traversals over 100 traversal tasks, where each step is a (source node, edge, destination node) triple.

Differentiable Neural Computers – mini-SHRDLU
- Reward: the number of satisfied constraints; a penalty is given for taking an invalid action. There are 7 actions.
- Decoding analysis: a logistic regressor takes the average of the memory contents as input and predicts the first 5 actions taken by the agent.
- Input dimension: 26 + 6 + 4 + 6 + 6*9.

Differentiable Neural Computers – mini-SHRDLU
Perfect = all constraints satisfied in the minimal number of moves; Success = all constraints satisfied, regardless of the number of moves; Incomplete = failed to satisfy all constraints.

Differentiable Neural Computers – conclusion
Reasoning about and representing complex data structures is important. DNCs can handle the variability of tasks while maintaining domain regularity: the controller learns the domain regularity and writes the variable content into memory. A future direction is to solve new tasks without adapting the model's parameters.