End-To-End Memory Networks Presenters: Vicente Ordonez, Paola Cascante
Motivation Two grand challenges in AI: models that can make multiple computational steps to answer a question or complete a task, and models that can describe long-term dependencies in sequential data.
Synthetic QA Experiments
bAbI dataset
Approach The model takes as input a set of sentences x1, ..., xn (to be stored in memory) and a query q, and outputs an answer a.
Proposal An RNN architecture where the recurrence reads from a possibly large external memory multiple times before outputting a symbol. The continuity of the model means it can be trained end-to-end from input-output pairs. It uses a global memory with shared read and write functions, and makes multiple computational steps ("hops") per output.
Proposal Models that use explicit storage and a notion of attention: a continuous form of Memory Networks, and a version of RNNsearch with multiple computational steps per output symbol.
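To make the read-from-memory recurrence concrete, here is a minimal numpy sketch of a single memory hop under these ideas, with toy dimensions, random parameters, and illustrative variable names (the actual implementation details are assumptions, not the authors' code): sentences and the question are embedded, a softmax over inner products gives an attention distribution over memories, and the answer is predicted from the attended sum plus the question state.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V, n = 20, 177, 10            # embedding dim, vocab size, number of memories (toy values)

rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (d, V))   # input memory embedding
C = rng.normal(0, 0.1, (d, V))   # output memory embedding
B = rng.normal(0, 0.1, (d, V))   # question embedding
W = rng.normal(0, 0.1, (V, d))   # answer prediction matrix

x = rng.integers(0, 2, (n, V)).astype(float)  # bag-of-words sentence vectors (illustrative data)
q = rng.integers(0, 2, V).astype(float)       # bag-of-words question

m = x @ A.T                      # memory vectors m_i = A x_i
c = x @ C.T                      # output vectors c_i = C x_i
u = B @ q                        # internal state u = B q

p = softmax(m @ u)               # attention over memories: p_i = softmax(u^T m_i)
o = p @ c                        # response vector o = sum_i p_i c_i
a_hat = softmax(W @ (o + u))     # predicted answer distribution over the vocabulary
```

Everything in this pipeline is differentiable, which is what allows end-to-end training from input-output pairs without supervision of which memory to attend to.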
Synthetic QA Experiments There are a total of 20 different types of tasks that probe different forms of reasoning and deduction. A task is a set of example problems. A problem is a set of I sentences {xi} where I ≤ 320, a question sentence q, and an answer a. The vocabulary is of size V = 177. Two versions of the data are used: one with 1,000 training problems per task and a second, larger one with 10,000 per task.
Model Details K = 3 hops were used, with adjacent weight sharing: the output embedding for one layer is the input embedding for the one above, i.e. A^{k+1} = C^k; the answer prediction matrix is the same as the final output embedding, i.e. W^T = C^K; and the question embedding is the same as the input embedding of the first layer, i.e. B = A^1.
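The sketch below extends the single-hop example to K hops with adjacent weight sharing; it is an illustrative numpy implementation of the tying scheme above (A^{k+1} = C^k, W^T = C^K, B = A^1), not the authors' code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, V, n, K = 20, 177, 10, 3      # toy dimensions; K = 3 hops as in the paper
rng = np.random.default_rng(0)

# K+1 embedding matrices: A^1 = emb[0], and C^k = emb[k], so A^{k+1} = C^k automatically.
emb = [rng.normal(0, 0.1, (d, V)) for _ in range(K + 1)]
B = emb[0]                       # question embedding tied to A^1
W = emb[K].T                     # answer matrix tied to final output embedding, W^T = C^K

x = rng.integers(0, 2, (n, V)).astype(float)  # bag-of-words sentences (illustrative)
q = rng.integers(0, 2, V).astype(float)       # bag-of-words question

u = B @ q
for k in range(K):
    m = x @ emb[k].T             # input memories for hop k (A^k)
    c = x @ emb[k + 1].T         # output memories for hop k (C^k)
    p = softmax(m @ u)           # attention over memories
    o = p @ c                    # response vector
    u = o + u                    # state update between hops
a_hat = softmax(W @ u)           # final answer prediction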
Model Details Two different sentence representations: a bag-of-words (BoW) representation that embeds each word of the sentence and sums the resulting vectors, e.g. m_i = Σ_j A x_ij and c_i = Σ_j C x_ij; and Position Encoding (PE), which encodes the position of words within the sentence via an element-wise weighting, m_i = Σ_j l_j · A x_ij with l_kj = (1 − j/J) − (k/d)(1 − 2j/J). Temporal encoding: modify the memory vector to m_i = Σ_j A x_ij + T_A(i), where T_A(i) is the ith row of a special matrix T_A that encodes temporal information.
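A short sketch of the Position Encoding weights l_kj defined above, using the paper's formula; the numpy helper and its name are assumptions for illustration.

```python
import numpy as np

def position_encoding(J, d):
    """Return a (J, d) matrix L with L[j-1, k-1] = l_kj = (1 - j/J) - (k/d)(1 - 2j/J),
    using the paper's 1-indexed word position j and embedding dimension k."""
    j = np.arange(1, J + 1)[:, None]   # word positions 1..J
    k = np.arange(1, d + 1)[None, :]   # embedding dimensions 1..d
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)

# Usage: m_i = sum_j L[j] * (A @ x_ij), i.e. each word embedding is reweighted
# element-wise by its position before being summed into the memory vector.
L = position_encoding(J=6, d=20)
```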
Model Details Learning time invariance by injecting random noise: add "dummy" memories to regularize T_A. Random Noise (RN): add 10% empty memories to the stories. A sketch of this regularizer follows below.
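A minimal sketch of the Random Noise idea, assuming a story is a list of sentences (lists of word ids) and that an empty memory is simply an empty sentence; the function name and exact insertion scheme are illustrative assumptions.

```python
import random

def add_random_noise(story, ratio=0.1, seed=0):
    """Insert ~ratio empty ("dummy") memories at random positions in a story."""
    rng = random.Random(seed)
    noisy = list(story)
    n_dummy = max(1, int(ratio * len(story)))
    for _ in range(n_dummy):
        pos = rng.randint(0, len(noisy))
        noisy.insert(pos, [])        # empty memory still occupies a temporal slot for T_A
    return noisy

story = [[1, 4, 7], [2, 5], [3, 6, 8]]   # toy word-id sentences
print(add_random_noise(story))
```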
Training Details 10% of the bAbI training set was held out to form a validation set. The learning rate was η = 0.01, annealed every 25 epochs by η/2 until 100 epochs were reached. The weights were initialized randomly from a Gaussian distribution with zero mean and σ = 0.1.
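A small sketch of this schedule, assuming the annealing step halves the learning rate each time (one common reading of "annealing by η/2"):

```python
def learning_rate(epoch, eta0=0.01, anneal_every=25, factor=0.5):
    """Learning rate at a given epoch: start at eta0 and halve every `anneal_every` epochs."""
    return eta0 * (factor ** (epoch // anneal_every))

for e in (0, 24, 25, 50, 75, 99):
    print(e, learning_rate(e))   # 0.01, 0.01, 0.005, 0.0025, 0.00125, 0.00125
```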
Baselines MemNN: the strongly supervised AM+NG+NL Memory Networks approach. MemNN-WSH: a weakly supervised heuristic version of MemNN where the supporting sentence labels are not used in training. LSTM: a standard LSTM model, trained using question/answer pairs only.
Results Ablations compare: BoW vs Position Encoding (PE) sentence representations; training on all 20 tasks independently vs jointly; Linear Start (LS) training vs training with softmaxes from the start; and varying the number of memory hops from 1 to 3.
Example predictions on the QA tasks
Language Modeling Experiments The perplexity on the test sets of Penn Treebank and Text8 corpora
Language Modeling Experiments Average activation weight of memory positions during 6 memory hops; white indicates where the model is attending during the kth hop.
Related Work LSTM-based models that capture long-term structure through local memory cells; stack-augmented recurrent nets with push and pop operations; the Neural Turing Machine of Graves, which uses both content- and address-based access; bidirectional RNN encoders with gated RNN decoders used for machine translation; and combinations of n-grams with a cache.
Conclusions A recurrent attention mechanism for reading the memory can be successfully trained via backpropagation on diverse tasks, with no supervision of supporting facts. It slightly outperforms tuned RNNs and LSTMs of comparable complexity.
Conclusions The model is still unable to match the performance of memory networks trained with strong supervision, and smooth lookups may not scale well to the case where a larger memory is required. Extra material: Oral Session: End-To-End Memory Networks, https://www.youtube.com/watch?v=ZwvWY9Yy76Q