Grid Long Short-Term Memory
Nal Kalchbrenner, Ivo Danihelka, Alex Graves (Google DeepMind)
2015/07/29
Presented by Ming-Han Yang
Outline
  Abstract
  Introduction
  Background
    Long Short-Term Memory (LSTM)
    Stacked LSTM (SLSTM)
    Multidimensional LSTM (MDLSTM)
  Grid LSTM
  Experiments
  Conclusion
Abstract
This paper introduces Grid Long Short-Term Memory, a network of LSTM cells arranged in a multidimensional grid that can be applied to vectors, sequences, or higher-dimensional data such as images. The model is evaluated on six tasks:
  We apply the model to algorithmic tasks such as integer addition and determining the parity of random binary vectors; it is able to solve these problems for 15-digit integers and 250-bit vectors respectively.
  2D Grid LSTM achieves 1.47 bits-per-character on the 100M-character Wikipedia dataset, outperforming other neural networks.
  A two-dimensional translation model based on Grid LSTM outperforms a phrase-based CDEC reference system on a Chinese-to-English translation task.
  3D Grid LSTM yields a near state-of-the-art error rate of 0.32% on MNIST.
The six tasks: parity (XOR; odd vs. even number of 1s), integer addition, memorizing a random symbol sequence, character-level language modeling, translation, and MNIST.
Introduction
Long Short-Term Memory (LSTM) networks are recurrent neural networks equipped with a special gating mechanism that controls access to memory cells. LSTM networks preserve signals and propagate errors for much longer than ordinary recurrent neural networks.
The vanishing gradient problem: deep networks suffer from exactly the same problems as recurrent networks applied to long sequences, namely that information from past computations rapidly attenuates as it progresses through the chain, and that each layer cannot dynamically select or ignore its inputs. It therefore seems attractive to generalise the advantages of LSTM to deep computation.
Notes: LSTM is a special kind of RNN with gates. The gates protect a cell's memory from being overwritten by the rest of the network during training, so the LSTM can retain more information; each cell reads, writes, and forgets independently, and the gates can learn when to forget. These properties let LSTMs handle complex data with long-range, separated dependencies, and they have achieved excellent results in speech recognition, handwriting recognition, machine translation, and image caption generation. Both DNNs and RNNs suffer from the vanishing gradient problem, and their layers cannot dynamically select which inputs to ignore; these are the advantages that LSTM offers.
Introduction
We extend LSTM cells to deep networks within a unified architecture. We introduce Grid LSTM, a network that is arranged in a grid of one or more dimensions. The network has LSTM cells along any or all of the dimensions of the grid. The depth dimension is treated like the other dimensions and also uses LSTM cells to communicate directly from one layer to the next.
One-dimensional Grid LSTM corresponds to a feed-forward network that uses LSTM cells. These networks are related to Highway Networks, where a gated transfer function is used to successfully train feed-forward networks with up to 900 layers of depth. Grid LSTM with two dimensions is analogous to the Stacked LSTM, but it adds cells along the depth dimension too. Grid LSTM with three or more dimensions is analogous to Multidimensional LSTM, but differs from it not just by having the cells along the depth dimension.
Notes: a 1D Grid LSTM behaves like an ordinary LSTM chain (feed-forward with LSTM cells); 2D is similar to a Stacked LSTM, the difference being that the SLSTM has no cell connections along the depth, only along time; 3D is similar to Multidimensional LSTM, with the same difference as above.
Background - LSTM
Input sequence and target pairs: (x_i, y_i), i = 1, ..., m. The previous inputs are x_1, ..., x_i and the past inputs are x_1, ..., x_{i-1}.
Hidden vector h ∈ ℝ^d, memory vector m ∈ ℝ^d; weight matrices W^u, W^f, W^o, W^c ∈ ℝ^{d×2d}.
H ∈ ℝ^{2d} is the concatenation of the new input x_i, transformed by a projection matrix I, and the previous hidden vector h:
  H = [I·x_i ; h]
The LSTM transform (h', m') = LSTM(H, m, W) computes
  g^u = σ(W^u H)   (input gate)
  g^f = σ(W^f H)   (forget gate: deletes parts of the previous memory vector m_{i-1})
  g^o = σ(W^o H)   (output gate)
  g^c = tanh(W^c H)   (content: new content written to the memory, modulated by the input gate g^u)
  m' = g^f ⊙ m + g^u ⊙ g^c
  h' = g^o ⊙ tanh(m')
Each memory vector is obtained by a linear transformation of the previous memory vector and the gates.
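Below is a minimal numpy sketch of the LSTM transform above; the weight names (W^u, W^f, W^o, W^c) and the projection matrix I follow the slide's notation, while all shapes, dimensions, and initial values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm(H, m, W):
    """One LSTM transform: (h', m') = LSTM(H, m, W).
    H: concatenated input vector; m: previous memory vector, shape (d,);
    W: dict of weight matrices, each of shape (d, len(H))."""
    g_u = sigmoid(W["u"] @ H)      # input gate
    g_f = sigmoid(W["f"] @ H)      # forget gate
    g_o = sigmoid(W["o"] @ H)      # output gate
    g_c = np.tanh(W["c"] @ H)      # new content
    m_new = g_f * m + g_u * g_c    # linear update of the memory vector
    h_new = g_o * np.tanh(m_new)   # new hidden vector
    return h_new, m_new

# Illustrative usage: d = 4, a 3-dimensional input x projected by I into R^d.
d = 4
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((d, 2 * d)) for k in "ufoc"}
I = 0.1 * rng.standard_normal((d, 3))
x, h, m = rng.standard_normal(3), np.zeros(d), np.zeros(d)
H = np.concatenate([I @ x, h])     # H = [I*x ; h] in R^{2d}
h, m = lstm(H, m, W)
```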
Background – SLSTM Stacked LSTM adds capacity by stacking LSTM layers on top of each other. Note that although the LSTM cells are present along the sequential computation of each LSTM network, they are not present in the vertical computation from one layer to the next.
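A sketch of one Stacked LSTM time step, reusing the lstm() helper from the sketch above; the list-of-layers layout is an assumption. It makes explicit that only the hidden vector h travels upward between layers, while the memory cells run only along time.

```python
# Sketch of one Stacked LSTM time step (list-of-layers layout assumed),
# reusing lstm() from the LSTM sketch above. Cells run along time inside
# each layer; between layers only the hidden vector h is passed upward,
# so there is no memory cell along the depth.
def stacked_lstm_step(x, hs, ms, Is, Ws):
    """x: input at this time step; hs, ms, Is, Ws: per-layer states/weights."""
    inp = x
    for l in range(len(hs)):
        H = np.concatenate([Is[l] @ inp, hs[l]])
        hs[l], ms[l] = lstm(H, ms[l], Ws[l])
        inp = hs[l]   # layer l's hidden vector is the input to layer l+1
    return hs, ms
```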
Background – MDLSTM
Here the inputs are not arranged in a sequence, but in an N-dimensional grid, such as the two-dimensional grid of pixels in an image. At each input x in the array the network receives N hidden vectors h_1, ..., h_N and N memory vectors m_1, ..., m_N, and computes a hidden vector h and a memory vector m that are passed as the next state for each of the N dimensions.
The network concatenates the transformed input I·x and the N hidden vectors h_1, ..., h_N into a vector H and computes the gates as in Eq. 1; the new memory is the sum of the N gated previous memories plus the gated new content, m = Σ_j g^f_j ⊙ m_j + g^u ⊙ g^c.
Notes: as the number of dimensions grows, the number of connections grows too. Because this summation is unconstrained, the values of m can grow larger and larger, making the network unstable. This motivates the simple alternate way of computing the output memory vectors in the Grid LSTM.
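A sketch of the Multidimensional LSTM update described above, again reusing sigmoid() from the earlier LSTM sketch; the per-dimension forget-gate layout W["f"][j] and all shapes are assumptions. It shows where the unconstrained summation over the N previous memories enters.

```python
# Sketch of the Multidimensional LSTM update for N incoming dimensions
# (weight layout assumed; sigmoid() is from the LSTM sketch above). The new
# memory sums N gated previous memories, and this unconstrained summation
# is what can make the values of m grow with N.
def mdlstm(x, hs, ms, I, W):
    """hs, ms: lists of the N incoming hidden/memory vectors."""
    H = np.concatenate([I @ x] + hs)          # shape ((N + 1) * d,)
    g_u = sigmoid(W["u"] @ H)
    g_o = sigmoid(W["o"] @ H)
    g_c = np.tanh(W["c"] @ H)
    m_new = g_u * g_c
    for j in range(len(hs)):                  # one forget gate per dimension
        g_f_j = sigmoid(W["f"][j] @ H)
        m_new = m_new + g_f_j * ms[j]         # unconstrained summation
    h_new = g_o * np.tanh(m_new)
    return h_new, m_new
```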
Grid LSTM - Architecture
  Blocks
  Priority Dimensions
  Non-LSTM Dimensions
  Inputs from Multiple Sides
  Weight Sharing
Note: when predicting sequences, the Grid LSTM has cells along two dimensions: one is the time axis, the other is the depth.
Grid LSTM
Grid LSTM - Blocks
Grid LSTM deploys cells along any or all of the dimensions, including the depth of the network. The computation is simple and proceeds as follows. The model first concatenates the input hidden vectors from the N dimensions:
  H = [h_1; ...; h_N]
Then the block computes N transforms LSTM(·, ·, ·), one for each dimension, obtaining the desired output hidden and memory vectors:
  (h'_i, m'_i) = LSTM(H, m_i, W_i),  i = 1, ..., N
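A sketch of one N-dimensional Grid LSTM block built from the lstm() helper defined earlier; each weight matrix W_i is assumed to have shape (d, N·d) so that it acts on the concatenated vector H.

```python
# Sketch of one N-dimensional Grid LSTM block, reusing lstm() from the LSTM
# sketch above. Each dimension keeps its own weights W_i (assumed shape
# (d, N*d)) and its own memory vector m_i; all transforms read the same
# concatenated H.
def grid_lstm_block(hs, ms, Ws):
    H = np.concatenate(hs)                    # H = [h_1; ...; h_N]
    outs = [lstm(H, ms[i], Ws[i]) for i in range(len(hs))]
    hs_new = [h for h, _ in outs]
    ms_new = [m for _, m in outs]
    return hs_new, ms_new
```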
Grid LSTM - Priority Dimensions
In an N-dimensional block the transforms for all dimensions are computed in parallel. But it can be useful for a dimension to know the outputs of the transforms from the other dimensions, especially if the outgoing vectors from that dimension will be used to estimate the target. For instance, to prioritize the first dimension of the network, the block first computes the N-1 transforms for the other dimensions, obtaining the output hidden vectors h'_2, ..., h'_N. Then the block concatenates these output hidden vectors and the input hidden vector h_1 for the first dimension into a new vector:
  H' = [h_1; h'_2; ...; h'_N]
This vector is then used in the final transform to obtain the prioritized output hidden and memory vectors h'_1 and m'_1.
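A sketch of the same block with the first dimension prioritized, under the same assumptions as the block sketch above: dimensions 2..N are transformed first, and their output hidden vectors replace the corresponding inputs in the concatenation seen by dimension 1.

```python
# Sketch of prioritizing dimension 1 (same assumptions as grid_lstm_block):
# dimensions 2..N are transformed first; their output hidden vectors form
# H' together with the input hidden vector of dimension 1.
def grid_lstm_block_priority(hs, ms, Ws):
    H = np.concatenate(hs)
    others = [lstm(H, ms[i], Ws[i]) for i in range(1, len(hs))]
    H_prime = np.concatenate([hs[0]] + [h for h, _ in others])
    h1, m1 = lstm(H_prime, ms[0], Ws[0])      # prioritized dimension sees the others' outputs
    return [h1] + [h for h, _ in others], [m1] + [m for _, m in others]
```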
Grid LSTM - Non-LSTM Dimensions
In Grid LSTM networks that have only a few blocks along a given dimension of the grid, it can be useful to just have regular connections along that dimension, without the use of cells. Given a weight matrix V ∈ ℝ^{d×Nd}, for the first dimension this looks as follows:
  h'_1 = α(V·H)
where α is a standard nonlinear transfer function or simply the identity. This allows us to see how, modulo the differences in the mechanism inside the blocks, a 2d Grid LSTM applied to temporal sequences with cells in the temporal dimension but not in the vertical depth dimension corresponds to the Stacked LSTM. Likewise, the 3d Grid LSTM without cells along the depth corresponds to Multidimensional LSTM stacked with one or more layers.
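A sketch of a regular (non-LSTM) connection along the first dimension, with α taken to be tanh by default; V and its shape follow the slide's notation, and no memory vector is produced for that dimension.

```python
# Sketch of a regular (non-LSTM) connection along the first dimension:
# h'_1 = alpha(V @ H), with V of assumed shape (d, N*d) and no memory
# vector kept for that dimension.
def regular_dimension(H, V, alpha=np.tanh):
    return alpha(V @ H)
```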
Grid LSTM - Inputs from Multiple Sides
If we picture an N-dimensional block, we see that N of the sides of the block have input vectors associated with them and the other N sides have output vectors. As the blocks are arranged in a grid, this separation extends to the grid as a whole: each side of the grid has either input or output vectors associated with it. In certain tasks that have inputs of different types, a model can exploit this separation by projecting each type of input on a different side of the grid. The mechanism inside the blocks ensures that the hidden and memory vectors from the different sides will interact closely without being conflated. In the translation model (Experiment 4.5), source words and target words are projected on two different sides of a Grid LSTM.
Grid LSTM – Weight Sharing
Sharing of weight matrices can be specified along any dimension in a Grid LSTM, and it can be useful to induce invariance in the computation along that dimension, as in the translation and image models. If multiple sides of a grid need to share weights, capacity can be added to the model by introducing into the grid a new dimension without sharing of weights.
Experiments - Parity
We apply one-dimensional Grid LSTM to learning parity. Given a string b_1, ..., b_k of k bits (0 or 1), the target (parity / XOR) is 1 if the sum of the bits is odd and 0 if it is even. The 1-LSTM networks are trained on input strings that have from k = 20 to k = 250 bits, in increments of 10.
Figure notes: [top] the left plot is the 1-LSTM with 500 hidden units, the right plot the 1-LSTM with 1500 hidden units; the x-axis is the number of bits k, the y-axis the number of layers, and each dot marks 100% classification accuracy on a sample of 100 k-bit strings. [bottom] The bottom-left table relates the number of layers to the number of hidden units; the bottom-right plot visualizes the memory vector values inside a feed-forward 1-LSTM with 25 layers trained on 50-bit strings, here on an input of 10 zeros followed by 40 ones.
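A hypothetical data generator for the parity task, matching the description above; the function name and sampling details are assumptions.

```python
# Hypothetical data generator for the parity task: a random k-bit string
# and its parity (1 if the number of ones is odd, else 0).
def parity_example(k, rng):
    bits = rng.integers(0, 2, size=k)
    target = int(bits.sum() % 2)
    return bits, target

rng = np.random.default_rng(1)
x, y = parity_example(250, rng)
```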
Experiments - Addition
We next experiment with 2-LSTM networks on learning to sum two 15-digit integers, and compare their performance with that of standard Stacked LSTM. We train the two types of networks with either tied or untied weights, with 400 hidden units each and with between 1 and 50 layers. We train with stochastic gradient descent using minibatches of size 15 and the Adam optimizer with a learning rate of 0.001.
Notes: the 2-LSTM learns to add two 15-digit numbers while receiving only one digit at a time and predicting the result one digit at a time. No curriculum learning is used, and partial predictions are not fed back into the network, which forces the network to remember its partial results. Comparing tied and untied weights, the best result is a tied 2-LSTM with 18 layers and only about 550K parameters.
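A hypothetical generator for the addition task; how the digits of the two operands are actually presented to the network is not specified in the slide, so the sequential digit-by-digit layout below is only an illustrative assumption.

```python
# Hypothetical generator for the addition task: two 15-digit integers and
# the digits of their sum; the digit-by-digit presentation is an assumption.
def addition_example(n_digits, rng):
    a = int(rng.integers(10 ** (n_digits - 1), 10 ** n_digits))
    b = int(rng.integers(10 ** (n_digits - 1), 10 ** n_digits))
    inputs = [int(c) for c in str(a)] + [int(c) for c in str(b)]
    target = [int(c) for c in str(a + b)]
    return inputs, target

inputs, target = addition_example(15, np.random.default_rng(2))
```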
Experiments - Memorization
We analyze the performance of 2-LSTM networks on the task of memorizing a random sequence of symbols. We train 2-LSTM and Stacked LSTM networks with either tied or untied weights, each for up to 5 million samples or until they reach 100% accuracy on 100 unseen samples. The small number of hidden units contributes to making the training of the networks difficult, but we see that tied 2-LSTM networks are the most successful and learn to solve the task with the smallest number of samples.
Notes: similar to the previous experiment. The sequence is 20 symbols long, drawn from a vocabulary of 64 symbols and encoded one-hot; the network receives one symbol per step. All networks have 100 hidden units and 50 layers, and as before no curriculum learning is used. The figure shows the networks' performance: the small number of hidden units makes training harder, and the tied 2-LSTM performs best, reaching good accuracy from few samples.
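A hypothetical generator for the memorization task, following the note above (20 symbols from a 64-symbol vocabulary, one-hot encoded); names and layout are assumptions.

```python
# Hypothetical generator for the memorization task: 20 symbols drawn from a
# 64-symbol vocabulary, fed one-hot one symbol per step; the target is the
# same sequence of symbol ids.
def memorization_example(length=20, vocab=64, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    ids = rng.integers(0, vocab, size=length)
    one_hot = np.eye(vocab)[ids]              # shape (length, vocab)
    return one_hot, ids
```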
Experiments – Character-Level Language Modeling
We next test the 2-LSTM network on the Hutter challenge Wikipedia dataset. The aim is to successively predict the next character in the corpus. The dataset has 100 million characters, of which the last 5 million are used for testing. As usual, the objective is to minimize the negative log-likelihood of the character sequence under the model; performance is reported in bits-per-character.
Notes: the tied 2-LSTM has 6 layers and 1000 hidden units in total.
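For reference, bits-per-character is simply the average negative log-likelihood converted from nats to bits; the helper below is a generic conversion, not code from the paper.

```python
# Generic conversion from total negative log-likelihood (in nats) to
# bits-per-character: bpc = NLL / (n_chars * ln 2). Not code from the paper.
def bits_per_character(total_nll_nats, n_chars):
    return total_nll_nats / (n_chars * np.log(2.0))
```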
Experiments – Translation
In the neural approach to machine translation, one trains a neural network end-to-end to map the source sentence to the target sentence. The mapping is usually performed within the encoder-decoder framework: a neural network, which can be convolutional or recurrent, first encodes the source sentence, and the computed representation of the source then conditions a recurrent neural network that generates the target sentence.
Experiments – Translation
We use Grid LSTM to view translation in a novel fashion as a two-dimensional mapping.
Experiments – MNIST Digit Recognition
In our last experiment we apply a 3-LSTM network to images. We consider non-overlapping patches of pixels in an image as forming a two-dimensional grid of inputs, and the 3-LSTM performs computations with LSTM cells along three different dimensions.
Figure notes: [left] the 3D Grid LSTM network; each patch is projected to form the input hidden and cell vectors, no sub-sampling or pooling is used, and the arrows across the spatial dimensions indicate the order of the computation. [right] Apart from the Grid LSTM and Visin et al., the other entries in the comparison are CNNs; Visin et al. stack many layers of unidirectional RNNs. In this model the Grid LSTM has no cells along the depth dimension and uses ReLU there instead.
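A hypothetical helper for splitting an MNIST image into the non-overlapping pixel patches that form the two-dimensional input grid; the patch size p and the array layout are assumptions.

```python
# Hypothetical helper: split a 28x28 MNIST image into non-overlapping p x p
# patches, flattened so each grid position holds a p*p input vector.
def to_patches(image, p):
    h, w = image.shape
    patches = image.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(h // p, w // p, p * p)   # (rows, cols, p*p)
```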
Conclusion We have introduced Grid LSTM, a network that uses LSTM cells along all of the dimensions and modulates in a novel fashion the multi-way interaction. We have seen the advantages of the cells compared to regular connections in solving tasks such as parity, addition and memorization. We have described powerful and flexible ways of applying the model to character prediction, machine translation and image classification, showing strong performance across the board.