1
A unified extension of LSTM to deep networks
Grid LSTM: a unified extension of LSTM to deep networks. Today I will introduce a new architecture called Grid LSTM that extends the good properties of LSTM to deep computation in a unified way. Presented by Kaiwen Wu.
2
Roadmap: architecture (LSTM, highway network, Grid LSTM) and experiments (translation)
There are also other related network architectures. They are not explained in detail in the paper, and they will not be covered in this presentation, to avoid unnecessary confusion and to save time.
3
Recap: LSTM (input gate, forget gate, output gate, modulated input)
[Figure: LSTM cell along the time axis, showing the input, forget, and output gates, the modulated input, the memory update, and the new hidden state; the depth axis is shown for contrast.] Note: the I in "I * x" is a weight matrix, not the identity matrix. LSTM should be familiar by now, as it has come up many times already. At each step it computes three gates (input, forget, and output) and the modulated input. The memory is updated by "forgetting" part of the old memory and then adding new memory. The new hidden state, extracted from the updated memory, serves as the output. Note how LSTM tackles the vanishing gradient: no sigmoid-like function is applied when updating the memory, so the gradient does not vanish even when propagated over a long time. Another benefit along the time axis is that subsequent processing relies on the memory rather than on the raw input, which enables learnable routing. So far, however, a single LSTM cell does not compute along the depth axis.
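For reference, here is a standard formulation of the LSTM update described above (a common textbook form; the paper's own notation may differ slightly):

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{modulated input} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{memory update (no squashing of } c_{t-1}\text{)} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}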
4
Adding Depth to LSTM: a natural extension of LSTM is the stacked LSTM
The depth cannot be large, to prevent vanishing gradients; moreover, the stack loses the dynamic routing ability along the depth axis. [Figure: stacked LSTM, with the time axis horizontal and the depth axis vertical.] To make LSTM deeper, a natural solution is the stacked LSTM, where LSTM cells are simply stacked on top of one another. However, along the depth axis the gradient can vanish as the depth increases, which is why stacked LSTMs are typically shallow. Skip connections are known to help when training deep stacked LSTMs, but they do not add learnable routing to the depth axis.
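As a rough illustration (a minimal PyTorch sketch, not from the paper; the sizes are made up), a stacked LSTM is just several LSTM layers feeding into one another along depth, with gating only along the time axis:

```python
import torch
import torch.nn as nn

# A 3-layer stacked LSTM: each layer's hidden sequence is the next layer's input.
# Gating (and hence learnable routing) exists only along the time axis; along
# the depth axis the layers are connected by plain feedforward-style links.
stacked = nn.LSTM(input_size=64, hidden_size=128, num_layers=3)

x = torch.randn(20, 8, 64)          # (time steps, batch, features)
outputs, (h_n, c_n) = stacked(x)    # outputs: (20, 8, 128), from the top layer
```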
5
Highway network (approx.)
Applying LSTM-style gating along the depth axis rather than the time axis yields (approximately) the highway network. [Figure: gated connections along the depth axis; highway networks are trainable up to 900 layers.] The highway network puts LSTM-like gates along the depth axis, inheriting the benefits of LSTM along depth. This way, highway networks can be trained up to 900 layers without vanishing gradients. This work inspired the introduction of Grid LSTM. In fact, the image shown here is a 1D Grid LSTM, which is essentially a highway network. We will not look into highway networks in detail in this lecture.
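For reference, the highway layer (Srivastava et al.) gates each layer's transform against a carry path; this is the common simplified form, not taken from the Grid LSTM paper:

y = T(x) \odot H(x) + \bigl(1 - T(x)\bigr) \odot x, \qquad T(x) = \sigma(W_T x + b_T)

Here H is the layer's usual nonlinear transform and T is the transform gate; when T(x) is close to 0 the layer simply copies its input, so gradients flow through many layers.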
6
Grid LSTM: LSTM connections along any or all dimensions
Inputs can arrive along any or all dimensions, and each LSTM dimension maintains its own memory. Grid LSTM is defined as having LSTM connections along one or more axes of the computation, with each LSTM dimension maintaining its own memory; nothing more. The figure shows a 2D Grid LSTM, which is essentially a 1D Grid LSTM extended with LSTM connections along the time axis.
7
Grid LSTM update equations
The update rule is quite simple: the memory and hidden vector along each axis are updated by that axis's own LSTM transform.
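Concretely (following my recollection of the paper's notation, so details may differ slightly), an N-dimensional Grid LSTM block first concatenates the incoming hidden vectors and then applies a separate standard LSTM transform along each dimension:

H = \begin{bmatrix} h_1 \\ \vdots \\ h_N \end{bmatrix}, \qquad
(h'_i,\, m'_i) = \mathrm{LSTM}(H,\, m_i,\, W_i), \quad i = 1, \dots, N

Each dimension i has its own weights W_i and its own memory m_i, but all dimensions read the same concatenated hidden vector H.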
8
Flexibility: evaluation order of each dimension (priority dimensions)
Feedforward connections along some dimensions (non-LSTM dimensions); weight sharing. No details here; there is not much to explain. Basically, the user can flexibly adapt Grid LSTM to suit their needs. The use of Grid LSTM is very flexible. To name a few options: the evaluation of the axes can be independent, or some axes can depend on others, forming a partial order; the LSTM connections along some axes can be replaced by feedforward connections; and weights can be shared along other dimensions, just as they are along the time dimension. A rough code sketch of one Grid LSTM step follows below.
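The following is a minimal NumPy sketch of a single 2D Grid LSTM block (hypothetical helper names and shapes, not the paper's code), showing each dimension applying its own LSTM transform to the shared concatenated hidden vector:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_transform(H, m, W):
    """One standard LSTM transform: gates are computed from the concatenated
    hidden vector H and applied to this dimension's own memory m.
    W is a dict of weight matrices and biases for this dimension."""
    i = sigmoid(W["Wi"] @ H + W["bi"])   # input gate
    f = sigmoid(W["Wf"] @ H + W["bf"])   # forget gate
    o = sigmoid(W["Wo"] @ H + W["bo"])   # output gate
    g = np.tanh(W["Wg"] @ H + W["bg"])   # modulated input
    m_new = f * m + i * g                # memory update (no squashing of m)
    h_new = o * np.tanh(m_new)           # new hidden vector
    return h_new, m_new

def grid_lstm_block(h_time, m_time, h_depth, m_depth, W_time, W_depth):
    """One 2D Grid LSTM block: both dimensions read the same concatenated
    hidden vector, but each updates its own memory with its own weights."""
    H = np.concatenate([h_time, h_depth])
    h_time_new,  m_time_new  = lstm_transform(H, m_time,  W_time)
    h_depth_new, m_depth_new = lstm_transform(H, m_depth, W_depth)
    return (h_time_new, m_time_new), (h_depth_new, m_depth_new)

def random_weights(hidden, concat, seed):
    """Random weights for one dimension (illustration only)."""
    rng = np.random.default_rng(seed)
    W = {k: rng.normal(scale=0.1, size=(hidden, concat))
         for k in ("Wi", "Wf", "Wo", "Wg")}
    W.update({k: np.zeros(hidden) for k in ("bi", "bf", "bo", "bg")})
    return W

# Tiny usage example: hidden size 4 per dimension, zero initial states.
hidden = 4
W_time = random_weights(hidden, 2 * hidden, seed=0)
W_depth = random_weights(hidden, 2 * hidden, seed=1)
h_t = m_t = h_d = m_d = np.zeros(hidden)
(h_t, m_t), (h_d, m_d) = grid_lstm_block(h_t, m_t, h_d, m_d, W_time, W_depth)
```

Replacing `lstm_transform` with a plain feedforward layer along one axis gives a non-LSTM dimension, as described above.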
9
Experiment: Translation
Existing work: seq2seq (bottleneck problem) => attention. 3D Grid LSTM in translation: the paper gives quite a number of experiments to showcase the effectiveness of Grid LSTM. We will not go into their details, since most experiments are straightforward applications. I will briefly explain how the authors apply Grid LSTM to translation. Seq2seq models used to suffer from the bottleneck problem; the attention mechanism alleviates it and greatly improves translation quality. Here, the authors feed the source sentence along one side of the grid and the target sentence along another side. Essentially, the network learns to generate the target sentence while attending over the source sentence, with the same time complexity as soft attention.
10
Open-ended question: what other applications of Grid LSTM can you think of?