Neural Machine Translation using CNN
- Shashank Rajput
Introduction

Neural machine translation has recently caught worldwide attention, with Google claiming to have achieved almost human-level translation accuracy [11]. A team from Facebook claims to have surpassed Google's accuracy using CNNs for machine translation [1]. The encoder-decoder Neural Machine Translation architecture consists of two units:
- Encoder: processes the source sentence and stores its understanding in some vector(s).
- Decoder: uses the vector(s) generated by the encoder to output the translated sentence word by word (one word at a time).
A minimal sketch of this interface is shown below.
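The sketch below is only a toy illustration of the encoder-decoder split, not the actual model from [1] or [11]; the GRU layers, the dimensions, and the mean-pooled summary vector are assumptions chosen to keep the example small and runnable.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Processes the source sentence into a sequence of vectors."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, src_tokens):                     # (batch, src_len)
        states, _ = self.rnn(self.embed(src_tokens))
        return states                                  # (batch, src_len, dim)

class Decoder(nn.Module):
    """Uses the encoder's vectors to emit the target sentence word by word."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, prev_tokens, encoder_states):
        # Condition on the source by starting from a summary of the encoder
        # states (mean over positions) -- one simple, assumed choice.
        h0 = encoder_states.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        out, _ = self.rnn(self.embed(prev_tokens), h0)
        return self.out(out)                           # (batch, tgt_len, vocab)

enc, dec = Encoder(1000, 64), Decoder(1000, 64)
src = torch.randint(0, 1000, (2, 7))      # 2 source sentences, 7 tokens each
prev = torch.randint(0, 1000, (2, 5))     # previously generated target tokens
print(dec(prev, enc(src)).shape)          # torch.Size([2, 5, 1000])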
CNN based architecture
Works similarly to a CNN on images: the sentence is treated as a 1D image. CNNs are an order of magnitude faster than RNNs for this task because of parallelization, and they have matched (and surpassed) RNN scores in translation. See [12] for an explanation with animations.
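As a rough illustration of "sentence as a 1D image", here is a simplified convolutional encoder layer; the GLU gating and residual connection follow the spirit of [1], but the kernel size, dimensions, and omission of attention and position embeddings are simplifications of my own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoderLayer(nn.Module):
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # 2*dim output channels so the GLU gate halves them back to dim.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, dim, src_len) -- word embeddings are the "channels" of a
        # 1D image, so every position is processed in parallel.
        return x + F.glu(self.conv(x), dim=1)          # residual + gated conv

embeds = torch.randn(8, 512, 20)    # 8 sentences, 20 words each, embedding dim 512
print(ConvEncoderLayer(512)(embeds).shape)             # torch.Size([8, 512, 20])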
Proposal

Can we vary the number of layers in the encoder at run time? That is, can we decide at run time how much processing an input needs? It turns out that varying the number of encoder layers does not affect the result much [1]. But that led to an intriguing question: can we learn the relative importance of each layer's output among the outputs of all the layers? This is similar to a committee machine. We did not find any existing work that does something similar.
Challenges

- Sentences have a variable number of words, so the output size of each layer is variable. How do we provide a fixed-size input to the gating network when each layer's output size is variable?
- We tried mean, max, and sum across the words; sum worked best. A more detailed analysis is needed of how to summarize a layer's state and why the sum worked.
- Positional embeddings were added for each layer so that the gating network knows the index of the layer.
- Random initialization of the gating weights helped a lot (see next slide).
A minimal sketch of the gating network follows.
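This sketch is my reading of the slides rather than the author's released code: each layer's output is summed over words to form a fixed-size state, a learned per-layer positional embedding is added, and the gating network turns the states into softmax weights over the layers. The layer count, dimensions, and single linear scoring layer are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerGate(nn.Module):
    def __init__(self, num_layers, dim):
        super().__init__()
        self.layer_pos = nn.Embedding(num_layers, dim)    # layer-index embedding
        self.score = nn.Linear(dim, 1)                    # default (random) init;
        # zero-initialized gate weights got stuck favoring the first layer.

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, src_len, dim), one per encoder layer
        stacked = torch.stack(layer_outputs, dim=1)        # (batch, L, src_len, dim)
        states = stacked.sum(dim=2)                        # sum over words -> (batch, L, dim)
        idx = torch.arange(stacked.size(1), device=stacked.device)
        states = states + self.layer_pos(idx)              # tell the gate which layer is which
        weights = F.softmax(self.score(states), dim=1)     # (batch, L, 1) relative importance
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # weighted sum of layer outputs

gate = LayerGate(num_layers=4, dim=512)
outs = [torch.randn(8, 20, 512) for _ in range(4)]
print(gate(outs).shape)                                    # torch.Size([8, 20, 512])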
Results

The slide plots three validation-error curves:
- Validation error with the gating-network weights initialized to 0: the model got stuck in a local minimum (it gave most of the weight to the first layer only).
- Validation error with the gating-network weights initialized randomly.
- Validation error of the original network (state of the art).
Although there is not much difference in error rates, our model did achieve the same performance. With some tuning and experimenting, we might improve upon the baseline (which is currently state of the art).
Analysis and future work
Related to ResNet: in ResNet, the output of each layer is added to the final output through skip connections. In our case the output of each layer is also added to the final output, but the weights are computed dynamically, depending on the input and the layer outputs. The network did compute different weights for different input sentences, although it is difficult to see what they mean. Try on other networks: we could not find similar work for general neural networks. We can try applying this approach in place of ResNet-style combination and see the results. A tiny illustrative contrast is sketched below.
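The two helper functions below (hypothetical names, PyTorch assumed) only illustrate that contrast: the first combines layer outputs with a fixed, input-independent sum, the second mixes them with per-sentence weights such as those produced by the gating network above.

import torch

def resnet_style(layer_outputs):
    # static combination: every layer's output is simply summed into the result
    return torch.stack(layer_outputs, dim=1).sum(dim=1)

def gated_style(layer_outputs, weights):
    # dynamic combination: weights (batch, num_layers) come from a gating
    # network, so they differ from sentence to sentence
    stacked = torch.stack(layer_outputs, dim=1)            # (batch, L, len, dim)
    return (weights[:, :, None, None] * stacked).sum(dim=1)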
References

[1] Convolutional Sequence to Sequence Learning
[2] Adaptive computation time in RNNs
[3] Current state-of-the-art RNN NMT
[4] Attention in RNN encoders
[5] Attention walks on images (some follow-up papers on images also published)
[6] Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation
[7] Addressing the Rare Word Problem in Neural Machine Translation
[8] On Using Very Large Target Vocabulary for Neural Machine Translation
[9] BLEU metric
[10] WMT site
[11] Google blog
[12] Source code for convolutional sequence to sequence learning