Neural Machine Translation using CNN


Neural Machine Translation using CNN - Shashank Rajput

Introduction Neural machine translation has recently caught worldwide attention, with Google claiming to have achieved almost human-level translation accuracy [11]. A team from Facebook claims to have surpassed Google's accuracy by using CNNs for machine translation [1]. The encoder-decoder neural machine translation architecture comprises two units: Encoder: processes the source sentence and stores its understanding in some vector(s). Decoder: uses the vector(s) produced by the encoder to output the translated sentence word by word (one word at a time).
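As a concrete illustration of the two units, here is a minimal recurrent encoder-decoder sketch in PyTorch; the class names, GRU units, and dimensions are our own illustrative choices, not the architecture of [1] or [3]. The CNN variant is discussed on the next slide.

```python
# Minimal encoder-decoder sketch (illustrative sizes, not the exact models in [1] or [3]).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) -> per-word outputs plus a summary state
        embedded = self.embed(src_tokens)
        outputs, hidden = self.rnn(embedded)
        return outputs, hidden          # hidden acts as the "understanding" vector(s)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, hidden):
        # Emits one distribution over target words per step, conditioned on the encoder state.
        embedded = self.embed(prev_tokens)
        outputs, hidden = self.rnn(embedded, hidden)
        return self.out(outputs), hidden
```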

CNN-based architecture Works similarly to a CNN on images: the sentence is treated as a 1D image. It is an order of magnitude faster than an RNN for this task because of parallelization, and it has achieved (and surpassed) RNN scores in translation. See [12] for an explanation with animations.
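A minimal sketch of this idea in PyTorch, assuming a plain stack of 1D convolutions over the embedded word positions; the real model in [1] additionally uses gated linear units, residual connections, and attention, which are omitted here.

```python
# Sketch: convolutions over word positions, treating the sentence as a 1D "image".
# Hyperparameters are illustrative; the model in [1] adds GLUs, residuals, and attention.
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, channels=512, kernel_size=3, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        layers = [nn.Conv1d(emb_dim, channels, kernel_size, padding=kernel_size // 2)]
        for _ in range(num_layers - 1):
            layers.append(nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2))
        self.convs = nn.ModuleList(layers)

    def forward(self, src_tokens):
        # (batch, src_len) -> (batch, channels, src_len); all positions computed in parallel,
        # unlike an RNN, which must process the sentence left to right.
        x = self.embed(src_tokens).transpose(1, 2)   # channels-first layout for Conv1d
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x
```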

Proposal Can we vary the number of encoder layers at run time? That is, can we decide at run time how much processing an input needs? It turns out that varying the number of encoder layers does not affect the result much [1]. But that led to an intriguing question: can we learn the relative importance of each layer's output among the outputs of all the layers? This is similar to a committee machine; we did not find any existing work that does something similar. A gated formulation of the idea is written out below.
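In equation form (the notation is ours, for illustration): if $h_l$ denotes the output of encoder layer $l$ and $g_l(x)$ a gating weight computed from the input sentence $x$, the combined encoder output would be

$$
y \;=\; \sum_{l=1}^{L} g_l(x)\, h_l,
\qquad g_l(x) \ge 0, \quad \sum_{l=1}^{L} g_l(x) = 1 .
$$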

Challenges The number of words varies, so the output size of each layer is variable. How do we provide a fixed-size input to the gating network when the output size of each layer is variable? We tried mean, max, and sum pooling across the words; sum worked best. A more detailed analysis is still needed of how to summarize the state of a layer and why the sum worked. Positional embeddings were added for each layer so that the gating network knows the index of the layer it is looking at. Random initialization of the gating weights helped a lot (see next slide). A sketch of this wiring follows below.
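A minimal sketch of such a gating network, assuming a single linear scorer over the pooled layer states; the shapes, the softmax combination, and the class name LayerGate are illustrative assumptions rather than the exact code used in these experiments.

```python
# Sketch of the gating network described above (shapes and sizes are assumptions).
# Each layer's output is summed across words to get a fixed-size vector, a learned
# layer-index embedding is added, and a linear scorer turns the states into softmax weights.
import torch
import torch.nn as nn

class LayerGate(nn.Module):
    def __init__(self, num_layers, channels):
        super().__init__()
        self.layer_pos = nn.Embedding(num_layers, channels)   # positional embedding per layer
        self.score = nn.Linear(channels, 1)

    def forward(self, layer_outputs):
        # layer_outputs: list of num_layers tensors, each (batch, channels, src_len)
        scores = []
        for idx, h in enumerate(layer_outputs):
            pooled = h.sum(dim=2)                              # sum across words -> (batch, channels)
            pooled = pooled + self.layer_pos.weight[idx]       # tell the gate which layer this is
            scores.append(self.score(pooled))                  # (batch, 1)
        weights = torch.softmax(torch.cat(scores, dim=1), dim=1)   # (batch, num_layers)
        # Weighted sum of layer outputs; the weights are computed per input sentence.
        stacked = torch.stack(layer_outputs, dim=1)            # (batch, num_layers, channels, src_len)
        return (weights[:, :, None, None] * stacked).sum(dim=1)
```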

Results The slide compares three validation-error curves: (1) gating network weights initialized to 0, which got stuck in a local minimum (it gave most of the weight to the first layer only); (2) gating network weights initialized randomly; and (3) the original network (state of the art). Although there is not much difference in error rates, our model did achieve the same performance. With some tuning and experimenting, we might improve upon the baseline (which is currently state of the art).

Analysis and future work Relation to ResNet: in a ResNet, the output of each layer is added into the final output with fixed, input-independent weights. In our case the output of each layer is also weighted and added into the final output, but the weights are computed dynamically, depending on the input and the layer outputs. The network did compute different weights for different input sentences, though it is difficult to interpret what they mean. Try on other networks: we could not find similar work for general neural networks. We could apply this approach in place of ResNet-style skip connections and compare the results. The contrast is written out below.
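In the notation introduced earlier, the contrast can be written loosely (treating the residual path as an additive combination of layer outputs) as

$$
y_{\text{resnet}} \;=\; \sum_{l=1}^{L} h_l
\qquad\text{vs.}\qquad
y_{\text{ours}} \;=\; \sum_{l=1}^{L} g_l(x)\, h_l ,
$$

where the first combination uses fixed, input-independent weights and the second computes $g_l(x)$ from the input sentence.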

References
[1] Convolutional Sequence to Sequence Learning - https://arxiv.org/pdf/1705.03122.pdf
[2] Adaptive Computation Time in RNNs - https://arxiv.org/pdf/1603.08983.pdf
[3] Current state-of-the-art RNN NMT - https://arxiv.org/pdf/1609.08144.pdf
[4] Attention in RNN encoders - https://arxiv.org/pdf/1409.0473.pdf
[5] Attention walks on images - http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf (some follow-up papers on images have also been published)
[6] Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation - http://arxiv.org/abs/1606.04199
[7] Addressing the Rare Word Problem in Neural Machine Translation - https://arxiv.org/pdf/1410.8206.pdf
[8] On Using Very Large Target Vocabulary for Neural Machine Translation - http://www.aclweb.org/anthology/P15-1001
[9] BLEU metric - https://en.wikipedia.org/wiki/BLEU
[10] WMT site - http://www.statmt.org/wmt14/translation-task.html
[11] Google blog - https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
[12] Source code for convolutional sequence to sequence learning - https://github.com/facebookresearch/fairseq