Wu et al., arXiv, Sept. 2016. Presenter: Lütfi Kerem Şenel

Presentation transcript:

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Wu et al., arXiv, Sept. 2016. Presenter: Lütfi Kerem Şenel

Outline
Introduction and Related Works
Model Architecture and Implementation Details
Residual Connections and Bi-directional Encoder for First Layer
Model Parallelism
Segmentation
Training
Quantizable Model and Quantized Inference
Decoder
Experiments and Results
Conclusion

Introduction and Related Works
Statistical Machine Translation (SMT)
Word-based Machine Translation
Phrase-based Machine Translation (PBMT)
Neural Machine Translation (NMT)

Model Architecture
To achieve good accuracy, both the encoder and decoder RNNs have to be deep enough to capture subtle irregularities in the source and target languages.
8-layer LSTMs are used in both the encoder and the decoder.

Model Architecture of GNMT

Model Architecture – Attention Model
The AttentionFunction is a one-hidden-layer feed-forward neural network.
Similar to Bahdanau et al., ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate".
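A minimal sketch of such a one-hidden-layer additive attention function, assuming PyTorch; the dimension names (enc_dim, dec_dim, attn_dim) are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """One-hidden-layer feed-forward AttentionFunction (Bahdanau-style)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim), previous decoder state
        # enc_outputs: (batch, src_len, enc_dim), encoder outputs
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)        # attention probabilities over source positions
        context = (weights * enc_outputs).sum(dim=1)  # context vector fed to the decoder
        return context, weights.squeeze(-1)
```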

Model Architecture – Residual Connections
(Figure: simple stacked LSTM layers vs. stacked LSTM layers with residual connections.)
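A minimal sketch, assuming PyTorch, of a stacked LSTM in which each layer's input is added to its output before being fed to the next layer; which layers get residual connections is an illustrative choice here, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    def __init__(self, hidden, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(num_layers)])

    def forward(self, x):
        # x: (batch, time, hidden)
        for i, lstm in enumerate(self.layers):
            out, _ = lstm(x)
            # residual connection: add the layer's input to its output
            # (skipped for the first layer in this sketch)
            x = out + x if i > 0 else out
        return x
```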

Model Architecture – Bi-directional Encoder for First Layer
Often the information needed to translate a word lies to its left, on both the source and the target side.
Depending on the language pair, however, the information for a particular output word can be distributed over, or even split across, certain regions of the input side; in general it can appear anywhere on the source side.
A bi-directional layer captures this context, but it is used only in the first encoder layer so that the remaining layers can be maximally parallelized.
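A minimal sketch, assuming PyTorch, of an encoder whose bottom layer is bi-directional while all higher layers stay uni-directional; the linear projection that folds the concatenated forward/backward outputs back to the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, emb_dim, hidden, num_layers=8):
        super().__init__()
        # only the first layer reads the source in both directions
        self.bi_lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)   # fold fwd/bwd outputs back to `hidden`
        self.uni_layers = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(num_layers - 1)])

    def forward(self, x):
        out, _ = self.bi_lstm(x)          # (batch, time, 2 * hidden)
        out = self.proj(out)
        for lstm in self.uni_layers:      # uni-directional upper layers with residuals
            residual = out
            out, _ = lstm(out)
            out = out + residual
        return out
```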

Model Architecture – Model Parallelism
Both model and data parallelism are used to speed up training.
Data Parallelism (Downpour SGD, Dean et al. 2012): the training data is divided into n (= 10) subsets and a copy of the model is run on each of them; each replica asynchronously updates the shared parameters using a combination of Adam and SGD.
Model Parallelism: different LSTM layers run on different GPUs, which requires the layers to be uni-directional.
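A minimal sketch, assuming PyTorch and (optionally) several visible GPUs, of layer-wise model parallelism: each LSTM layer lives on its own device and activations are moved between devices as they flow up the stack. This is only an illustration of the idea, not the production setup.

```python
import torch
import torch.nn as nn

hidden = 1024
num_gpus = torch.cuda.device_count()
layers = []
for i in range(8):
    # place each layer on its own GPU when available, otherwise fall back to CPU
    device = torch.device(f"cuda:{i % num_gpus}") if num_gpus > 0 else torch.device("cpu")
    layers.append(nn.LSTM(hidden, hidden, batch_first=True).to(device))

def encode(x):
    # ship activations to each layer's device as they flow up the stack
    for lstm in layers:
        x = x.to(next(lstm.parameters()).device)
        x, _ = lstm(x)
    return x
```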

Segmentation Approaches
NMT models often operate with fixed word vocabularies, while translation is fundamentally an open-vocabulary problem (names, numbers, dates, etc.), which leads to the out-of-vocabulary (OOV) word problem.
Two common remedies:
Use sub-word units such as characters, mixed word/characters, or more intelligent sub-words.
Copy rare words from source to target, since most rare words are names or numbers whose correct translation is just a copy; this can be based on the attention model or on a more complex special-purpose pointer network.

Segmentation Approaches - Wordpiece Model
The wordpiece model was initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system.
It is completely data-driven and guaranteed to produce a deterministic segmentation.
Words are first broken into wordpieces using a trained wordpiece model. Special word-boundary symbols are added before training so that the original word sequence can be recovered from the wordpiece sequence without ambiguity.
At decoding time, the model first produces a wordpiece sequence, which is then converted into the corresponding word sequence.
A shared wordpiece model is used for both the source and target language, so the same string in a source and target sentence is segmented in exactly the same way, making it easier for the system to learn to copy these tokens.
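A minimal sketch of applying a trained wordpiece vocabulary by greedy longest-match segmentation; the toy vocabulary and the "_" word-boundary marker are illustrative assumptions, not the exact conventions or training procedure of the paper.

```python
def wordpiece_segment(word, vocab):
    """Split a word into the longest wordpieces found in `vocab`."""
    pieces, start = [], 0
    token = "_" + word            # "_" marks the word boundary so the original
    while start < len(token):     # word sequence can be recovered unambiguously
        end = len(token)
        while end > start and token[start:end] not in vocab:
            end -= 1
        if end == start:          # no piece matches: fall back to a single character
            pieces.append(token[start])
            start += 1
        else:
            pieces.append(token[start:end])
            start = end
    return pieces

vocab = {"_J", "et", "_mak", "ers", "_fe", "ud", "_over", "_seat", "_width"}
print(wordpiece_segment("Jet", vocab))     # ['_J', 'et']
print(wordpiece_segment("makers", vocab))  # ['_mak', 'ers']
```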

Segmentation Approaches - Mixed Word/Character Model
A fixed-size word vocabulary is used.
OOV words are not collapsed into a single UNK symbol as in conventional models, but converted into the sequence of their constituent characters.
Special prefixes (<B>, <M>, <E>) mark the position of each character in the word and distinguish these characters from normal in-vocabulary characters.
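A minimal sketch of this mixed word/character scheme: in-vocabulary words pass through unchanged, while each OOV word is spelled out as characters tagged with <B>/<M>/<E> prefixes. The toy vocabulary is an assumption for illustration.

```python
def mixed_word_char(tokens, vocab):
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)                       # in-vocabulary word kept as-is
        else:
            for i, ch in enumerate(tok):          # OOV word spelled out character by character
                if i == 0:
                    out.append("<B>" + ch)        # beginning of the word
                elif i == len(tok) - 1:
                    out.append("<E>" + ch)        # end of the word
                else:
                    out.append("<M>" + ch)        # middle of the word
    return out

vocab = {"the", "translation", "of"}
print(mixed_word_char(["the", "translation", "of", "Miki"], vocab))
# ['the', 'translation', 'of', '<B>M', '<M>i', '<M>k', '<E>i']
```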

Training Criteria
Given a dataset of parallel sentence pairs $\mathcal{D} = \{(X^{(i)}, Y^{*(i)})\}_{i=1}^{N}$, maximum-likelihood (ML) training aims at maximizing
$\mathcal{O}_{\mathrm{ML}}(\theta) = \sum_{i=1}^{N} \log P_\theta\!\left(Y^{*(i)} \mid X^{(i)}\right)$.
This objective does not reflect the task reward as measured by the BLEU score and does not explicitly encourage a ranking among incorrect output sequences; it is therefore not robust to errors made during decoding.
New (RL-refinement) objective, where $r(Y, Y^{*(i)})$ is a per-sentence score (the GLEU score in the paper):
$\mathcal{O}_{\mathrm{RL}}(\theta) = \sum_{i=1}^{N} \sum_{Y} P_\theta\!\left(Y \mid X^{(i)}\right) r\!\left(Y, Y^{*(i)}\right)$.
In practice a linear combination of the two is used:
$\mathcal{O}_{\mathrm{Mixed}}(\theta) = \alpha \, \mathcal{O}_{\mathrm{ML}}(\theta) + \mathcal{O}_{\mathrm{RL}}(\theta)$.

Quantizable Model and Quantized Inference
NMT is computationally intensive, which leads to high inference latency.
To make the model quantizable, the accumulator values $c_t^l$ and the layer outputs $x_t^l$ are forced to stay within a fixed range $[-\delta, \delta]$ during training.
The forward computation therefore adds clipping, e.g. $c_t^{l\,\prime} = \max\!\left(-\delta, \min\!\left(\delta, c_t^l\right)\right)$, and the LSTM's internal gating logic is modified accordingly.

Quantizable Model and Quantized Inference
At inference time, floating-point operations are replaced with fixed-point integer operations (8- or 16-bit resolution).
Each weight matrix W is represented with a float scale vector and 16-bit integers within a fixed range.
The softmax layer uses an 8-bit integer matrix.
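A minimal sketch of the general idea: clip accumulator values into a fixed range, and represent a weight matrix by a per-row float scale plus a small-integer matrix. The row-wise scaling and the bit width chosen here are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import numpy as np

def clip_accumulator(x, delta=1.0):
    # keep accumulator values within [-delta, delta], as in the clipping above
    return np.clip(x, -delta, delta)

def quantize_rowwise(W, bits=8):
    # represent W by a per-row float scale and small integers
    max_int = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    W_q = np.round(W / scale * max_int).astype(np.int32)
    return scale, W_q

def dequantize(scale, W_q, bits=8):
    max_int = 2 ** (bits - 1) - 1
    return W_q.astype(np.float32) * scale / max_int

W = np.random.randn(4, 6).astype(np.float32)
scale, W_q = quantize_rowwise(W, bits=8)
print(np.abs(W - dequantize(scale, W_q)).max())  # small reconstruction error
```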

Quantizable Model and Quantized Inference
New hardware: the Tensor Processing Unit (TPU).
Decoding speed on the WMT En→Fr dataset: quantized operations are used only on the TPU.

Decoder
Beam search with the scoring function
$s(Y, X) = \log P(Y \mid X) / lp(Y) + cp(X; Y)$,
where lp is a length-normalization term and cp a coverage penalty:
$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \quad cp(X; Y) = \beta \sum_{i=1}^{|X|} \log\!\left(\min\!\left(\sum_{j=1}^{|Y|} p_{i,j},\, 1.0\right)\right)$,
and $p_{i,j}$ is the attention probability of the $j$-th target word on the $i$-th source word. Setting $\alpha = \beta = 0$ gives pure beam search.
During beam search, pruning is used to speed up the search by 30% to 40%.
(The slide compares the tuned $\alpha$, $\beta$ values for ML training without and with RL refinement.)
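A minimal sketch of this scoring function; the alpha and beta values in the usage example are illustrative, not the tuned values from the paper.

```python
import math

def lp(length, alpha):
    # length-normalization term
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def cp(attention, beta):
    # coverage penalty; attention[i][j] is the attention probability of the
    # j-th target word on the i-th source word
    return beta * sum(math.log(min(sum(row), 1.0)) for row in attention)

def score(log_prob, target_len, attention, alpha, beta):
    # alpha = beta = 0 recovers plain log-probability beam search
    return log_prob / lp(target_len, alpha) + cp(attention, beta)

# toy usage with made-up numbers: 2 source words, 3 target words
attn = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
print(score(-4.2, 3, attn, alpha=0.2, beta=0.2))
```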

Experiments and Results – Datasets and Evaluation
Datasets: WMT En→Fr (36M sentence pairs) and WMT En→De (5M sentence pairs) for training, newstest2014 for testing, newstest2012 and newstest2013 for development. In addition, English→French, English→Spanish, and English→Chinese pairs from Google-internal datasets.
Evaluation: BLEU scores and side-by-side (SxS) evaluations in which human translators give scores in the range 0 to 6.

Experiments and Results – Training Details
12 replicas run on separate machines, updating shared parameters asynchronously.
Parameter initialization: uniform in [-0.04, 0.04].
Gradient clipping: gradient norms are clipped to 5.
Training on the WMT En→Fr dataset took about 6 days using 96 NVIDIA K80 GPUs.
After the ML training converges, the model is switched to RL refinement and further optimized with the mixed ML+RL objective.
Dropout is used with probabilities 0.2 and 0.3, and only during ML training.

Experiments and Results – Training Details
Adam + simple SGD schedule: Adam (learning rate 0.0002) for the first 60k steps, then simple SGD (learning rate 0.5) to speed up training and improve convergence. The learning rate is reduced dynamically afterwards.
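A minimal sketch, assuming PyTorch, of this two-phase schedule combined with the gradient clipping from the previous slide; the model and loss are stand-ins, and only the step counts, learning rates, and clipping norm follow the slides.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for the real NMT model
adam = torch.optim.Adam(model.parameters(), lr=2e-4)
sgd = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(100):                         # toy loop; real training runs far longer
    optimizer = adam if step < 60_000 else sgd  # switch to plain SGD after 60k steps
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # clip gradient norm to 5
    optimizer.step()
```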

Experiments and Results – Evaluation after ML

Experiments and Results – Model Ensemble and Human Evaluation

Experiments and Results – Results on Production Data

Conclusion
On the public WMT'14 translation benchmarks, GNMT's translation quality approaches or surpasses all previously published results, and the system also works at production scale.
Key findings:
Wordpiece modeling effectively handles open vocabularies and the challenge of morphologically rich languages, benefiting both translation quality and inference speed.
A combination of model and data parallelism can be used to efficiently train state-of-the-art sequence-to-sequence NMT models in roughly a week.
Model quantization drastically accelerates translation inference, allowing the use of these large models in a deployed production environment.
Many additional details, such as length normalization and coverage penalties, are essential to making NMT systems work well on real data.