1 Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Wu et al., arXiv, September 2016. Presenter: Lütfi Kerem Şenel

2 Outline
Introduction and Related Works
Model Architecture and Implementation Details
Residual Connections and Bi-directional Encoder for First Layer
Model Parallelism
Segmentation
Training
Quantizable Model and Quantized Inference
Decoder
Experiments and Results
Conclusion

3 Introduction and Related Works
Statistical Machine Translation (SMT)
Word-based Machine Translation
Phrase-based Machine Translation (PBMT)
Neural Machine Translation (NMT)

4 Model Architecture
To achieve good accuracy, both the encoder and decoder RNNs have to be deep enough to capture subtle irregularities in the source and target languages. 8-layer LSTMs are used in both the encoder and the decoder.

5 Model Architecture of GNMT

6 Model Architecture – Attention Model
The AttentionFunction is a one-hidden-layer feed-forward neural network, similar to Bahdanau et al., ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate".
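A minimal NumPy sketch of such a one-hidden-layer additive attention function (the weight names W_dec, W_enc and v are illustrative, not the paper's notation):

import numpy as np

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    # score_i = v . tanh(W_dec @ s_dec + W_enc @ h_i): one hidden layer with tanh
    scores = np.array([v @ np.tanh(W_dec @ decoder_state + W_enc @ h) for h in encoder_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                    # softmax over source positions
    context = (weights[:, None] * encoder_states).sum(axis=0)   # attention context vector
    return context, weights

# toy sizes: 4 source positions, hidden size 8, attention hidden size 6
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))
dec = rng.normal(size=8)
ctx, attn = additive_attention(dec, enc, rng.normal(size=(6, 8)),
                               rng.normal(size=(6, 8)), rng.normal(size=6))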

7 Model Architecture – Residual Connections
Comparison of simple stacked LSTM layers and stacked LSTM layers with residual connections.
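A minimal sketch of the difference, with a hand-rolled NumPy LSTM cell (layer count, sizes, and where the residual skips start are illustrative assumptions; the real GNMT stack has 8 layers):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    # one LSTM time step; W maps [x; h] to the four gates (input, forget, output, candidate)
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def stacked_lstm_step(x, states, weights, residual):
    # one time step through a stack of LSTM layers (input size == hidden size assumed)
    inp = x
    for k, W in enumerate(weights):
        h, c = lstm_step(inp, *states[k], W)
        states[k] = (h, c)
        # with residual connections, a layer's input is added to its output
        inp = h + inp if residual and k > 0 else h
    return inp

hidden, layers = 8, 4
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4 * hidden, 2 * hidden)) for _ in range(layers)]
states = [(np.zeros(hidden), np.zeros(hidden)) for _ in range(layers)]
out = stacked_lstm_step(rng.normal(size=hidden), states, weights, residual=True)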

8 Model Architecture – Bi-directional Encoder for First Layer
Information tends to flow left-to-right on both the source and the target side, but depending on the language pair the information needed for a particular output word can be distributed across, or even split up among, different regions of the input. In general, the information required to translate a given output word can appear anywhere on the source side.
The bi-directional layer is used only in the first encoder layer, so that the remaining layers can be parallelized as much as possible.
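A sketch of the idea, with a simple tanh RNN cell standing in for the LSTM cell (an assumption to keep the example short): only the first layer sees the sequence in both directions, and its forward and backward outputs are concatenated per position before being fed to the uni-directional layers above.

import numpy as np

def rnn_step(x, h, W):
    # simple tanh cell used as a stand-in for the LSTM cell
    return np.tanh(W @ np.concatenate([x, h]))

def bidirectional_first_layer(xs, W_fwd, W_bwd, hidden):
    h, fwd = np.zeros(hidden), []
    for x in xs:                      # left-to-right pass
        h = rnn_step(x, h, W_fwd)
        fwd.append(h)
    h, bwd = np.zeros(hidden), []
    for x in reversed(xs):            # right-to-left pass
        h = rnn_step(x, h, W_bwd)
        bwd.append(h)
    bwd.reverse()
    # concatenate the two directions at every source position
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.normal(size=8) for _ in range(5)]
W_f = rng.normal(scale=0.1, size=(8, 16))
W_b = rng.normal(scale=0.1, size=(8, 16))
outputs = bidirectional_first_layer(xs, W_f, W_b, hidden=8)   # 5 vectors of size 16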

9 Model Architecture – Model Parallelism
Model and data parallelism are both used to speed up training.
Data parallelism (Downpour SGD, Dean et al., 2012): the training data is divided into n (= 10) subsets and a copy of the model is run on each subset; each replica asynchronously updates the shared parameters, using a combination of Adam and SGD.
Model parallelism: different LSTM layers run on different GPUs, which requires the layers to be uni-directional.
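A toy, sequential simulation of the data-parallel (Downpour-style) update pattern: each replica reads the current shared parameters, computes a gradient on its own data shard, and pushes its update back without waiting for the others. This is only a conceptual sketch; the quadratic loss, shard contents, and learning rate are made up.

import numpy as np

def grad(theta, shard):
    # gradient of a toy squared-error loss (1/2)*||theta - x||^2 averaged over the shard
    return np.mean(theta - shard, axis=0)

def downpour_sgd(shards, theta, lr=0.1, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        k = rng.integers(len(shards))                  # whichever replica finishes a step next
        theta = theta - lr * grad(theta, shards[k])    # asynchronous (possibly stale) update
    return theta

rng = np.random.default_rng(1)
shards = [rng.normal(loc=3.0, size=(100, 4)) for _ in range(10)]   # n = 10 data subsets
theta = downpour_sgd(shards, theta=np.zeros(4))                    # converges near 3.0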

10 Segmentation Approaches
NMT models often operate with fixed word vocabularies, while translation is fundamentally an open-vocabulary problem (names, numbers, dates, etc.); this leads to the out-of-vocabulary (OOV) word problem.
Two common remedies:
Use sub-word units such as characters, mixed word/characters, or more intelligent sub-words.
Copy rare words from source to target (most rare words are names or numbers whose correct translation is just a copy), based on the attention model or on a more complex special-purpose pointing network.

11 Segmentation Approaches - Wordpiece Model
The wordpiece model was initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system. It is completely data-driven and guaranteed to generate a deterministic segmentation.
Words are first broken into wordpieces given a trained wordpiece model. Special word-boundary symbols are added before training so that the original word sequence can be recovered from the wordpiece sequence without ambiguity. At decoding time, the model first produces a wordpiece sequence, which is then converted into the corresponding word sequence.
A shared wordpiece model is used for both the source and the target language, so that the same string is segmented in exactly the same way on both sides, making it easier for the system to learn to copy such tokens.
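A minimal sketch of the round trip, using greedy longest-match segmentation against a tiny made-up vocabulary and "_" as the word-boundary marker (real wordpiece vocabularies are learned from data; this only illustrates the reversible segmentation):

def segment_word(word, vocab):
    # greedy longest-match over the word prefixed with the boundary marker "_"
    token, pieces, i = "_" + word, [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            raise ValueError("cannot segment: " + token[i:])
    return pieces

def words_from_pieces(pieces):
    # concatenate and split on the boundary marker to recover the original words
    return "".join(pieces).split("_")[1:]

vocab = {"_J", "et", "_makers", "_fe", "ud"}
pieces = segment_word("Jet", vocab) + segment_word("makers", vocab) + segment_word("feud", vocab)
print(pieces)                      # ['_J', 'et', '_makers', '_fe', 'ud']
print(words_from_pieces(pieces))   # ['Jet', 'makers', 'feud']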

12 Segmentation Approaches - Mixed Word/Character Model
A fixed-size word vocabulary is used, but OOV words are not collapsed into a single UNK symbol as in conventional models; instead, each OOV word is converted into the sequence of its constituent characters. Special prefixes (<B>, <M>, <E>) mark the position of each character in the word and distinguish these characters from normal in-vocabulary characters.
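A sketch of that conversion (the handling of single-character OOV words is an illustrative assumption; the prefixes match the slide):

def to_mixed(word, vocab):
    # in-vocabulary words pass through unchanged; OOV words become their character
    # sequence, with <B>/<M>/<E> prefixes marking begin / middle / end positions
    if word in vocab:
        return [word]
    if len(word) == 1:                       # illustrative choice for 1-char OOV words
        return ["<B>" + word]
    return (["<B>" + word[0]]
            + ["<M>" + c for c in word[1:-1]]
            + ["<E>" + word[-1]])

print(to_mixed("Miki", {"the", "cat", "sat"}))   # ['<B>M', '<M>i', '<M>k', '<E>i']
print(to_mixed("cat", {"the", "cat", "sat"}))    # ['cat']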

13 Training Criteria
Given a dataset of parallel sentence pairs D = {(X^(i), Y*^(i))}, i = 1..N, maximum-likelihood (ML) training aims at maximizing
O_ML(θ) = Σ_{i=1}^{N} log P_θ(Y*^(i) | X^(i)).
This objective does not reflect the task reward as measured by the BLEU score and does not explicitly encourage a ranking among incorrect output sequences; it is therefore not robust to errors made during decoding.
New objective: the expected reward under the model's output distribution,
O_RL(θ) = Σ_{i=1}^{N} Σ_{Y} P_θ(Y | X^(i)) r(Y, Y*^(i)).
In practice, a linear combination of the two is used:
O_Mixed(θ) = α · O_ML(θ) + O_RL(θ), with α = 0.017.
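A small numeric sketch of the mixed objective: the ML term is the log-probability of the reference translation, and the RL term (the expected reward) is approximated here by the mean reward of translations sampled from the model. The numbers and the reward values are made up; the paper optimizes this objective with gradient-based updates rather than evaluating it pointwise.

import numpy as np

def mixed_objective(logp_reference, sampled_rewards, alpha=0.017):
    # O_Mixed = alpha * O_ML + O_RL, with alpha = 0.017 as in the objective above
    o_ml = logp_reference                       # log P_theta(Y* | X) for the reference
    o_rl = float(np.mean(sampled_rewards))      # Monte-Carlo estimate of E_P[r(Y, Y*)]
    return alpha * o_ml + o_rl

# toy numbers: one sentence, 5 sampled translations with per-sentence rewards in [0, 1]
print(mixed_objective(logp_reference=-12.3, sampled_rewards=[0.41, 0.37, 0.52, 0.30, 0.45]))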

14 Quantizable Model and Quantized Inference
NMT is computationally intensive, which leads to high inference latency.
During training, the accumulator values c_t and the layer outputs x_t are forced to stay within [-δ, δ].
The forward computation and the LSTM's internal gating logic are rewritten with explicit clipping, so that the trained model tolerates the reduced precision used at inference time.

15 Quantizable Model and Quantized Inference
Then replace floating point operations with fixed-point integer operations (8 or 16 bit resolution) W is represented with: floating vector and bit integers in range Softmax layer: 8-bit integer matrix
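A sketch of both steps in NumPy: clipping values into [-δ, δ] during training, and representing a weight matrix with a per-row float scale plus 8-bit integers for inference. The per-row scaling follows the float-vector-plus-integers description above; the exact rounding details are assumptions.

import numpy as np

def clip(x, delta):
    # training-time constraint: keep accumulators c_t and outputs x_t inside [-delta, delta]
    return np.clip(x, -delta, delta)

def quantize_rows(W):
    # W is stored as one float scale per row plus 8-bit integers in [-127, 127]
    scale = np.abs(W).max(axis=1, keepdims=True)          # float scale vector
    WQ = np.round(W / scale * 127.0).astype(np.int8)      # 8-bit integer matrix
    return scale, WQ

def dequantize_rows(scale, WQ):
    return WQ.astype(np.float32) * scale / 127.0

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6)).astype(np.float32)
s, WQ = quantize_rows(W)
max_err = np.abs(W - dequantize_rows(s, WQ)).max()        # small quantization error
print(WQ.dtype, max_err)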

16 Quantizable Model and Quantized Inference
New hardware: the Tensor Processing Unit (TPU).
Decoding on the WMT En→Fr dataset: CPU, GPU, and TPU decoding times are compared; quantized operations are used only for the TPU.

17 Decoder
Beam search with the scoring function
s(Y, X) = log(P(Y | X)) / lp(Y) + cp(X; Y)
lp(Y) = (5 + |Y|)^α / (5 + 1)^α  (length normalization)
cp(X; Y) = β · Σ_{i=1}^{|X|} log(min(Σ_{j=1}^{|Y|} p_{i,j}, 1.0))  (coverage penalty)
where p_{i,j} is the attention probability of the j-th target word on the i-th source word. Setting α = 0 and β = 0 gives pure beam search.
During beam search, pruning is used to speed up the search by 30%-40%.
Different (α, β) values are chosen for the ML model without RL refinement and for the ML model with RL refinement.
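A direct transcription of the scoring function into NumPy (attn is the source-by-target matrix of attention probabilities p_ij; the α, β values in the example calls are just sample arguments):

import numpy as np

def length_penalty(y_len, alpha):
    # lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    return (5.0 + y_len) ** alpha / (5.0 + 1.0) ** alpha

def coverage_penalty(attn, beta):
    # cp(X;Y) = beta * sum_i log(min(sum_j p_ij, 1.0)); rows = source words, cols = target words
    return beta * np.sum(np.log(np.minimum(attn.sum(axis=1), 1.0)))

def score(log_p_y_given_x, y_len, attn, alpha, beta):
    # s(Y,X) = log P(Y|X) / lp(Y) + cp(X;Y); alpha = beta = 0 reduces to pure beam search
    return log_p_y_given_x / length_penalty(y_len, alpha) + coverage_penalty(attn, beta)

attn = np.array([[0.7, 0.2, 0.1],      # toy attention for 2 source and 3 target words
                 [0.2, 0.6, 0.3]])
print(score(log_p_y_given_x=-4.2, y_len=3, attn=attn, alpha=0.2, beta=0.2))
print(score(log_p_y_given_x=-4.2, y_len=3, attn=attn, alpha=0.0, beta=0.0))  # = -4.2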

18 Experiments and Results – Dataset and Evaluation
Datasets: WMT'14 English→French (36M sentence pairs) and WMT'14 English→German (5M sentence pairs) for training, newstest2014 for testing, and newstest2012 and newstest2013 for development. In addition, English↔French, English↔Spanish, and English↔Chinese pairs from Google-internal datasets are used.
Evaluation: BLEU scores and side-by-side (SxS) evaluations in which human translators give scores in the range 0 to 6.

19 Experiments and Results – Training Details
12 replicas run on separate machines, updating shared parameters asynchronously.
Parameter initialization: uniform between [-0.04, 0.04].
Gradient clipping: gradient norms are clipped to 5.
Training on the WMT En→Fr dataset takes about 6 days using 96 NVIDIA K80 GPUs.
After the ML training converges, the model is switched to RL refinement and optimized further.
Dropout with probabilities 0.2 and 0.3 is applied only during the ML training.

20 Experiments and Results – Training Details
Adam + simple SGD learning: Adam (learning rate 0.0002) is used for the first 60k steps, then simple SGD (learning rate 0.5) for speed-up and better convergence. The learning rate is dynamically reduced afterwards.
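A sketch of how such a schedule can be expressed. The 60k-step switch and the two learning rates come from the slide; the annealing parameters after the switch (start step and halving interval) are illustrative assumptions, not the paper's values.

def optimizer_and_lr(step, switch_step=60_000, adam_lr=0.0002,
                     sgd_lr=0.5, anneal_start=800_000, halve_every=200_000):
    # Adam with a small learning rate for the first 60k steps, then plain SGD;
    # after anneal_start the SGD rate is halved every halve_every steps (assumed schedule)
    if step < switch_step:
        return "adam", adam_lr
    lr = sgd_lr
    if step >= anneal_start:
        lr = sgd_lr * 0.5 ** (1 + (step - anneal_start) // halve_every)
    return "sgd", lr

for step in (0, 59_999, 60_000, 500_000, 900_000, 1_300_000):
    print(step, optimizer_and_lr(step))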

21 Experiments and Results – Evaluation after ML

22 Experiments and Results – Model Ensemble and Human Evaluation

23 Experiments and Results – Results on Production Data

24 Conclusion
On the public WMT'14 translation benchmarks, GNMT's translation quality approaches or surpasses all published results, and the approach also works at production scale.
Key findings:
Wordpiece modeling effectively handles open vocabularies and the challenge of morphologically rich languages, for both translation quality and inference speed.
A combination of model and data parallelism can be used to efficiently train state-of-the-art sequence-to-sequence NMT models in roughly a week.
Model quantization drastically accelerates translation inference, allowing the use of these large models in a deployed production environment.
Many additional details, such as length normalization and coverage penalties, are essential to making NMT systems work well on real data.

