
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi. Presented by Hammad Ayyubi, CSE 291G. We saw NMT with attention - it was SOTA at the time. But when working at Google's scale, new issues crop up, and this paper addresses them. This means two things: no new architecture is introduced - no need to scratch our heads to understand it - and we get super cool ideas from behind the scenes at Google.

Contents: Challenges for MT at scale, Related Work, Key Contributions of the paper, Main Ideas - details, Experiments and Results, Strengths of the paper, Possible extensions, Current state of MT.

Challenges for MT at scale Slow training and inference speeds - the bane of RNNs. The forward pass of an RNN scales with the length of the input sequence, and with multiple stacked layers it gets even worse.

Challenges for MT at scale Inability to handle rare words*: named entities - Barack Obama (English, German), Барак Обама (Russian); cognates and loanwords - claustrophobia (English), Klaustrophobie (German); morphologically complex words - solar system (English), Sonnensystem (Sonne + System) (German). Failure to translate all words in the source sentence - poor coverage. **Given a French sentence that was supposed to say "The US did not attack the EU! Nothing to fear," the translation we got was "The US attacked the EU! Fearless." Given that we are living in the Trump era, the intended message is relevant and reassuring - but lo and behold what the machine translated it to; that's just not right! *Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. **Has AI surpassed humans at translation? Skynet Today

Related Work Addressing rare words: Jean, S., Cho, K., Memisevic, R., and Bengio, Y. On using very large target vocabulary for neural machine translation. Costa-Jussà, M. R., and Fonollosa, J. A. R. Character-based neural machine translation. CoRR abs/1603.00810 (2016). Chung, J., Cho, K., and Bengio, Y. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147 (2016). Addressing incomplete coverage: Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. Coverage-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016). Ask about the shortcomings of each approach to rare words: the copy model has problems with transliteration (e.g. claustrophobia); character-level encoders and decoders lead to huge model sizes.

Key Contributions of the paper Uses a deeper/bigger network - "We need to go deeper." Addresses training and inference speed using a combination of architectural modifications, TPUs, and model quantization. Addresses the rare words issue using a wordpiece model (sub-word units). Addresses source sentence coverage using a modified beam search. Refines the training strategy with Reinforcement Learning. Continuing with the theme - we need to go deeper. Because you are Google - TPUs.

Main Ideas - Details

Encoder The RNN cells here are LSTMs. Only the bottom layer is bi-directional. Each layer is placed on a separate GPU (model parallelism), so layer (i+1) can start computation before layer i has finished. Can you guess why only the bottom layer is bi-directional? We will come to it in the next point. Interesting point - placing each layer on a separate GPU helps parallelism.
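As a rough illustration (not the paper's exact configuration), here is a GNMT-style encoder stack in PyTorch with a bi-directional bottom layer and uni-directional upper layers; the layer sizes are assumptions, and the residual connections discussed on the later slides are omitted:

```python
import torch
import torch.nn as nn

class GNMTStyleEncoder(nn.Module):
    """Sketch of a GNMT-style encoder: a bi-directional bottom LSTM layer
    followed by uni-directional layers. Sizes are illustrative only, and
    residual connections (see the residual-connections slide) are omitted."""
    def __init__(self, vocab_size=32000, emb_dim=1024, hidden=1024, num_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bottom layer is bi-directional; its forward/backward outputs are
        # concatenated, so the next layer sees 2 * hidden features.
        self.bottom = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.upper = nn.ModuleList(
            [nn.LSTM(2 * hidden if i == 0 else hidden, hidden, batch_first=True)
             for i in range(num_layers - 1)]
        )

    def forward(self, src_tokens):
        x, _ = self.bottom(self.embed(src_tokens))   # (batch, src_len, 2*hidden)
        for layer in self.upper:                     # uni-directional stack
            x, _ = layer(x)
        return x                                     # encoder states fed to attention
```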

Decoder Produces output y_i, which then goes through a softmax. Only the output of the bottom decoder layer is passed to the attention module. Can you guess why only the bottom layer's output is passed? To maintain parallelism.

Attention Module AttentionFunction: a feed-forward network with one hidden layer of 1024 nodes.
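For reference, the attention computation described in the paper, where y_{i-1} is the previous bottom-layer decoder output, x_t is the t-th encoder state, and M is the source length:

```latex
\begin{aligned}
s_t &= \mathrm{AttentionFunction}(\mathbf{y}_{i-1}, \mathbf{x}_t), \quad 1 \le t \le M \\
p_t &= \frac{\exp(s_t)}{\sum_{t'=1}^{M} \exp(s_{t'})} \\
\mathbf{a}_i &= \sum_{t=1}^{M} p_t \, \mathbf{x}_t
\end{aligned}
```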

Residual Connections There is no better way to lose the audience's attention than to introduce a bunch of math equations, so we will focus only on the important part: the skip (added) connections. They help with vanishing and exploding gradients.
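Roughly, the residual-stacking rule from the paper: the input fed to layer i+1 at step t is the layer-i LSTM output plus layer i's own input:

```latex
\begin{aligned}
\mathbf{c}_t^{i}, \mathbf{m}_t^{i} &= \mathrm{LSTM}_i\!\left(\mathbf{c}_{t-1}^{i}, \mathbf{m}_{t-1}^{i}, \mathbf{x}_t^{i-1}; \mathbf{W}^i\right) \\
\mathbf{x}_t^{i} &= \mathbf{m}_t^{i} + \mathbf{x}_t^{i-1} \\
\mathbf{c}_t^{i+1}, \mathbf{m}_t^{i+1} &= \mathrm{LSTM}_{i+1}\!\left(\mathbf{c}_{t-1}^{i+1}, \mathbf{m}_{t-1}^{i+1}, \mathbf{x}_t^{i}; \mathbf{W}^{i+1}\right)
\end{aligned}
```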

Model Input (Addressing Rare Words) Wordpiece model: breaks words into sub-words (wordpieces) using a trained wordpiece model. Word: Jet makers feud over seat width with big orders at stake. Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake. A '_' is added at the beginning of each word so that the word sequence can be recovered from the wordpieces. While decoding, the model produces a wordpiece sequence from which the corresponding sentence is recovered. We have a learned vocabulary (sub-word inventory), and we break every word into pieces from it. This resolves the rare word issue.
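A minimal greedy longest-match segmentation sketch in Python; the vocabulary below is a toy assumption, and the real wordpiece model chooses segmentations with a trained language model rather than longest-match:

```python
def wordpiece_segment(word, vocab, boundary="_"):
    """Greedy longest-match-first segmentation sketch (illustrative only; the
    actual wordpiece model scores segmentations with a trained language model)."""
    token = boundary + word   # '_' marks a word boundary so the sentence can be rebuilt
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start and token[start:end] not in vocab:
            end -= 1
        if end == start:            # no piece matches: fall back to a single character
            pieces.append(token[start])
            start += 1
        else:
            pieces.append(token[start:end])
            start = end
    return pieces

# Toy vocabulary, purely for illustration
vocab = {"_J", "et", "_fe", "ud", "_makers", "_over"}
print(wordpiece_segment("Jet", vocab))    # ['_J', 'et']
print(wordpiece_segment("feud", vocab))   # ['_fe', 'ud']
```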

Wordpiece Model Training: initialize the word unit inventory with the basic Unicode characters plus all ASCII characters. Build a language model over the word unit inventory. Generate candidate new word units by combining two units from the current inventory; add the candidate that increases the language-model likelihood the most. Continue growing the inventory until a pre-specified number of tokens D is reached. Inference: reduce the input to a sequence of characters, then traverse the inverse binary tree of merges to get the sub-words. Common practice: a number of optimizations - considering only "likely" candidate tokens, parallelization, etc. - and the same word inventory is shared between the encoder and decoder languages. For example, given the word "lower", convert it to the character sequence 'l', 'o', 'w', 'e', 'r', then traverse the inverse binary tree.
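A BPE-style sketch of the training loop; note that the actual wordpiece algorithm picks the merge that most increases language-model likelihood, whereas this sketch uses raw pair frequency as a simple stand-in:

```python
from collections import Counter

def learn_inventory(corpus_words, target_size):
    """BPE-style sketch of wordpiece training. The real algorithm chooses the
    merge with the largest language-model likelihood gain; pair frequency is
    used here only as a stand-in."""
    # Start from characters, with '_' marking the word boundary.
    words = [["_"] + list(w) for w in corpus_words]
    inventory = {u for w in words for u in w}
    merges = []
    while len(inventory) < target_size:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # stand-in for best likelihood gain
        merges.append((a, b))
        inventory.add(a + b)
        # Apply the chosen merge everywhere in the corpus.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(w[i]); i += 1
            new_words.append(out)
        words = new_words
    return inventory, merges
```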

Training Criteria Maximum likelihood training: maximize the probability of the correct sequence at each decoder step. Drawbacks: it doesn't reflect the actual reward function - the BLEU score - and it doesn't order candidate sentences by their BLEU scores (higher BLEU should get higher probability). Thus it isn't robust to erroneous sentences with low BLEU scores.
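The maximum-likelihood objective from the paper, summing the log-probability of the ground-truth output Y*(i) given the source X(i) over the N training pairs:

```latex
\mathcal{O}_{\mathrm{ML}}(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log P_{\boldsymbol{\theta}}\!\left(Y^{*(i)} \mid X^{(i)}\right)
```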

RL-Based Training The objective is the expected per-sentence score r(Y, Y*), computed over all candidate output sentences. BLEU is defined on corpus text, so per sentence the paper uses the GLEU score - the minimum of recall and precision over 1-, 2-, 3- and 4-grams. Stabilization: first train with the ML objective, then refine with the RL objective.
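The RL objective is the expected reward under the model distribution, and in practice the refinement stage optimizes a mixture of the two objectives with a small mixing weight alpha (0.017 in the paper):

```latex
\begin{aligned}
\mathcal{O}_{\mathrm{RL}}(\boldsymbol{\theta}) &= \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_{\boldsymbol{\theta}}\!\left(Y \mid X^{(i)}\right) r\!\left(Y, Y^{*(i)}\right) \\
\mathcal{O}_{\mathrm{Mixed}}(\boldsymbol{\theta}) &= \alpha \, \mathcal{O}_{\mathrm{ML}}(\boldsymbol{\theta}) + \mathcal{O}_{\mathrm{RL}}(\boldsymbol{\theta})
\end{aligned}
```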

Model Quantization Model quantization: replacing high-precision floating-point arithmetic with low-precision integer arithmetic (an approximation), e.g. for matrix operations. Challenge: the quantization (approximation) error gets amplified as you go deeper into the network. Solution: add extra model constraints during training - clip the values of the accumulators to a small range - which keeps the quantization error small.
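A minimal sketch of the training-time constraint, assuming PyTorch; the clipping range delta=1.0 is an illustrative choice, not necessarily the paper's value:

```python
import torch

def clip_accumulators(cell_state, layer_output, delta=1.0):
    """Clip LSTM accumulator values to [-delta, delta] during training so they
    remain representable in low-precision arithmetic at inference time
    (delta=1.0 here is illustrative)."""
    return (torch.clamp(cell_state, -delta, delta),
            torch.clamp(layer_output, -delta, delta))
```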

Model Quantization Question: if you are clipping values, would training be affected? Answer: empirically, no.

Model Quantization So, we have clipped the accumulators during training to enable model quantization. How do we do it during inference? Scale the weight matrices to 8-bit integers, then perform all matrix multiplications and additions using integer arithmetic.
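A sketch of what 8-bit inference looks like, assuming NumPy and simple symmetric per-matrix scaling; the production system's exact quantization scheme differs:

```python
import numpy as np

def quantize_weights(W):
    """Sketch of 8-bit weight quantization: scale a float matrix into int8 and
    keep the scale so results can be mapped back to float after the integer
    matmul. (Symmetric per-matrix scaling is an illustrative choice.)"""
    scale = max(np.abs(W).max() / 127.0, 1e-8)
    W_int8 = np.round(W / scale).astype(np.int8)
    return W_int8, scale

def int8_matmul(x_int8, W_int8, x_scale, w_scale):
    """Integer matmul with 32-bit accumulation, then rescale back to float."""
    acc = x_int8.astype(np.int32) @ W_int8.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)
```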

Model Quantization Results on CPU, GPU and TPU. Interesting question: why does the GPU take more time than the CPU?

Decoding - addressing coverage Use beam search to find the output sequence Y that maximizes a score function. Issues with vanilla beam search: it prefers shorter sentences, since the probability of a sentence keeps shrinking as more tokens are added, and it doesn't ensure coverage of the source sentence. Solution: length normalization and a coverage penalty, where p(i, j) is the attention probability of the j-th target word on the i-th source word; higher coverage gives a higher value of cp.
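Reconstructed from the paper, the beam-search scoring function with length normalization lp and coverage penalty cp, where alpha and beta are tuned hyperparameters and p_{i,j} is the attention probability above:

```latex
\begin{aligned}
s(Y, X) &= \frac{\log P(Y \mid X)}{lp(Y)} + cp(X; Y) \\
lp(Y) &= \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}} \\
cp(X; Y) &= \beta \sum_{i=1}^{|X|} \log\!\left(\min\!\left(\sum_{j=1}^{|Y|} p_{i,j},\; 1.0\right)\right)
\end{aligned}
```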

Decoding - Modified Beam Search WMT'14 En -> Fr BLEU scores. Larger values of alpha and beta increase the BLEU score by 1.1.

Experiments and Results Datasets: WMT En -> Fr, with a training set of 36M English-French sentence pairs; WMT En -> De, with a training set of 5M English-German sentence pairs; and the Google production dataset, which is two to three orders of magnitude larger than the WMT sets.

Experiments and Results The wordpiece model works best. The best score of 38.95 is achieved without using any external alignment model, as against 45.

Experiments and Results The wordpiece model gains 2 BLEU points over the word model and 4 BLEU points over the previous best.

Experiments and Results Evaluation of the RL-refined model. On En -> De, performance drops, as the gains from RL refinement overlap with those from the decoder optimizations.

Experiments and Results Evaluation of an ensemble of 8 models. The table below is for En -> De.

Experiments and Results Evaluation with side-by-side human evaluation on 500 samples from newstest2014. Question: why is the BLEU score higher but the side-by-side score lower for NMT after RL? Answer: the sample is small (500 sentences), a BLEU gain of 0.81 is too small for human raters to perceive, and there is a mismatch between BLEU as a metric and human judgments of translation quality.

Experiments and Results Evaluation on Google production data, using non-RL GNMT: an average improvement of 60% (reduction in translation errors relative to the phrase-based production system).

Strengths of the Paper Shows that deeper LSTMs with skip connections work better. The wordpiece model performs better and addresses the challenge of rare words. RL-refined training strategy. Model quantization to improve speed. Modified beam search - length normalization and a coverage penalty - improves performance.

Discussions/Possible Extensions The paper shows deeper LSTMs work better. Even though LSTM computation scales with the size of the input, Google can train these models quickly and iterate on experiments using many GPUs and TPUs. What about lesser mortals (non-Google, non-FB people) like us? Depth matters - agreed. Can we determine depth dynamically?

Current state of MT - a peek into the future Universal Transformers: Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser. Builds on the Transformer network, which attends over the whole sequence at once using self-attention. Tries to incorporate the inductive bias of RNNs by recurring over depth. Dynamic halting: predict a per-position halting probability at each step (over time).

Synthesis What is MT? What is BLEU? Attention. Google NMT. Universal Transformer.

Thank you! Questions?