Wu et al., arXiv, Sept. 2016. Presenter: Lütfi Kerem Şenel

Presentation transcript:

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Wu et al., arXiv, Sept. 2016. Presenter: Lütfi Kerem Şenel

Outline
Introduction and Related Works
Model Architecture and Implementation Details
Residual Connections and Bi-directional Encoder for First Layer
Model Parallelism
Segmentation
Training
Quantizable Model and Quantized Inference
Decoder
Experiments and Results
Conclusion

Introduction and Related Works
Statistical Machine Translation (SMT)
Word-based Machine Translation
Phrase-based Machine Translation (PBMT)
Neural Machine Translation (NMT)

Model Architecture
To achieve good accuracy, both the encoder and decoder RNNs have to be deep enough to capture subtle irregularities in the source and target languages.
8-layer LSTMs are used in both the encoder and the decoder.

Model Architecture of GNMT

Model Architecture – Attention Model
The AttentionFunction is a one-hidden-layer feed-forward neural network.
Similar to Bahdanau et al., ICLR 2015, "Neural Machine Translation by Jointly Learning to Align and Translate".
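A minimal sketch of such a one-hidden-layer additive attention function, assuming PyTorch; the dimension names (enc_dim, dec_dim, attn_dim) are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """One-hidden-layer feed-forward AttentionFunction (Bahdanau-style)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim), previous decoder state
        # enc_outputs: (batch, src_len, enc_dim), encoder outputs
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)        # attention probabilities over source positions
        context = (weights * enc_outputs).sum(dim=1)  # context vector fed to the decoder
        return context, weights.squeeze(-1)
```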

Model Architecture – Residual Connections
(Figure: simple stacked LSTM layers vs. stacked LSTM layers with residual connections.)
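A minimal sketch, assuming PyTorch, of a stacked LSTM in which each layer's input is added to its output before being fed to the next layer; which layers get residual connections is an illustrative choice here, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    def __init__(self, hidden, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(num_layers)])

    def forward(self, x):
        # x: (batch, time, hidden)
        for i, lstm in enumerate(self.layers):
            out, _ = lstm(x)
            # residual connection: add the layer's input to its output
            # (skipped for the first layer in this sketch)
            x = out + x if i > 0 else out
        return x
```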

Model Architecture – Bi-directional Encoder for First Layer
Often the information needed to translate a word lies to its left, on both the source and the target side.
Depending on the language pair, however, the information for a particular output word can be distributed over, or even split across, certain regions of the input side; in general it can appear anywhere on the source side.
A bi-directional layer captures this context, but it is used only in the first encoder layer so that the remaining layers can be maximally parallelized.
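A minimal sketch, assuming PyTorch, of an encoder whose bottom layer is bi-directional while all higher layers stay uni-directional; the linear projection that folds the concatenated forward/backward outputs back to the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, emb_dim, hidden, num_layers=8):
        super().__init__()
        # only the first layer reads the source in both directions
        self.bi_lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)   # fold fwd/bwd outputs back to `hidden`
        self.uni_layers = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(num_layers - 1)])

    def forward(self, x):
        out, _ = self.bi_lstm(x)          # (batch, time, 2 * hidden)
        out = self.proj(out)
        for lstm in self.uni_layers:      # uni-directional upper layers with residuals
            residual = out
            out, _ = lstm(out)
            out = out + residual
        return out
```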

Model Architecture – Model Parallelism
Both model and data parallelism are used to speed up training.
Data Parallelism (Downpour SGD, Dean et al. 2012): the training data is divided into n (= 10) subsets and a copy of the model is run on each of them; each replica asynchronously updates the shared parameters using a combination of Adam and SGD.
Model Parallelism: different LSTM layers run on different GPUs, which requires the layers to be uni-directional.
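A minimal sketch, assuming PyTorch and (optionally) several visible GPUs, of layer-wise model parallelism: each LSTM layer lives on its own device and activations are moved between devices as they flow up the stack. This is only an illustration of the idea, not the production setup.

```python
import torch
import torch.nn as nn

hidden = 1024
num_gpus = torch.cuda.device_count()
layers = []
for i in range(8):
    # place each layer on its own GPU when available, otherwise fall back to CPU
    device = torch.device(f"cuda:{i % num_gpus}") if num_gpus > 0 else torch.device("cpu")
    layers.append(nn.LSTM(hidden, hidden, batch_first=True).to(device))

def encode(x):
    # ship activations to each layer's device as they flow up the stack
    for lstm in layers:
        x = x.to(next(lstm.parameters()).device)
        x, _ = lstm(x)
    return x
```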

Segmentation Approaches
NMT models often operate with fixed word vocabularies, while translation is fundamentally an open-vocabulary problem (names, numbers, dates, etc.), which leads to the out-of-vocabulary (OOV) word problem.
Two common remedies:
Use sub-word units such as characters, mixed word/characters, or more intelligent sub-words.
Copy rare words from source to target, since most rare words are names or numbers whose correct translation is just a copy; this can be based on the attention model or on a more complex special-purpose pointer network.

Segmentation Approaches - Wordpiece Model
The wordpiece model was initially developed to solve a Japanese/Korean segmentation problem for the Google speech recognition system.
It is completely data-driven and guaranteed to produce a deterministic segmentation.
Words are first broken into wordpieces using a trained wordpiece model. Special word-boundary symbols are added before training so that the original word sequence can be recovered from the wordpiece sequence without ambiguity.
At decoding time, the model first produces a wordpiece sequence, which is then converted into the corresponding word sequence.
A shared wordpiece model is used for both the source and target language, so the same string in a source and target sentence is segmented in exactly the same way, making it easier for the system to learn to copy these tokens.
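A minimal sketch of applying a trained wordpiece vocabulary by greedy longest-match segmentation; the toy vocabulary and the "_" word-boundary marker are illustrative assumptions, not the exact conventions or training procedure of the paper.

```python
def wordpiece_segment(word, vocab):
    """Split a word into the longest wordpieces found in `vocab`."""
    pieces, start = [], 0
    token = "_" + word            # "_" marks the word boundary so the original
    while start < len(token):     # word sequence can be recovered unambiguously
        end = len(token)
        while end > start and token[start:end] not in vocab:
            end -= 1
        if end == start:          # no piece matches: fall back to a single character
            pieces.append(token[start])
            start += 1
        else:
            pieces.append(token[start:end])
            start = end
    return pieces

vocab = {"_J", "et", "_mak", "ers", "_fe", "ud", "_over", "_seat", "_width"}
print(wordpiece_segment("Jet", vocab))     # ['_J', 'et']
print(wordpiece_segment("makers", vocab))  # ['_mak', 'ers']
```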

Segmentation Approaches - Mixed Word/Character Model
A fixed-size word vocabulary is used.
OOV words are not collapsed into a single UNK symbol as in conventional models, but converted into the sequence of their constituent characters.
Special prefixes (<B>, <M>, <E>) mark the position of each character in the word and distinguish these characters from normal in-vocabulary characters.
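A minimal sketch of this mixed word/character scheme: in-vocabulary words pass through unchanged, while each OOV word is spelled out as characters tagged with <B>/<M>/<E> prefixes. The toy vocabulary is an assumption for illustration.

```python
def mixed_word_char(tokens, vocab):
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)                       # in-vocabulary word kept as-is
        else:
            for i, ch in enumerate(tok):          # OOV word spelled out character by character
                if i == 0:
                    out.append("<B>" + ch)        # beginning of the word
                elif i == len(tok) - 1:
                    out.append("<E>" + ch)        # end of the word
                else:
                    out.append("<M>" + ch)        # middle of the word
    return out

vocab = {"the", "translation", "of"}
print(mixed_word_char(["the", "translation", "of", "Miki"], vocab))
# ['the', 'translation', 'of', '<B>M', '<M>i', '<M>k', '<E>i']
```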

Training Criteria
Given a dataset of parallel sentence pairs $\mathcal{D} = \{(X^{(i)}, Y^{*(i)})\}_{i=1}^{N}$, maximum-likelihood (ML) training aims at maximizing
$\mathcal{O}_{\mathrm{ML}}(\theta) = \sum_{i=1}^{N} \log P_\theta\!\left(Y^{*(i)} \mid X^{(i)}\right)$.
This objective does not reflect the task reward as measured by the BLEU score and does not explicitly encourage a ranking among incorrect output sequences; it is therefore not robust to errors made during decoding.
New (RL-refinement) objective, where $r(Y, Y^{*(i)})$ is a per-sentence score (the GLEU score in the paper):
$\mathcal{O}_{\mathrm{RL}}(\theta) = \sum_{i=1}^{N} \sum_{Y} P_\theta\!\left(Y \mid X^{(i)}\right) r\!\left(Y, Y^{*(i)}\right)$.
In practice a linear combination of the two is used:
$\mathcal{O}_{\mathrm{Mixed}}(\theta) = \alpha \, \mathcal{O}_{\mathrm{ML}}(\theta) + \mathcal{O}_{\mathrm{RL}}(\theta)$.

Quantizable Model and Quantized Inference
NMT is computationally intensive, which leads to high inference latency.
To make the model quantizable, the accumulator values $c_t^l$ and the layer outputs $x_t^l$ are forced to stay within a fixed range $[-\delta, \delta]$ during training.
The forward computation therefore adds clipping, e.g. $c_t^{l\,\prime} = \max\!\left(-\delta, \min\!\left(\delta, c_t^l\right)\right)$, and the LSTM's internal gating logic is modified accordingly.

Quantizable Model and Quantized Inference
At inference time, floating-point operations are replaced with fixed-point integer operations (8- or 16-bit resolution).
Each weight matrix W is represented with a float scale vector and 16-bit integers within a fixed range.
The softmax layer uses an 8-bit integer matrix.
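A minimal sketch of the general idea: clip accumulator values into a fixed range, and represent a weight matrix by a per-row float scale plus a small-integer matrix. The row-wise scaling and the bit width chosen here are illustrative assumptions, not necessarily the paper's exact recipe.

```python
import numpy as np

def clip_accumulator(x, delta=1.0):
    # keep accumulator values within [-delta, delta], as in the clipping above
    return np.clip(x, -delta, delta)

def quantize_rowwise(W, bits=8):
    # represent W by a per-row float scale and small integers
    max_int = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    W_q = np.round(W / scale * max_int).astype(np.int32)
    return scale, W_q

def dequantize(scale, W_q, bits=8):
    max_int = 2 ** (bits - 1) - 1
    return W_q.astype(np.float32) * scale / max_int

W = np.random.randn(4, 6).astype(np.float32)
scale, W_q = quantize_rowwise(W, bits=8)
print(np.abs(W - dequantize(scale, W_q)).max())  # small reconstruction error
```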

Quantizable Model and Quantized Inference
New hardware: the Tensor Processing Unit (TPU).
Decoding speed on the WMT En→Fr dataset: quantized operations are used only on the TPU.

Decoder
Beam search with the scoring function
$s(Y, X) = \log P(Y \mid X) / lp(Y) + cp(X; Y)$,
where lp is a length-normalization term and cp a coverage penalty:
$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}, \quad cp(X; Y) = \beta \sum_{i=1}^{|X|} \log\!\left(\min\!\left(\sum_{j=1}^{|Y|} p_{i,j},\, 1.0\right)\right)$,
and $p_{i,j}$ is the attention probability of the $j$-th target word on the $i$-th source word. Setting $\alpha = \beta = 0$ gives pure beam search.
During beam search, pruning is used to speed up the search by 30% to 40%.
(The slide compares the tuned $\alpha$, $\beta$ values for ML training without and with RL refinement.)
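A minimal sketch of this scoring function; the alpha and beta values in the usage example are illustrative, not the tuned values from the paper.

```python
import math

def lp(length, alpha):
    # length-normalization term
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def cp(attention, beta):
    # coverage penalty; attention[i][j] is the attention probability of the
    # j-th target word on the i-th source word
    return beta * sum(math.log(min(sum(row), 1.0)) for row in attention)

def score(log_prob, target_len, attention, alpha, beta):
    # alpha = beta = 0 recovers plain log-probability beam search
    return log_prob / lp(target_len, alpha) + cp(attention, beta)

# toy usage with made-up numbers: 2 source words, 3 target words
attn = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
print(score(-4.2, 3, attn, alpha=0.2, beta=0.2))
```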

Experiments and Results – Datasets and Evaluation
Datasets: WMT En→Fr (36M sentence pairs) and WMT En→De (5M sentence pairs) for training, newstest2014 for testing, newstest2012 and newstest2013 for development. In addition, English→French, English→Spanish, and English→Chinese pairs from Google-internal datasets.
Evaluation: BLEU scores and side-by-side (SxS) evaluations in which human translators give scores in the range 0 to 6.

Experiments and Results – Training Details
12 replicas run on separate machines, updating shared parameters asynchronously.
Parameter initialization: uniform in [-0.04, 0.04].
Gradient clipping: gradient norms are clipped to 5.
Training on the WMT En→Fr dataset took about 6 days using 96 NVIDIA K80 GPUs.
After the ML training converges, the model is switched to RL refinement and further optimized with the mixed ML+RL objective.
Dropout is used with probabilities 0.2 and 0.3, and only during ML training.

Experiments and Results – Training Details
Adam + simple SGD schedule: Adam (learning rate 0.0002) for the first 60k steps, then simple SGD (learning rate 0.5) to speed up training and improve convergence. The learning rate is reduced dynamically afterwards.
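A minimal sketch, assuming PyTorch, of this two-phase schedule combined with the gradient clipping from the previous slide; the model and loss are stand-ins, and only the step counts, learning rates, and clipping norm follow the slides.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for the real NMT model
adam = torch.optim.Adam(model.parameters(), lr=2e-4)
sgd = torch.optim.SGD(model.parameters(), lr=0.5)

for step in range(100):                         # toy loop; real training runs far longer
    optimizer = adam if step < 60_000 else sgd  # switch to plain SGD after 60k steps
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # clip gradient norm to 5
    optimizer.step()
```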

Experiments and Results – Evaluation after ML

Experiments and Results – Model Ensemble and Human Evaluation

Experiments and Results – Results on Production Data

Conclusion
On the public WMT'14 translation benchmarks, GNMT's translation quality approaches or surpasses all previously published results, and the system also works at production scale.
Key findings:
Wordpiece modeling effectively handles open vocabularies and the challenge of morphologically rich languages, benefiting both translation quality and inference speed.
A combination of model and data parallelism can be used to efficiently train state-of-the-art sequence-to-sequence NMT models in roughly a week.
Model quantization drastically accelerates translation inference, allowing the use of these large models in a deployed production environment.
Many additional details, such as length normalization and coverage penalties, are essential to making NMT systems work well on real data.