Deep Learning Based Machine Translation
Zhiwei Yu
Content
• Introduction
• Basic Structure
– Encoder-Decoder
– Attention
• Current Problem
• Advancing NMT
– Translation Quality
– Range of Application
– Efficiency
• Future Work
Introduction
• Commercial Use
– Google translates over 100 billion words a day
– Facebook has just rolled out its new homegrown MT
– eBay uses MT to enable cross-border trade
• Academic Influence (MT papers at recent conferences)
– ACL17: 8.6% (long 14/195; short 12/107)
– EMNLP17: 8.0% (26/322)
– NAACL18: 6.3% (long 13/207; short 8/125)
– ACL18: 5.7% (long 12/258; short 10/126)
– Other papers appear in IJCAI, AAAI, NIPS, ICLR, TACL, TASLP, etc.
Introduction (Junczys-Dowmunt et al., 2016)
Basic Structure
Encoder-Decoder Model
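To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch (illustrative layer sizes and names, not the exact model on the slides): the encoder reads the source sentence into hidden states, and the decoder generates target words conditioned on the encoder's final state.

```python
# Minimal RNN encoder-decoder sketch (illustrative sizes; not the exact model in the slides).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        # outputs: per-word source states (used later by attention);
        # hidden: a fixed-size summary of the whole source sentence.
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, tgt_in, hidden):           # tgt_in: (batch, tgt_len), teacher forcing
        outputs, hidden = self.rnn(self.embed(tgt_in), hidden)
        return self.out(outputs), hidden         # logits over the target vocabulary
```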
Attention Mechanism (Bahdanau et al., 2015; Luong et al., 2015)
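A minimal sketch of additive (Bahdanau-style) attention, with illustrative dimensions rather than the exact parameterization of the cited papers: at each decoder step, all encoder states are scored, the scores are turned into alignment weights, and a context vector is built as their weighted sum.

```python
# Additive (Bahdanau-style) attention over encoder states: a minimal sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                              # (batch, src_len)
        weights = F.softmax(scores, dim=-1)         # alignment distribution over source words
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights                     # context feeds the next decoder step
```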
Current Problem
• Limited vocabulary (e.g., UNK)
• Source coverage issues (e.g., over-translation, under-translation)
• Translation is not faithful (e.g., low-frequency words)
Advancing NMT
• Translation Quality: Add Linguistic Knowledge; SMT+NMT; Model Structure; ...
• Range of Application: Unsupervised Learning; Multilingual Translation; Multimodal Translation; ...
• Efficiency: Parallel Processing; Decoding Efficiency; ...
Advancing NMT: Translation Quality
Add Linguistic Knowledge
Incorporate syntactic information into the encoder or decoder to enhance structural knowledge:
• Syntactic trees can be used to model the grammatical validity of a translation.
• Partial syntactic structures can be used as additional context to facilitate the prediction of future target words.
(Wu et al., 2017)
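One lightweight way to expose source syntax to a sequence model is to linearize the parse tree and feed the bracketed token sequence to an ordinary encoder; the helper below is a hypothetical sketch of that idea (the papers on the next slide use richer tree and graph encoders).

```python
# Hypothetical helper: linearize a constituency tree into a token sequence so that
# a standard sequence encoder also sees syntactic brackets.
def linearize(tree):
    """tree = (label, children) for internal nodes, or a word string for leaves."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

parse = ("S", [("NP", ["the", "cat"]), ("VP", ["sat"])])
print(linearize(parse))
# ['(S', '(NP', 'the', 'cat', ')', '(VP', 'sat', ')', ')']
```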
Add Linguistic Knowledge
• ACL17
– Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder
– Modeling Source Syntax for Neural Machine Translation
– Sequence-to-Dependency Neural Machine Translation
– Chunk-based Decoder for Neural Machine Translation
– Chunk-Based Bi-Scale Decoder for Neural Machine Translation (short paper)
– Learning to Parse and Translate Improves Neural Machine Translation (short paper)
– Towards String-To-Tree Neural Machine Translation (short paper)
• EMNLP17
– Graph Convolutional Encoders for Syntax-aware Neural Machine Translation
– Neural Machine Translation with Source-Side Latent Graph Parsing
– Neural Machine Translation with Source Dependency Representation (short paper)
• ACL18
– Forest-Based Neural Machine Translation
– Practical Target Syntax for Neural Machine Translation (short paper)
SMT+NMT
Attempts to bring SMT methods, models, and results into NMT, e.g., structural bias, position bias, fertility, the Markov condition, and bilingual symmetry.
BLEU: +2.1 (Zhang et al., 2017)
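As one illustration of how SMT knowledge can be combined with NMT scores, here is a hypothetical log-linear reranking sketch over an n-best list; the feature names and weights are made up for the example and are not taken from the cited work.

```python
# Hypothetical log-linear reranking of NMT n-best hypotheses with SMT-style features
# (e.g. a lexical-translation score and a length penalty); weights are illustrative.
def rerank(nbest, weights={"nmt": 1.0, "lex": 0.5, "len": -0.2}):
    """nbest: list of dicts, each holding per-hypothesis feature values."""
    def score(hyp):
        return sum(weights[name] * hyp[name] for name in weights)
    return max(nbest, key=score)

nbest = [
    {"text": "the cat sat",    "nmt": -2.1, "lex": -1.0, "len": 3},
    {"text": "a cat sat down", "nmt": -2.3, "lex": -0.7, "len": 4},
]
print(rerank(nbest)["text"])   # picks the hypothesis with the best combined score
```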
SMT+NMT
• ACL17
– Incorporating Word Reordering Knowledge into Attention-based Neural Machine Translation
– Prior Knowledge Integration for Neural Machine Translation using Posterior Regularization
– Neural System Combination for Machine Translation
• EMNLP17
– Neural Machine Translation Leveraging Phrase-based Models in a Hybrid Search
– Translating Phrases in Neural Machine Translation
• ICLR18
– Towards Neural Phrase-based Machine Translation
Model Structure
• Coverage: over-translation, under-translation
• Context Gate: faithful translation
• External Memory (Neural Turing Machine, Memory Network): long-range dependency & memory space
• Character/Subword-Level NMT: OOV
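A minimal sketch of coverage-aware attention (in the spirit of coverage models, with illustrative dimensions): the running sum of past attention weights is fed back into the scorer, so source words that have already been attended to can be down-weighted, which targets over- and under-translation.

```python
# Coverage-aware attention scoring: a minimal sketch, not an exact reproduction of any paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=512, attn_dim=256):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_cov = nn.Linear(1, attn_dim, bias=False)   # coverage feature per source word
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs, coverage):
        # coverage: (batch, src_len) running sum of previous attention weights
        scores = self.v(torch.tanh(
            self.W_enc(enc_outputs)
            + self.W_dec(dec_state).unsqueeze(1)
            + self.W_cov(coverage.unsqueeze(-1))
        )).squeeze(-1)
        weights = F.softmax(scores, dim=-1)
        coverage = coverage + weights                     # update running coverage
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights, coverage
```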
Model Structure
• ACL16
– Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
– Pointing the Unknown Words
– Modeling Coverage for Neural Machine Translation
• EMNLP16
– Sequence-Level Knowledge Distillation
• ACL18
– Attention Focusing for Neural Machine Translation by Bridging Source and Target Embeddings
– Sparse and Constrained Attention for Neural Machine Translation
Advancing NMT: Range of Application
Unsupervised Learning
Machine translation without enough parallel corpora.
Recipe: a good initial model + denoising autoencoder + back-translation & iteration.
(Yang et al., 2018)
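The recipe above can be written as a schematic training loop; every argument below (corpora, translation functions, training routines) is a caller-supplied placeholder, not a real API.

```python
# Iterative back-translation: a schematic loop matching the recipe on the slide.
def unsupervised_nmt(mono_src, mono_tgt, translate_s2t, translate_t2s,
                     train_denoising, train_on_pairs, n_iterations=3):
    """All arguments are placeholders the caller must supply; this only shows the loop."""
    for _ in range(n_iterations):
        # 1. Denoising auto-encoding on monolingual text keeps both decoders fluent.
        train_denoising("src", mono_src)
        train_denoising("tgt", mono_tgt)
        # 2. Back-translate monolingual data to build pseudo-parallel corpora.
        pseudo_from_tgt = [(translate_t2s(t), t) for t in mono_tgt]  # synthetic src, real tgt
        pseudo_from_src = [(s, translate_s2t(s)) for s in mono_src]  # real src, synthetic tgt
        # 3. Retrain both directions on the pseudo-parallel data, then iterate.
        pairs = pseudo_from_tgt + pseudo_from_src
        train_on_pairs("src->tgt", pairs)
        train_on_pairs("tgt->src", [(t, s) for (s, t) in pairs])
```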
Unsupervised Learning
• ACL17
– Data Augmentation for Low-Resource Neural Machine Translation (short paper)
• ICLR18
– Word Translation Without Parallel Data
– Unsupervised Machine Translation Using Monolingual Corpora Only
– Unsupervised Neural Machine Translation
• ACL18
– Unsupervised Neural Machine Translation with Weight Sharing
– Adaptive Knowledge Sharing in Multi-Task Learning: Improving Low-Resource Neural Machine Translation (short paper)
Multilingual Translation
• ACL15
– Multi-task Learning for Multiple Language Translation (en-fr/du/sp)
• NAACL16
– Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism
– Multi-Source Neural Translation
Multimodal Translation
Use the information contained in an accompanying image to improve translation quality.
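One simple way to use the image (a hedged sketch, not the doubly-attentive model cited on the next slide): project a global image feature, e.g. from a pretrained CNN, and use it to initialize the decoder state.

```python
# Hypothetical sketch: turn a global image feature into an initial decoder hidden state.
import torch
import torch.nn as nn

class ImageInitDecoderState(nn.Module):
    def __init__(self, img_dim=2048, dec_dim=512):
        super().__init__()
        self.proj = nn.Linear(img_dim, dec_dim)

    def forward(self, img_feat):          # img_feat: (batch, img_dim) global image feature
        # Shape (1, batch, dec_dim) so it can be passed as the initial hidden state of a GRU decoder.
        return torch.tanh(self.proj(img_feat)).unsqueeze(0)
```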
Multimodal Translation
• ACL17
– Doubly-Attentive Decoder for Multi-modal Neural Machine Translation
• EMNLP17
– Incorporating Global Visual Features into Attention-based Neural Machine Translation
– An empirical study of the effectiveness of images on Multi-modal Neural Machine Translation
• ACL18
– Learning Translations via Images: A Large Multilingual Dataset and Comprehensive Study
Advancing NMT: Efficiency
Parallel Processing
Speed up the training process by computing over all positions of a sequence in parallel.
(Gehring et al., 2017)
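The speed-up comes from replacing the step-by-step recurrence with operations that see the whole sequence at once. Below is a minimal convolutional encoder sketch (illustrative sizes, without the gating and positional embeddings of the full ConvS2S model).

```python
# Convolutional encoder sketch: every layer processes all source positions in parallel.
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, vocab, emb_dim=256, hid_dim=512, kernel=3, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.input = nn.Linear(emb_dim, hid_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(hid_dim, hid_dim, kernel, padding=kernel // 2) for _ in range(layers)
        )

    def forward(self, src):                                # src: (batch, src_len)
        x = self.input(self.embed(src)).transpose(1, 2)    # (batch, hid_dim, src_len)
        for conv in self.convs:
            x = torch.relu(conv(x)) + x                    # residual; whole sequence at once
        return x.transpose(1, 2)                           # (batch, src_len, hid_dim)
```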
Parallel Processing
• ACL17
– A Convolutional Encoder Model for Neural Machine Translation (Convolutional Sequence to Sequence Learning)
• NIPS17
– Attention Is All You Need
• ICLR18
– Non-Autoregressive Neural Machine Translation
Decoding Efficiency
Improve decoding efficiency by shrinking the run-time vocabulary, adopting distilled models, and training the decoder separately.
• ACL17
– Neural Machine Translation via Binary Code Prediction
– Speeding up Neural Machine Translation Decoding by Shrinking Run-time Vocabulary (short paper)
• EMNLP17
– Trainable Greedy Decoding for Neural Machine Translation
– Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU (short paper)
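A hedged sketch of run-time vocabulary shrinking: score only a per-sentence candidate vocabulary (for example, frequent words plus lexical translations of the source words) instead of the full target vocabulary, which makes the output layer much cheaper at decoding time.

```python
# Restrict the output softmax to a small candidate vocabulary (illustrative sizes).
import torch

def restricted_logits(decoder_state, output_weight, candidate_ids):
    """decoder_state: (batch, hid); output_weight: (vocab, hid) full output matrix;
    candidate_ids: (num_candidates,) indices of the shrunken vocabulary."""
    sub_weight = output_weight[candidate_ids]           # (num_candidates, hid)
    return decoder_state @ sub_weight.t()               # logits over candidates only

# Illustrative usage with random tensors:
state = torch.randn(2, 512)
W_out = torch.randn(30000, 512)
cands = torch.tensor([0, 1, 2, 57, 943, 1200])          # hypothetical candidate word ids
print(restricted_logits(state, W_out, cands).shape)     # torch.Size([2, 6])
```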
Future Work
• Interpretability
– Use linguistic knowledge to explain the performance of the models.
• External Knowledge
– Puns, metaphor, metonymy, allegory, paradox, etc.
– e.g., "Make hay while the sun shines."
• Larger-Context NMT
– Paragraphs, articles, books, etc.
– Needs: effective attention mechanisms for long sequences; tracking states over many sentences (dialogue systems)
• Unsupervised Learning
Future Work
• Beyond Maximum Likelihood Estimation
– Disadvantages of MLE: exposure bias; weak correlation with the true reward
– Potential solution: maximize a sequence-level global loss; incorporate inference into training
– Stochastic inference: policy gradient (Ranzato et al., ICLR 2016; Bahdanau et al., arXiv 2016); minimum risk training (Shen et al., ACL 2016)
– Deterministic inference: learning to search (Wiseman & Rush, arXiv 2016)
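A hedged sketch of the minimum risk training objective (following the idea in Shen et al., ACL 2016): sampled translations are reweighted by a renormalized model probability, and the expected risk, here 1 minus sentence-level BLEU, is minimized; sampling and BLEU computation are assumed to happen elsewhere and enter as tensors.

```python
# Minimum risk training loss over a set of sampled translations (schematic).
import torch

def minimum_risk_loss(sample_log_probs, sample_bleu, alpha=0.005):
    """sample_log_probs: (num_samples,) log p(y|x) of sampled translations;
    sample_bleu: (num_samples,) sentence-level BLEU of each sample in [0, 1];
    alpha: sharpness of the renormalized distribution over the samples."""
    q = torch.softmax(alpha * sample_log_probs, dim=0)   # distribution over the samples
    risk = 1.0 - sample_bleu                             # lower BLEU = higher risk
    return (q * risk).sum()                              # expected risk to minimize

# Illustrative numbers: three sampled translations for one source sentence.
log_p = torch.tensor([-10.0, -12.0, -15.0])
bleu = torch.tensor([0.45, 0.30, 0.10])
print(minimum_risk_loss(log_p, bleu))
```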
Thank you