Domain Mixing for Chinese-English Neural Machine Translation

Presentation transcript:

Christopher Leege, Department of Computer Science, Yale University, LILY Lab

Introduction

While there is a plethora of parallel corpora for similar language pairs, such as pairs of Indo-European languages, it is much harder to find corpora for language pairs that are more distant. My goal for this project was to translate Chinese fiction into English. While parallel corpora do exist for Chinese-English, the majority focus on domains such as news and legal documents, and I was unable to find any that dealt with modern Chinese fiction. As a result, I created my own corpus from a popular Chinese web novel that has an official English translation. I then trained four neural machine translation models and tested how they performed on both data sets.

Materials and Methods

I scraped the Chinese original and the official English translation of a Chinese web novel and manually aligned forty-five thousand sentences. For my other set of training data, I used the Casia2015 corpus, a corpus of Chinese-English sentence pairs collected from around the web. I used the NLTK tokenizer for the English side and jieba, a Chinese tokenizer, for the Chinese side. I used OpenNMT to train four neural machine translation models on the default encoder-decoder architecture (Figure 1). The first model was trained exclusively on the Casia2015 corpus; the second exclusively on the fiction corpus I compiled; the third on a mix of the two corpora; and the fourth on the same mix, but with target token mixing from Effective Domain Mixing for Neural Machine Translation (Britz et al., 2017). I performed these steps twice, training a total of eight models: for the first four models I used a smaller portion of the Casia2015 corpus, and for the second four I used the full Casia2015 corpus.

Figure 1. OpenNMT encoder-decoder architecture.
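The poster does not include the preprocessing code, so what follows is a minimal sketch of the two data-preparation steps it describes: tokenization with jieba and NLTK, and target token mixing in the style of Britz et al. (2017). The file names, the prepare_pair helper, and the exact form of the domain token (e.g. "<fiction>") are my assumptions, not details from the original.

```python
# Sketch of the data preparation described above (hypothetical file names).
import jieba                             # Chinese word segmentation
from nltk.tokenize import word_tokenize  # English tokenization; needs nltk.download("punkt")

def prepare_pair(zh_line, en_line, domain):
    """Tokenize a Chinese-English pair and tag the target with a domain token."""
    src = " ".join(jieba.lcut(zh_line.strip()))
    # Target token mixing: the domain token goes on the target side, so the
    # decoder learns to predict the domain along with the translation.
    tgt = "<%s> %s" % (domain, " ".join(word_tokenize(en_line.strip())))
    return src, tgt

# Append OpenNMT-style parallel training files for the fiction corpus; the
# Casia2015 data would be written the same way with domain="casia".
with open("fiction.zh") as fz, open("fiction.en") as fe, \
     open("train.src", "a") as fs, open("train.tgt", "a") as ft:
    for zh, en in zip(fz, fe):
        src, tgt = prepare_pair(zh, en, "fiction")
        fs.write(src + "\n")
        ft.write(tgt + "\n")
```

With the token on the target side, counting how often the first predicted token matches the true domain token yields the prediction accuracies reported in the Results below.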
Results

For the first four models, both the naïve mixture and the mixture with target token mixing performed better than the pure Casia and pure Fiction models. The naïve mixture performed better than the mixture with target tokens on both the Casia and the Fiction test data. The target token was correctly predicted 2,317 of 2,400 times on the Fiction test set and 1,044 of 1,050 times on the Casia test set.

Figure 2. BLEU scores for the first four models (smaller Casia2015 portion); models in rows, test data in columns.

Model       Casia   Fiction
Casia2015   8.79    2.01
Fiction     1.12    11.79
Mix         9.44    13.67
Mix+Token   9.37    13.01

For the second four models, the pure Fiction model performed better on the Fiction test data than either of the mixed models. The naïve mixture performed best on the Casia test data, while the pure Casia model and the mixture with target tokens performed equally well on it.

Figure 3. BLEU scores for the second four models (full Casia2015 corpus); models in rows, test data in columns.

Model       Casia   Fiction
Casia2015   14.38   3.42
Fiction     0.74    10.60
Mix         14.43   10.50
Mix+Token   14.38   10.03

Conclusion

It is unexpected that the naïve mixing of the Casia and Fiction data sets resulted in better BLEU scores than mixing with target tokens. Two possible explanations are as follows. First, the Casia data set may already be too mixed: it was pulled from various places around the web, and while it carries no domain tags, the range of domains it covers appears to be fairly broad. It is possible that this heterogeneity had already degraded translation quality. If possible, I would redo the experiment with a more homogeneous corpus as the base, then try naïve mixing and mixing with target tokens. Second, more data may simply be required: naïve mixing may have improved scores in both categories because the model was still being refined, so having a larger amount of data mattered more than having domain-specific data.

Acknowledgement

Thank you, Drago, for your guidance this semester.

Reference

Denny Britz, Quoc Le, and Reid Pryzant. 2017. Effective Domain Mixing for Neural Machine Translation.
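As an addendum: the poster does not say how the BLEU scores in Figures 2 and 3 were computed, so here is a minimal scoring sketch using NLTK's corpus_bleu. The file names are hypothetical, and any predicted domain token should be stripped from the model output before scoring.

```python
# Minimal BLEU-scoring sketch over tokenized, line-aligned files.
from nltk.translate.bleu_score import corpus_bleu

def bleu_from_files(ref_path, hyp_path):
    with open(ref_path) as rf, open(hyp_path) as hf:
        # corpus_bleu expects, per sentence, a list of reference token lists.
        references = [[line.split()] for line in rf]
        hypotheses = [line.split() for line in hf]
    # Drop a leading domain token such as "<fiction>" if the model emitted one.
    hypotheses = [h[1:] if h and h[0].startswith("<") else h for h in hypotheses]
    return 100 * corpus_bleu(references, hypotheses)

# Hypothetical file names: held-out reference vs. system output.
print("Fiction test BLEU: %.2f" % bleu_from_files("test.fiction.en", "pred.fiction.en"))
```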