Domain Mixing for Chinese-English Neural Machine Translation

Presentation transcript:

Christopher Leege, Department of Computer Science, Yale University, LILY Lab

Introduction

While there is a plethora of parallel corpora for similar language pairs, such as pairs of Indo-European languages, it is much harder to find corpora for language pairs that are more distant. My goal for this project was to translate Chinese fiction into English. While parallel corpora do exist for Chinese-English, the majority focus on domains such as news and legal documents, and I was unable to find any that dealt with modern Chinese fiction. As a result, I created my own corpus from a popular Chinese web novel that has an official English translation. I then trained four neural machine translation models and tested how they performed on both data sets.

Materials and Methods

I scraped the Chinese original and the official English translation of a Chinese web novel and manually aligned forty-five thousand sentences. For my other set of training data, I used the Casia2015 corpus, a corpus of Chinese-English sentence pairs collected from around the web. I used the NLTK tokenizer for the English side and jieba, a Chinese tokenizer, for the Chinese side. I used OpenNMT to train four neural machine translation models on the default encoder-decoder architecture (Figure 1). The first model was trained exclusively on the Casia2015 corpus; the second exclusively on the fiction corpus I compiled; the third on a mix of the two corpora; and the fourth on the same mix, but with target token mixing from Effective Domain Mixing for Neural Machine Translation (Britz et al., 2017). I performed these steps twice, training a total of eight models: for the first four models I used a smaller portion of the Casia2015 corpus, and for the second four I used the full Casia2015 corpus.

Figure 1. OpenNMT encoder-decoder architecture.
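The poster does not include the preprocessing code, so what follows is a minimal sketch of the two data-preparation steps it describes: tokenization with jieba and NLTK, and target token mixing in the style of Britz et al. (2017). The file names, the prepare_pair helper, and the exact form of the domain token (e.g. "<fiction>") are my assumptions, not details from the original.

```python
# Sketch of the data preparation described above (hypothetical file names).
import jieba                             # Chinese word segmentation
from nltk.tokenize import word_tokenize  # English tokenization; needs nltk.download("punkt")

def prepare_pair(zh_line, en_line, domain):
    """Tokenize a Chinese-English pair and tag the target with a domain token."""
    src = " ".join(jieba.lcut(zh_line.strip()))
    # Target token mixing: the domain token goes on the target side, so the
    # decoder learns to predict the domain along with the translation.
    tgt = "<%s> %s" % (domain, " ".join(word_tokenize(en_line.strip())))
    return src, tgt

# Append OpenNMT-style parallel training files for the fiction corpus; the
# Casia2015 data would be written the same way with domain="casia".
with open("fiction.zh") as fz, open("fiction.en") as fe, \
     open("train.src", "a") as fs, open("train.tgt", "a") as ft:
    for zh, en in zip(fz, fe):
        src, tgt = prepare_pair(zh, en, "fiction")
        fs.write(src + "\n")
        ft.write(tgt + "\n")
```

With the token on the target side, counting how often the first predicted token matches the true domain token yields the prediction accuracies reported in the Results below.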
Results

For the first four models, both the naïve mixture and the mixture with target token mixing performed better than the pure Casia and pure Fiction models. The naïve mixture performed better than the mixture with target tokens on both the Casia and the Fiction test data. The target token was correctly predicted 2,317 of 2,400 times on the Fiction test set and 1,044 of 1,050 times on the Casia test set.

Figure 2. BLEU scores for the first four models (smaller Casia2015 portion); models in rows, test data in columns.

Model       Casia   Fiction
Casia2015   8.79    2.01
Fiction     1.12    11.79
Mix         9.44    13.67
Mix+Token   9.37    13.01

For the second four models, the pure Fiction model performed better on the Fiction test data than either of the mixed models. The naïve mixture performed best on the Casia test data, while the pure Casia model and the mixture with target tokens performed equally well on it.

Figure 3. BLEU scores for the second four models (full Casia2015 corpus); models in rows, test data in columns.

Model       Casia   Fiction
Casia2015   14.38   3.42
Fiction     0.74    10.60
Mix         14.43   10.50
Mix+Token   14.38   10.03

Conclusion

It is unexpected that the naïve mixing of the Casia and Fiction data sets resulted in better BLEU scores than mixing with target tokens. Two possible explanations are as follows. First, the Casia data set may already be too mixed: it was pulled from various places around the web, and while it carries no domain tags, the range of domains it covers appears to be fairly broad. It is possible that this heterogeneity had already degraded translation quality. If possible, I would redo the experiment with a more homogeneous corpus as the base, then try naïve mixing and mixing with target tokens. Second, more data may simply be required: naïve mixing may have improved scores in both categories because the model was still being refined, so having a larger amount of data mattered more than having domain-specific data.

Acknowledgement

Thank you, Drago, for your guidance this semester.

Reference

Denny Britz, Quoc Le, and Reid Pryzant. 2017. Effective Domain Mixing for Neural Machine Translation.
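As an addendum: the poster does not say how the BLEU scores in Figures 2 and 3 were computed, so here is a minimal scoring sketch using NLTK's corpus_bleu. The file names are hypothetical, and any predicted domain token should be stripped from the model output before scoring.

```python
# Minimal BLEU-scoring sketch over tokenized, line-aligned files.
from nltk.translate.bleu_score import corpus_bleu

def bleu_from_files(ref_path, hyp_path):
    with open(ref_path) as rf, open(hyp_path) as hf:
        # corpus_bleu expects, per sentence, a list of reference token lists.
        references = [[line.split()] for line in rf]
        hypotheses = [line.split() for line in hf]
    # Drop a leading domain token such as "<fiction>" if the model emitted one.
    hypotheses = [h[1:] if h and h[0].startswith("<") else h for h in hypotheses]
    return 100 * corpus_bleu(references, hypotheses)

# Hypothetical file names: held-out reference vs. system output.
print("Fiction test BLEU: %.2f" % bleu_from_files("test.fiction.en", "pred.fiction.en"))
```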