Chinese Word Segmentation Adaptation for Statistical Machine Translation Hailong Cao, Masao Utiyama and Eiichiro Sumita Language Translation Group NICT&ATR.

Presentation transcript:


Introduction
Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT).
The performance of CWS has an impact on the results of SMT.
The common solution in Chinese-to-English translation has been to segment the Chinese text with an off-the-shelf CWS tool trained on a manually segmented corpus.

Main problems of using an off-the-shelf CWS tool
Word granularity in the existing corpora is not necessarily suitable for SMT.
– None of the existing corpora was developed specifically for SMT.
When the CWS tool is used in a domain that differs from that of its training corpus, its disambiguation ability drops and the performance of the SMT system suffers.

Clues to solve the problems
When a CWS tool is used to segment the Chinese side of the Chinese-English parallel corpus that trains the SMT model, the English side is often neglected.
In fact, the English side contains many clues that can be used to determine an appropriate word granularity and to resolve CWS ambiguity.

Our solution: Adaptation
We use two state-of-the-art CWS tools to preprocess the Chinese texts for our SMT system.
– ICTCLAS (ICT, China)
– hybrid model (NICT, ATR)
We resolve CWS ambiguity using information acquired from Chinese-character-to-English-word alignment performed with the GIZA++ toolkit.

Adaptation algorithm for CWS
For each sentence C in the Chinese side of the parallel corpus {
    W1 = segment C with ICTCLAS;
    W2 = segment C with the hybrid model;
    If ( W1 == W2 )
        add W1 into the training set of the hybrid model;
    Else {
        A = character-to-word alignment of C and its English translation;
        W3 = resolve (W1, W2, A);
        add W3 into the training set of the hybrid model;
    }
}
Retrain the hybrid model with the augmented data;
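The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the four callables (`segment_a`, `segment_b`, `align_chars`, `resolve`) are hypothetical stand-ins for ICTCLAS, the hybrid model, the GIZA++ character-to-word alignment, and the disambiguation step, and retraining the hybrid model is left to the caller.

```python
from typing import Any, Callable, List, Sequence, Tuple

Segmentation = List[str]


def build_adapted_training_set(
    parallel_corpus: Sequence[Tuple[str, str]],
    segment_a: Callable[[str], Segmentation],   # hypothetical wrapper, e.g. around ICTCLAS
    segment_b: Callable[[str], Segmentation],   # hypothetical wrapper, e.g. around the hybrid model
    align_chars: Callable[[str, str], Any],     # hypothetical character-to-word aligner
    resolve: Callable[[Segmentation, Segmentation, Any], Segmentation],
) -> List[Segmentation]:
    """Collect one segmentation per sentence to add to the hybrid model's training set."""
    augmented: List[Segmentation] = []
    for chinese, english in parallel_corpus:
        w1 = segment_a(chinese)
        w2 = segment_b(chinese)
        if w1 == w2:
            # Both tools agree, so the segmentation is taken as is.
            augmented.append(w1)
        else:
            # The tools disagree: let the English side decide.
            alignment = align_chars(chinese, english)
            augmented.append(resolve(w1, w2, alignment))
    return augmented
```

Passing the components in as functions keeps the sketch independent of any particular tool interface; the returned list is what would be merged into the training data before retraining.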

An example
There are two possible ways to segment the character sequence “马上来” in the Chinese sentence “有人受伤了请马上来” (there has been an injury please come right away):
– “马上 (right away) + 来 (come)”
– “马 (horse) + 上来 (come up)”
All four of these words are frequent in Chinese text, and it is very difficult for any CWS tool to make the right decision without enough training data.
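To make the ambiguity concrete, the sketch below enumerates every way to split a string into words from a lexicon. The lexicon here is a toy assumption that contains only the four words named in the example; a real CWS tool works with a far larger vocabulary and a statistical model.

```python
from typing import List, Set


def enumerate_segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every split of `text` whose pieces all appear in `lexicon`."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Extend this first word with every segmentation of the remainder.
            for rest in enumerate_segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results


# Toy lexicon (assumption): only the four words from the example.
TOY_LEXICON = {"马", "来", "马上", "上来"}
candidates = enumerate_segmentations("马上来", TOY_LEXICON)
```

With this lexicon, `candidates` holds exactly the two segmentations listed above, and nothing in the Chinese string alone says which one is intended.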

An example (Cont.)
The alignment result of the above sentence pair:
– 有-1 人-2 受-3 伤-4 了-5 请-6 马-7 上-8 来-9
– NULL ({ }) there ({ 1 }) has ({ }) been ({ }) a ({ }) injury ({ }) please ({ 6 }) come ({ 9 }) right ({ }) away ({ 7 8 })
“马-7” and “上-8” are aligned to the same English word “away”, while “来-9” is aligned to the word “come”. So we choose “马上 (right away) + 来 (come)” as the correct segmentation.
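The slides do not spell out how resolve() works, so the following is only one plausible heuristic, written as a sketch. It parses an alignment line in the format shown above and prefers the candidate that less often (a) spreads one Chinese word over several English words or (b) puts a word boundary between two characters aligned to the same English word. The function names, the penalty definition, and the segmentation assumed for the first six characters are all illustrative assumptions.

```python
import re
from typing import List, Optional

# Matches tokens such as "away ({ 7 8 })" in the alignment line.
_TOKEN = re.compile(r"(\S+) \(\{([^}]*)\}\)")


def parse_alignment(line: str, num_chars: int) -> List[Optional[int]]:
    """For each Chinese character, return the index of its English token, or None."""
    aligned: List[Optional[int]] = [None] * num_chars
    for index, (word, positions) in enumerate(_TOKEN.findall(line)):
        if word == "NULL":
            continue  # characters listed under NULL stay unaligned
        for pos in positions.split():
            aligned[int(pos) - 1] = index  # positions in the line are 1-based
    return aligned


def penalty(words: List[str], aligned: List[Optional[int]]) -> int:
    """Count disagreements between a segmentation and the alignment."""
    score = 0
    start = 0
    for word in words:
        end = start + len(word)
        targets = {t for t in aligned[start:end] if t is not None}
        if len(targets) > 1:
            score += 1  # one Chinese word is spread over several English words
        if start > 0 and aligned[start] is not None and aligned[start - 1] == aligned[start]:
            score += 1  # boundary splits characters aligned to the same English word
        start = end
    return score


def resolve(w1: List[str], w2: List[str], aligned: List[Optional[int]]) -> List[str]:
    """Pick the candidate with the lower penalty (ties go to the first)."""
    return w1 if penalty(w1, aligned) <= penalty(w2, aligned) else w2


line = ("NULL ({ }) there ({ 1 }) has ({ }) been ({ }) a ({ }) injury ({ }) "
        "please ({ 6 }) come ({ 9 }) right ({ }) away ({ 7 8 })")
aligned = parse_alignment(line, num_chars=9)
prefix = ["有", "人", "受伤", "了", "请"]  # assumed segmentation of the shared part
chosen = resolve(prefix + ["马", "上来"], prefix + ["马上", "来"], aligned)
```

On the example sentence, the candidate ending in 马 + 上来 is penalized twice (上来 straddles “away” and “come”, and the 马/上 boundary splits “away”), so `chosen` ends in 马上 + 来, which matches the decision described above.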

Experiments
To evaluate the effect of our CWS adaptation algorithm, we apply it to the Chinese-to-English translation task of the IWSLT.
For comparison, we use three CWS tools.
– ICTCLAS
– hybrid model
– Re-trained hybrid model

Experimental setting
Our SMT system is based on a fairly typical phrase-based model (Finch and Sumita, 2008).
We use a 5-gram language model trained with modified Kneser-Ney smoothing.
Minimum error rate training (MERT) with respect to BLEU score is used to tune the decoder’s parameters.

[Results tables: BLEU, NIST, WER, METEOR and (BLEU+METEOR)/2 on three development sets for each CWS tool (hybrid model, re-trained hybrid model, ICTCLAS); asterisks mark individual scores; the numeric values are missing from the transcript.]

Related work
Xu et al. (2005) integrate the segmentation process with the search for the best translation.
Xu et al. (2006) propose an integration algorithm of English-Chinese word segmentation and alignment.
Ma et al. (2007) introduce a method to pack words for word alignment.
Chang et al. (2008) propose an algorithm to directly optimize segmentation granularity for translation quality.

Conclusion
A very simple and effective adaptation algorithm is proposed.
Experimental results show that our method can lead to better performance than two state-of-the-art CWS tools.

Future work
Currently, only two segmentation candidates are considered for each sentence. In the future, we should extend our method to handle n-best segmentations, which leaves more room for improvement.
Currently, we simply combine all the SIGHAN corpora, which adopt various specifications, so the word granularity is likely to be inconsistent. We plan to acquire a uniform specification by making use of alignment information.

Any comments are welcome! Thank you!