Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.



Outline
● Motivation
● Method of combining multiple segmentation results
● Experiment & Evaluation
● Conclusion

Motivation 1/2
Training data: CTB training data
● Test data: CTB test data (news domain), OOV: 3.47%, F-measure: 97.62%
● Test data: science annotated data (science domain), OOV: 22.4%, F-measure: 83.89%

Motivation 2/2
● Background: development of a domain-specific Chinese-English machine translation system.
● Problem: the accuracy of Chinese Word Segmentation (CWS) often decreases on large amounts of training text.
● Segmentation errors cause many errors in translation knowledge extraction and therefore seriously affect translation quality.

Our resolution
Related work:
● Domain-adapted Chinese word segmentation based on statistical features: previous work generally adopts only the 1-best result and ignores the lower-ranking results.
● Bilingually motivated domain-adapted word segmentation: many characters are aligned to NULL, which decreases the accuracy of Chinese segmentation.
Our goal: extend these methods to augment domain adaptation of CWS.

Our approach
We propose a linear model that combines multiple Chinese word segmentation results from two segmenters to augment domain adaptation:
● a segmenter based on n-gram features of a Chinese raw corpus;
● a segmenter based on bilingually motivated features.

Framework (diagram)
● Chinese raw corpus and annotated corpus: training the CRF model, which gives the CRF segmenter.
● Chinese sentences and English sentences: word alignment result, which gives the bilingual segmenter.
● The results of both segmenters are passed to the linear model for combining multiple results, which outputs the segmentation result.

CRF segmenter (diagram)
● Training: statistical features are extracted for the annotated corpus, using n-gram statistical features from the raw corpus, and the CRF model is trained.
● Decoding: statistical features are extracted for the test data, and CRF decoding produces the segmentation result.

CRF segmenter
● Explores statistical features of a large-scale domain-specific Chinese raw corpus: the n-gram frequency feature and the n-gram AV (Accessor Variety) feature.
● Output of the CRF models: an N-best list of segmentation results with the corresponding probability scores.
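The two raw-corpus statistics named on this slide can be sketched as follows. This is only an illustration, not the authors' feature extractor: it assumes the commonly used definition of accessor variety, AV(s) = min(number of distinct characters seen immediately before s, number seen immediately after s), it ignores sentence boundaries, and the function name `ngram_stats` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict

def ngram_stats(sentences, max_n=3):
    """Return (frequency, accessor variety) for every character n-gram up to max_n."""
    freq = Counter()
    left = defaultdict(set)    # distinct characters seen just before each n-gram
    right = defaultdict(set)   # distinct characters seen just after each n-gram
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram].add(sent[i - 1])
                if i + n < len(sent):
                    right[gram].add(sent[i + n])
    # Assumed definition: AV is the smaller of the left and right varieties.
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return freq, av
```

An n-gram that occurs often and in many different contexts is more likely to be a word, which is why both values are useful as CRF features.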

Observation
● Some segmentations that are wrong in the 1-best result are correct in the lower-ranking results.
● We intend to utilize the correct parts within the 10-best results together with the corresponding probability scores.

Bilingual segmenter
The boundaries of Chinese words can be inferred from a parallel corpus:
● Word boundaries are marked in the English sentences.
● Alignments map English words to Chinese words.

Inference steps
1. Conduct word alignment using GIZA++, regarding each character of the Chinese sentence as one word.
2. For each alignment a_i between an English word and a set C of Chinese characters, if the characters in C are consecutive in the sentence:
   ● take C as a word;
   ● calculate its confidence score (refer to the paper).
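Step 2 can be sketched as below, assuming the alignments have already been produced (for example by GIZA++) and loaded into a dictionary from English word index to aligned Chinese character positions. That data layout and the function name `infer_words` are hypothetical, and the confidence score is left out because the slide defers its definition to the paper.

```python
def infer_words(chars, alignments):
    """chars: list of Chinese characters.
    alignments: dict mapping an English word index to the positions of the
    Chinese characters aligned to it (hypothetical layout).
    Returns (start, end, word) for every alignment whose characters are consecutive."""
    words = []
    for _, positions in sorted(alignments.items()):
        pos = sorted(set(positions))
        if not pos:
            continue  # the English word is aligned to nothing
        # The positions are consecutive exactly when they fill the range they span.
        if pos[-1] - pos[0] + 1 == len(pos):
            words.append((pos[0], pos[-1], "".join(chars[pos[0]:pos[-1] + 1])))
    return words
```

For example, an English word aligned to characters 0, 1 and 2 yields one candidate word, while a word aligned to characters 3 and 5 is skipped because character 4 is missing.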

Linear model
Calculate the score of C_i^j being a word by combining multiple segmentation results.
● F(i, j) denotes the score of the characters from i to j being a word.
● λ_k (1≤k≤K) are the weights of the K segmentation results.
● Conf_k(i, j) (1≤k≤K) is the confidence score of the kth segmentation result.
● seg_k(i, j) (1≤k≤K) is a two-valued function.
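The formula itself does not appear in the transcript. One form consistent with the definitions above is F(i, j) = Σ_k λ_k · Conf_k(i, j) · seg_k(i, j), with seg_k(i, j) equal to 1 when the kth result contains characters i to j as a word and 0 otherwise. The sketch below implements that assumed form; both the formula and the name `span_score` are assumptions, not taken from the slides.

```python
def span_score(i, j, results, weights):
    """Assumed linear combination F(i, j).
    results: one dict per segmentation result, mapping a word span (i, j),
             with inclusive character indices, to its confidence score.
    weights: the lambda_k values, one per result."""
    score = 0.0
    for lam, conf in zip(weights, results):
        seg = 1 if (i, j) in conf else 0          # assumed meaning of seg_k(i, j)
        score += lam * conf.get((i, j), 0.0) * seg
    return score
```

Under this reading, a span proposed by several results accumulates their weighted confidences, so a word that appears only in lower-ranking CRF results can still win if the bilingual segmenter supports it.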

Decoding
● Each candidate word C_i^j and its score F(i, j) are represented in a lattice.
● The best sequence is the sequence of words with the maximum product of scores.
● It is found by a dynamic programming algorithm.
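The lattice search can be sketched as a standard dynamic program over character positions. This is a minimal version under stated assumptions: the lattice is given as a dictionary from inclusive spans to positive scores, spans without a score are not edges, and the function name `decode` is hypothetical.

```python
def decode(n, scores):
    """n: number of characters in the sentence.
    scores: dict mapping an inclusive span (i, j) to its score F(i, j) > 0.
    Returns the span sequence with the maximum product of scores,
    or None if no path covers the whole sentence."""
    best = [0.0] * (n + 1)    # best[p]: best product for the first p characters
    back = [None] * (n + 1)   # back[p]: start position of the last word on that path
    best[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            cand = best[start] * scores.get((start, end - 1), 0.0)
            if cand > best[end]:
                best[end], back[end] = cand, start
    if n > 0 and back[n] is None:
        return None
    spans, p = [], n
    while p > 0:
        spans.append((back[p], p - 1))
        p = back[p]
    return spans[::-1]
```

For long sentences, summing log scores instead of multiplying raw scores avoids numerical underflow and selects the same path.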

Training parameter λ
● Initial point λ_l (1≤l≤K): a point in the K-dimensional parameter space is randomly selected.
● The parameters λ_l are optimized through an iterative process.
● In each step, only one parameter is optimized while all other parameters are kept fixed.
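The procedure described here is a coordinate-wise search. The sketch below shows the idea under assumptions that are not in the slides: each parameter is optimized by trying a fixed list of candidate values, and the objective (for example, F-measure on the training sentences) is supplied by the caller. The name `coordinate_search` and its parameters are hypothetical.

```python
import random

def coordinate_search(objective, k, candidates, iterations=5, seed=0):
    """Maximize objective(lam) over K weights, one weight at a time.
    candidates: values tried for each weight (an assumed grid search)."""
    rng = random.Random(seed)
    lam = [rng.choice(candidates) for _ in range(k)]   # random initial point
    best = objective(lam)
    for _ in range(iterations):
        improved = False
        for l in range(k):                 # optimize weight l, others fixed
            for value in candidates:
                trial = lam[:l] + [value] + lam[l + 1:]
                score = objective(trial)
                if score > best:
                    lam, best, improved = trial, score, True
        if not improved:
            break                          # no single-weight change helps
    return lam, best
```

Because the search can stop in a local optimum, restarting from several random initial points and keeping the best result is a common safeguard.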

Experiment setting
● Experimental data: NTCIR-10 Chinese-English parallel patent description sentences.
● Annotation set: 300 randomly selected sentence pairs.
● 150 sentences are used for training the lattice parameters and 150 sentences for evaluation.

Evaluation
We conduct evaluations from two aspects:
● Evaluation (1): accuracy of Chinese word segmentation (F-measure)
● Evaluation (2): translation quality of the MT system (BLEU)

Evaluation (1)
Table: accuracy of Chinese word segmentation.
Columns: Method | Precision [%] | Recall [%] | F-measure [%]
Rows:
● Bilingually motivated segmenter
● 1-best of CRF segmenter (baseline)
● Linear model (our approach)
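Precision, recall and F-measure for segmentation are usually computed over word spans. The sketch below assumes that standard word-level definition, since the slides do not describe the scoring script. The function names are hypothetical, and the example words come from the result analysis slide.

```python
def to_spans(words):
    """Convert a segmented sentence (list of words) to a set of (start, end) offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision, recall and F-measure of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)   # words with both boundaries right
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

With gold words 甘氨酸 and 聚合物 and a prediction that splits the first word into 甘 and 氨酸, only one of three predicted words is correct, so precision is 1/3 and recall is 1/2.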

Evaluation (2)
We build a phrase-based SMT system with Moses, using different Chinese segmenters:
● 1-best of CRF segmenter (baseline)
● Linear model (our approach)
● Stanford Chinese segmenter
● NLPIR Chinese segmenter

Evaluation (2): result
SMT using different Chinese segmenters | BLEU [%]
1-best of CRF segmenter (baseline) | 30.53
Linear model (our approach) | 31.15
Stanford Chinese segmenter | 30.98
NLPIR Chinese segmenter | 30.56
● Our approach improves BLEU by 0.62 points over the baseline (30.53 to 31.15).
● Our approach also performs better than the two popular segmenters.

Result Analysis
CRF 1-best result | Corresponding English word | Linear-model result
甘 || 氨酸 | Glycine | 甘氨酸
聚合 || 物 | Polymer | 聚合物
碳原子 | Carbon || atoms | 碳 || 原子
碘复合物 | Iodine || complex | 碘 || 复合物
抗 || 微生物 | Antimicrobial | 抗微生物

Conclusion
We propose a linear model that combines multiple segmentation results from two segmenters to augment domain adaptation:
● one based on n-gram statistical features of a large Chinese raw corpus;
● the other based on bilingually motivated features of a parallel corpus.
The experimental results show that both the F-measure of the CWS result and the BLEU score of the SMT system are improved.

Thanks! Q&A