Download presentation
Presentation is loading. Please wait.
Published byCharles Baldwin Modified over 9 years ago
1
Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University
2
Outline Motivation Method of combining multiple segmentation results Experiment & Evaluation Conclusion
3
Motivation 1/2 Training data Test dataF-measure News domain 97.62% Science83.89% ● CTB test data ● OOV : 3.47% ● Science annotated data ● OOV : 22.4% CTB training data
4
Motivation 2/2 Background: Development of a domain-specific Chinese-English machine translation system, Problem: Accuracy of Chinese Word Segmentation (CWS) on large amounts of training text often decreases. Many errors in translation knowledge extraction Therefore seriously affects translation quality
5
Our resolution Related work Domain-Adapted Chinese Word Segmentation Based on statistical Features In previous work, only 1-best result is adopted generally, and ignored the lower ranking result. Bilingually motivated domain-adapted word segmentation Many characters are aligned to NULL which decrease accuracy of Chinese segmentation. Our goal : Extend these method to augment domain adaptation of CWS
6
Our approach We propose a linear model to combine multiple Chinese word segmentation results of the two segmenters to augment domain adaptation. Segmenter based on n-gram features of Chinese raw corpus. Segmenter based on bilingually motivated features.
7
Chinese raw corpus Annotated corpus Chinese sentences Chinese sentences English Sentences English Sentences Training CRF model CRF segmenter Word alignment result Bilingual segmenter Results Result Linear- model for combining multiple results Linear- model for combining multiple results Segmentation result Framework
8
Annotated corpus Extracting statistical features Extracting statistical features Training CRF model Training CRF model N-gram statistical features N-gram statistical features Raw corpus Test data Extracting statistical features Extracting statistical features CRF Decoding CRF Decoding CRF segmenter Segmentation result
9
CRF segmenter Exploring statistical features of large-scale domain- specific Chinese raw corpus N-gram frequency feature N-gram AV (Accessor Variety) feature Output of CRF models N-best list of segmentation results Corresponding probability scores
10
Observation Some erroneous segmentations in 1-best result are segmented correctly in the low-ranking results. We intend to utilize correct parts within the 10-best results and the corresponding probability scores.
11
Bilingual segmenter The boundaries of Chinese word are inferable on parallel corpus. Marked word boundaries in English sentences. Alignment from English word to Chinese word.
12
Inference step 1. Conduct word alignments using GIZA++, regarding each character of Chinese sentence as one word. 2. For each alignment a i =, if the characters in C are consecutive in the sentence. Take C as a word Calculate its confidence score (refer to paper)
13
Linear model Calculate score of C i j being a word by combine multiple segmentation results F(i, j) denotes the score of characters from i to j being a word. λ (1≤k≤K) are weights of K segmentation results. Conf k (i,j) (1≤k≤K) is the confidence score of the kth segmentation result. seg k (i, j) (1≤k≤K) is a two-valued function.
14
Decoding C i j and F(i, j) being represented in a lattice The best sequence is found by dynamic programming algorithm. Search a sequence of words with a maximum product of their scores.
15
Training parameter λ Initial point λ l (1≤l≤K): A point in K-dimensional parameter space is randomly selected. The parameters λ l are optimized through iterative process. In each step, only one parameter is optimized, while keeping all other parameters fixed.
16
Experiment setting Experimental data: NTCIR-10 Chinese-English parallel patent description sentences Annotation set: randomly selected 300 sentence pairs. 150 sentences used for training the lattice parameters. 150 sentences used for evaluation.
17
Evaluation We conduct evaluations from two aspects: Evaluation (1): accuracy of Chinese word segmentation (F-measure) Evaluation (2): translation quality of MT system (BLEU)
18
Evaluation(1) Method Precision [%] Recall [%] F-measure [%] Bilingually motivated segmenter 73.131261.448066.7825 1-best of CRF segmenter(baseline) 90.243990.771090.5067 Linear-model (our approach) 91.665091.861491.7631 Accuracy of Chinese word segmentation
19
Evaluation(2) We develop a phrase-based SMT with Moses, using different Chinese segmenters 1-best of CRF segmenter (baseline) Linear model (our approach) Stanford Chinese segmenter NLPIR Chinese segmenter
20
Evaluation (2): result SMT using different Chinese segmenterBLEU[%] 1-best of CRF segmenter (baseline)30.53 Linear model (our approach)31.15 Stanford Chinese segmenter30.98 NLPIR Chinese segmenter30.56 Our approach increased by 0.62% compared to baseline. Performance of our approach is better than the two popular segmenters.
21
Result Analysis CRF 1-best result Corresponding English word Linear-model result 甘 || 氨酸 Glycine 甘氨酸 聚合 || 物 Polymer 聚合物 碳原子 Carbon || atoms 碳 || 原子 碘复合物 Iodine || complex 碘 || 复合物 抗 || 微生物 Antimicrobial 抗微生物
22
Conclusion We propose a linear model to combine multiple segmentation results from two segmenters to augment domain-adaptation. one based on n-gram statistical feature of large Chinese raw corpus. the other one based on bilingually motivated features of parallel corpus. The experimental results show that both F-measure of CWS result and the BLEU score of SMT are improved.
23
Thanks! Q&A
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.