Download presentation
Presentation is loading. Please wait.
Published byHubert Norman Modified over 9 years ago
1
The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Devi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan Lu, Qun Liu Institute of Computing Technology Chinese Academy of Sciences 2007.09.15– 2007.08.16
2
Outline Overview MT Systems Bruin Confucius Lynx Official Evaluation Discussion Summary
3
Introduction of Our Group Multilingual Interaction Technology Laboratory, Institute of Computing Technology, Chinese Academy Sciences Long history for working on MT Rule-based Example-based Focus on SMT from 2004 Website: http://mtgroup.ict.ac.cn/http://mtgroup.ict.ac.cn/
4
People Working on SMT at ICT Staffs Qun Liu (Researcher) Yajuan Lu (Associate Researcher) Yang Liu (Associate Researcher) Weihua Luo (Assistant Researcher) PhD Students Zhongjun He Haitao Mi Jinsong Su Yang Feng Master Students Yun Huang Wenbin Jiang Zhixiang Ren …
5
IWSLT 2007 Evaluation Chinese-English transcript translation task
6
Systems for IWSLT 2007 Evaluation MT Systems: Bruin (formally syntax-based) Confucius (extended phrase-based) Lynx (linguistically syntax-based)
7
Outline Overview MT Systems Bruin Lynx Confucius Official Evaluation Discussion Summary
8
Bruin Bruin is a formally syntax-based system MaxEnt Reordering Model build on BTG rules Regard reordering as a binary classification Building a MaxEnt-based classifier Using boundary words instead of whole phrases as features for the classifier target source straightinverted
9
Features Source and target boundary words (lexical feature) Combinations of boundary words (collocation feature) C1 E1 C2 E2 Target boundary head words Source boundary head words b1b1 b2b2
10
Training and Decoding Training the model Learning reordering examples from bilingual word-aligned corpus Generating features from reordering examples Training a MaxEnt model on the features Decoding CKY algorithm For details, see Xiong et al., ACL2006
11
Confucius An extended phrase-based system Log-linear model Monotone decoding We try a phrase-based similarity model, in which a translation for a certain source phrase can be applied for other similar phrases
12
Phrase-based Similarity Model 全省出口总值的 25.5% Find the most similar phrase pair 全市出口总值的半数 halfoftheentirecity'sexportvolume
13
Phrase-based Similarity Model 全省出口总值的 25.5% Compare 全市出口总值的半数 halfoftheentirecity'sexportvolume
14
Phrase-based Similarity Model 全省出口总值的 25.5% Replace 全省出口总值的 25.5% oftheentire province'sexportvolume
15
Lynx A linguistically syntax-based system Based on tree-to-string alignment template (TAT), which map the source language tree to target language string Log-linear Model
16
Translation Process: Parsing 中国 的 经济 发展 NP DNPNP DEGNN NR 中国 的经济发展 parsing
17
Translation Process: Detachment NP DNPNP DEGNN NR 中国 的经济发展 NP DNPNP DEG 的 NP NR 中国 NP NN 经济发展 detachment NP
18
Translation Process : Production NP DNPNP DEG 的 NP NR 中国 NP NN 经济发展 of China economicdevelopment
19
Translation Process : Combination NP DNPNP DEG 的 NP NR 中国 NP NN 经济发展 of China economicdevelopment economic development of China combination NP
20
Training and Decoding Training Extract TATs from word-aligned, source side parsed bilingual corpus using bottom-up strategy Impose several restrictions to decrease the magnitude Decoding bottom-up beam search For details, see Liu et al., ACL2006
21
Outline Overview MT Systems Bruin Lynx Confucius Official Evaluation Discussion Summary
22
Toolkits Used Word alignment GIZA++ plus “grow-diag-final” refinement method Language model SRILM Chinese parser Deyi Xiong’s A lexicalized PCFG model trained on PennTree bank Chinese word segmentation ICTCLAS
23
Preprocessing and Postprocessing Preprocessing Chinese word segmentation Rule-based translations of numbers, dates and Chinese names Chinese sentences Parsing (for Lynx only) Postprocessing Remove unknown words Capitalize the first word of each sentence
24
Training data Training Data List
25
Development and test set Corpus statistics of the IWSLT 2006 and 2007 development and test set
26
Results on IWSLT 2006 development set and test set Small data: The training data released by the IWSLT 2007 Large data: All the training data
27
Results on IWSLT 2007 test set
28
Outline Overview MT Systems Bruin Lynx Confucius Official Evaluation Discussion Summary
29
Discussion Lynx(0.1777) Training Corpus: Training data: –About 39k sentence pairs dialogs data –Provided by IWSLT 2007 –About 5M sentence pairs newswire data –Released by LDC Domain is quite different –Newswire vs. Dialogs Newswire data is too large
30
Discussion Lynx (0.1777) Parser : Trained on Penn Chinese Treebank Domain is quite different too –Newswire vs. Dialogs Parsing error (low performance of parser) Lynx decoder –Only depends on the 1-best parsing tree
31
Discussion Models: Bruin (0.3750) MaxEnt based reordering model Long distance word reordering Confucius (0.2802) Monotone search 2007 test set (2006 test set) 6.7words/sent (12.7words/sent) –Bruin will do better Punctuation marks (no ) –More positive reordering information –Bruin will do better
32
Discussion Models: Bruin (0.3750) MaxEnt based reordering model Long distance word reordering Confucius (0.2802) Monotone search 2007 test set (2006 test set) 6.7words/sent (12.7words/sent) –Bruin will do better Punctuation marks (no ) –More positive reordering information –Bruin will do better
33
Discussion Models: Bruin (0.3750) MaxEnt based reordering model Long distance word reordering Confucius (0.2802) Monotone search 2007 test set (2006 test set) 6.7words/sent (12.7words/sent) –Bruin will do better Punctuation marks (no ) –More positive reordering information –Bruin will do better 2006 tst2007 tst Bruin0.22830.3750 Confucius0.20420.2802
34
Discussion Models: Bruin (0.3750) MaxEnt based reordering model Long distance word reordering Confucius (0.2802) Monotone search 2007 test set (2006 test set) 6.7words/sent (12.7words/sent) –Bruin will do better Punctuation marks (no ) –More positive reordering information –Bruin will do better 2006 tst2007 tst Bruin0.22830.3750 Confucius0.20420.2802
35
Discussion Models: Bruin (0.3750) MaxEnt based reordering model Long distance word reordering Confucius (0.2802) Monotone search 2007 test set (2006 test set) 6.7words/sent (12.7words/sent) –Bruin will do better Punctuation marks (no ) –More positive reordering information –Bruin will do better 2006 tst2007 tst Bruin0.22830.3750 Confucius0.20420.2802
36
Results on IWSLT 2007 test set
37
Outline Overview Systems Bruin Lynx Confucius Official Evaluation Discussion Summary
38
MT 3 systems based on different translation models: MaxEnt BTG Model TAT model Phrase-based Similarity Model Future Work More new model System combination
39
References Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to- String Alignment Template for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 609-616, Sydney, Australia, July. Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 521-528, Sydney, Australia, July.
40
Thanks!
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.