
1 The ICT Statistical Machine Translation Systems for IWSLT 2007 Zhongjun He, Haitao Mi, Yang Liu, Deyi Xiong, Weihua Luo, Yun Huang, Zhixiang Ren, Yajuan Lu, Qun Liu Institute of Computing Technology, Chinese Academy of Sciences 2007.09.15–2007.08.16

2 Outline Overview MT Systems Bruin Confucius Lynx Official Evaluation Discussion Summary

3 Introduction of Our Group Multilingual Interaction Technology Laboratory, Institute of Computing Technology, Chinese Academy of Sciences Long history of working on MT: rule-based, example-based Focus on SMT since 2004 Website: http://mtgroup.ict.ac.cn/

4 People Working on SMT at ICT Staff: Qun Liu (Researcher), Yajuan Lu (Associate Researcher), Yang Liu (Associate Researcher), Weihua Luo (Assistant Researcher) PhD Students: Zhongjun He, Haitao Mi, Jinsong Su, Yang Feng Master Students: Yun Huang, Wenbin Jiang, Zhixiang Ren, …

5 IWSLT 2007 Evaluation Chinese-English transcript translation task

6 Systems for IWSLT 2007 Evaluation MT Systems: Bruin (formally syntax-based) Confucius (extended phrase-based) Lynx (linguistically syntax-based)

7 Outline Overview MT Systems Bruin Confucius Lynx Official Evaluation Discussion Summary

8 Bruin Bruin is a formally syntax-based system MaxEnt reordering model built on BTG rules: reordering is treated as a binary classification problem (straight vs. inverted); a MaxEnt-based classifier makes the decision; boundary words, rather than whole phrases, serve as classifier features [Figure: straight and inverted orientations of adjacent source/target blocks]
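The binary-classification setup on this slide can be sketched as follows. This is a hedged illustration: the real system trains a maximum-entropy model, while a simple perceptron stands in here so the sketch stays self-contained, and all function names are hypothetical.

```python
def boundary_features(src_left, src_right, tgt_left, tgt_right):
    """Boundary words of two neighboring blocks, used instead of whole
    phrases as classifier features (lexical + collocation features).
    Here the first word of each block stands in for its boundary word."""
    return {
        "C1=" + src_left[0],                         # source boundary, left block
        "C2=" + src_right[0],                        # source boundary, right block
        "E1=" + tgt_left[0],                         # target boundary, left block
        "E2=" + tgt_right[0],                        # target boundary, right block
        "C1C2=" + src_left[0] + "_" + src_right[0],  # collocation feature
        "E1E2=" + tgt_left[0] + "_" + tgt_right[0],  # collocation feature
    }

def train_perceptron(examples, epochs=5):
    """examples: list of (feature_set, label), label +1 = straight, -1 = inverted.
    A perceptron stand-in for the MaxEnt trainer of the real system."""
    w = {}
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(w.get(f, 0.0) for f in feats)
            if score * label <= 0:          # misclassified: update weights
                for f in feats:
                    w[f] = w.get(f, 0.0) + label
    return w

def predict_order(w, feats):
    """Classify a block pair as straight or inverted."""
    return "straight" if sum(w.get(f, 0.0) for f in feats) >= 0 else "inverted"
```

Because only boundary words are used, the feature space stays small and the classifier generalizes across phrases that share edges.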

9 Features Source and target boundary words (lexical features) Combinations of boundary words (collocation features) [Figure: source boundary head words C1, C2 and target boundary head words E1, E2 of two neighboring blocks b1, b2]

10 Training and Decoding Training the model: learn reordering examples from a word-aligned bilingual corpus; generate features from the reordering examples; train a MaxEnt model on the features Decoding: CKY algorithm For details, see Xiong et al., ACL 2006
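The first training step, learning reordering examples from a word-aligned corpus, can be sketched roughly as below. This is a simplified illustration, not the exact extraction procedure of Xiong et al.; all names are hypothetical, and the alignment is a list of (source index, target index) links.

```python
def target_span(alignment, src_lo, src_hi):
    """Span of target positions aligned to source words in [src_lo, src_hi)."""
    pts = [t for s, t in alignment if src_lo <= s < src_hi]
    return (min(pts), max(pts)) if pts else None

def reordering_label(alignment, split, src_len):
    """Label one candidate split of the source sentence as a reordering example:
    'straight' if the left block's target span precedes the right block's,
    'inverted' if it follows, None if the spans overlap (no clean example)."""
    left = target_span(alignment, 0, split)
    right = target_span(alignment, split, src_len)
    if left is None or right is None:
        return None
    if left[1] < right[0]:
        return "straight"
    if right[1] < left[0]:
        return "inverted"
    return None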

11 Confucius An extended phrase-based system Log-linear model Monotone decoding We try a phrase-based similarity model, in which the translation of a source phrase can be adapted to other, similar phrases

12 Phrase-based Similarity Model For the source phrase 全省出口总值的 25.5%, find the most similar known phrase pair: 全市出口总值的半数 → "half of the entire city's export volume"

13 Phrase-based Similarity Model Compare the two source phrases: 全省出口总值的 25.5% vs. 全市出口总值的半数

14 Phrase-based Similarity Model Replace the differing parts to obtain the translation of 全省出口总值的 25.5%: "25.5% of the entire province's export volume"
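The find/compare/replace procedure on these slides might be sketched as follows, using word-level edit distance as a stand-in similarity measure (the similarity measure actually used by Confucius is not specified here, and all names are hypothetical):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def most_similar(src_phrase, phrase_table):
    """Find the (source, target) pair whose source side is closest
    to the query phrase."""
    return min(phrase_table, key=lambda pair: edit_distance(src_phrase, pair[0]))
```

For the slides' example, the query 全省出口总值的 25.5% retrieves 全市出口总值的半数 → "half of the entire city's export volume"; the differing pieces (全市 vs. 全省, 半数 vs. 25.5%) are then swapped for their own translations, yielding "25.5% of the entire province's export volume".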

15 Lynx A linguistically syntax-based system Based on tree-to-string alignment templates (TATs), which map a source-language parse tree to a target-language string Log-linear model

16 Translation Process: Parsing The source phrase 中国 的 经济 发展 is parsed into a tree: [Parse tree: NP → DNP(NP(NR 中国) DEG 的) NP(NN 经济 NN 发展)]

17 Translation Process: Detachment The tree is detached at the top NP node into subtrees: DNP(NP(NR 中国) DEG 的) and NP(NN 经济 NN 发展)

18 Translation Process: Production Each subtree is translated by a matching TAT: DNP(NP(NR 中国) DEG 的) → "of China"; NP(NN 经济 NN 发展) → "economic development"

19 Translation Process: Combination The partial translations are combined at the NP node: "economic development" + "of China" → "economic development of China"
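The four steps (parsing, detachment, production, combination) amount to a recursive tree-to-string traversal, which can be sketched as below. The tree encoding and template format are deliberate simplifications of the real TAT formalism of Liu et al. (ACL 2006), and all names are hypothetical.

```python
def translate(tree, templates, lexicon):
    """Recursively translate a parse tree with tree-to-string templates.
    A tree node is (label, children) or (label, leaf_word).
    A template maps (root label, child labels) to a target-side pattern:
    integers pick (and reorder) child slots, strings insert target words."""
    label, body = tree
    if isinstance(body, str):                        # leaf: look up the word
        return lexicon.get(body, body)
    key = (label, tuple(child[0] for child in body))
    pattern = templates.get(key, list(range(len(body))))  # default: monotone
    out = []
    for item in pattern:
        if isinstance(item, int):
            out.append(translate(body[item], templates, lexicon))
        else:
            out.append(item)
    return " ".join(out)

# Example from the slides: 中国 的 经济 发展 → "economic development of China"
TREE = ("NP", [
    ("DNP", [("NP", [("NR", "中国")]), ("DEG", "的")]),
    ("NP", [("NN", "经济发展")]),
])
TEMPLATES = {
    ("NP", ("DNP", "NP")): [1, 0],      # inverted combination at the top NP
    ("DNP", ("NP", "DEG")): ["of", 0],  # 的 becomes "of", placed before its NP
}
LEXICON = {"中国": "China", "经济发展": "economic development"}
```

Running `translate(TREE, TEMPLATES, LEXICON)` reproduces the slides' combination step, producing "economic development of China".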

20 Training and Decoding Training: extract TATs from a word-aligned, source-side-parsed bilingual corpus using a bottom-up strategy; impose several restrictions to decrease the magnitude of the extracted set Decoding: bottom-up beam search For details, see Liu et al., ACL 2006

21 Outline Overview MT Systems Bruin Confucius Lynx Official Evaluation Discussion Summary

22 Toolkits Used Word alignment: GIZA++ plus the "grow-diag-final" refinement method Language model: SRILM Chinese parser: Deyi Xiong's parser, a lexicalized PCFG model trained on the Penn Chinese Treebank Chinese word segmentation: ICTCLAS
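As a minimal illustration of what the SRILM language model contributes, here is a toy trigram model using raw relative frequencies; real SRILM models add smoothing and backoff, which are replaced here by a crude probability floor, and all names are hypothetical.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Count trigrams and their history bigrams, with sentence-boundary
    padding (roughly what ngram counting does, minus smoothing)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        toks = ["<s>", "<s>"] + s + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def score(sentence, tri, bi, floor=1e-6):
    """Product of trigram relative frequencies; unseen trigrams get a
    small floor probability instead of real backoff."""
    toks = ["<s>", "<s>"] + sentence + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        h, t = tuple(toks[i - 2:i]), tuple(toks[i - 2:i + 1])
        p *= tri[t] / bi[h] if bi[h] and tri[t] else floor
    return p
```

In the log-linear systems above, the (log of the) LM score enters as one feature alongside the translation-model features.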

23 Preprocessing and Postprocessing Preprocessing: Chinese word segmentation Rule-based translation of numbers, dates and Chinese names Parsing of Chinese sentences (for Lynx only) Postprocessing: Remove unknown words Capitalize the first word of each sentence
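The two postprocessing steps can be sketched as follows. This is a simplified illustration: "unknown words" here means source tokens left untranslated in the output, approximated by a target-vocabulary membership check, and the function name is hypothetical.

```python
def postprocess(tokens, target_vocab):
    """Drop untranslated (unknown) tokens, then capitalize the first word,
    mirroring the two postprocessing steps listed on the slide."""
    kept = [t for t in tokens if t in target_vocab]
    if kept:
        kept[0] = kept[0].capitalize()
    return " ".join(kept)
```

For example, an output like "this is 中国 good" with the Chinese token left untranslated becomes "This is good".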

24 Training Data [Table: training data list]

25 Development and Test Sets Corpus statistics of the IWSLT 2006 and 2007 development and test sets

26 Results on IWSLT 2006 development set and test set Small data: the training data released by IWSLT 2007 Large data: all the training data

27 Results on IWSLT 2007 test set

28 Outline Overview MT Systems Bruin Lynx Confucius Official Evaluation Discussion Summary

29 Discussion Lynx (0.1777) Training corpus: about 39k sentence pairs of dialog data provided by IWSLT 2007, plus about 5M sentence pairs of newswire data released by LDC The domains are quite different (newswire vs. dialogs), and the newswire data is disproportionately large

30 Discussion Lynx (0.1777) Parser: trained on the Penn Chinese Treebank, so its domain is also quite different (newswire vs. dialogs), leading to parsing errors (low parser performance) The Lynx decoder depends only on the 1-best parse tree

31 Discussion Models: Bruin (0.3750): MaxEnt-based reordering model, capable of long-distance word reordering Confucius (0.2802): monotone search The 2007 test set has shorter sentences (6.7 words/sentence vs. 12.7 on the 2006 test set) and no punctuation marks; punctuation provides positive reordering information, so in both respects Bruin is expected to do better

            2006 test   2007 test
Bruin       0.2283      0.3750
Confucius   0.2042      0.2802

36 Results on IWSLT 2007 test set

37 Outline Overview MT Systems Bruin Confucius Lynx Official Evaluation Discussion Summary

38 Summary Three MT systems based on different translation models: MaxEnt BTG model (Bruin), TAT model (Lynx), phrase-based similarity model (Confucius) Future work: more new models; system combination

39 References Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-String Alignment Template for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 609–616, Sydney, Australia, July. Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 521–528, Sydney, Australia, July.

40 Thanks!

