Saab Mansour and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department RWTH Aachen University, Aachen, Germany NAACL-HLT.

1 Saab Mansour and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department RWTH Aachen University, Aachen, Germany NAACL-HLT 2013

2 Introduction
• Domain adaptation uses data from a particular domain to improve the performance of a translation model (TM) on the test domain.
• TM adaptation: build a general-domain phrase table, then use in-domain data to modify its phrase probabilities.

3 Introduction
• The corpora are the Arabic-to-English and German-to-English TED (Technology, Entertainment, Design) tasks of IWSLT (International Workshop on Spoken Language Translation).

4 Phrase Training
• Forced alignment (FA) is used to perform phrase segmentation, alignment training, and probability estimation.
• Phrase training with SMT: for a training set y, generate a heuristic-based phrase table P_y^0. FA training then segments and aligns the training sentences, and the phrase probabilities are re-estimated from that output to produce a new phrase table p'.
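The re-estimation step can be illustrated with a minimal sketch. It assumes the FA output is simply a list of (source, target) phrase pairs and that probabilities are re-estimated by relative frequency; the slide does not state the estimator, and the function name and toy data are hypothetical.

```python
from collections import Counter

def reestimate_phrase_table(aligned_phrase_pairs):
    """Relative-frequency estimate p(target | source) from the phrase pairs
    chosen during forced alignment (assumed estimator, not from the slides)."""
    pair_counts = Counter(aligned_phrase_pairs)
    src_counts = Counter(src for src, _ in aligned_phrase_pairs)
    return {
        (src, tgt): count / src_counts[src]
        for (src, tgt), count in pair_counts.items()
    }

# Hypothetical FA output: one (source, target) entry per phrase pair used
# in the forced alignments of the training sentences.
fa_output = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the home"),
    ("ist", "is"),
]
table = reestimate_phrase_table(fa_output)
print(table)
```

In this toy data "das haus" is aligned three times, so its two translations share the probability mass in proportion to how often FA selected each of them.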

5 Adaptation
• For a training set y', generate an initial phrase table P_{y'}^0, then run FA training on y_in (the in-domain training data). This biases the probabilities toward the in-domain data. The procedure is denoted X-FA-IN.
• Leaving-one-out is used to avoid over-fitting.
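A rough sketch of the leaving-one-out idea follows, under the assumption that it works by subtracting the counts contributed by the current sentence before that sentence is scored. The slides give no formula, so the function name, the count layout, and the `floor` value for phrase pairs seen only in the current sentence are all hypothetical.

```python
from collections import Counter

def leave_one_out_prob(pair, global_pair_counts, global_src_counts,
                       sent_pair_counts, floor=1e-6):
    """p(target | source) with the current sentence's own counts removed.
    `floor` is an assumed fallback for pairs that occur only in this sentence."""
    src, _ = pair
    # Total count that the current sentence contributes for this source phrase.
    sent_src = sum(c for (s, _), c in sent_pair_counts.items() if s == src)
    numerator = global_pair_counts[pair] - sent_pair_counts[pair]
    denominator = global_src_counts[src] - sent_src
    if numerator <= 0 or denominator <= 0:
        return floor
    return numerator / denominator

# Hypothetical counts over the whole training set.
global_pairs = Counter({("das haus", "the house"): 3,
                        ("das haus", "the home"): 1,
                        ("selten", "rare"): 1})
global_src = Counter({"das haus": 4, "selten": 1})
# Counts extracted from the sentence currently being aligned.
sentence = Counter({("das haus", "the house"): 1, ("selten", "rare"): 1})

p_common = leave_one_out_prob(("das haus", "the house"),
                              global_pairs, global_src, sentence)
p_singleton = leave_one_out_prob(("selten", "rare"),
                                 global_pairs, global_src, sentence)
print(p_common, p_singleton)
```

The singleton pair falls back to the floor value, which is what discourages the alignment from memorizing phrase pairs that only the sentence itself supports.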

6 Experimental Setup
• Training corpora:
  • Arabic-to-English:
    • In-domain: 90K TED sentences
    • Other-domain: 7.9M sentences of United Nations data
  • German-to-English:
    • In-domain: 130K TED sentences
    • Other-domain: 2.1M sentences from the news-commentary and europarl corpora

8 Experimental Setup – Translation System
• Baseline system: built using the SMT toolkit Jane 2.0
• Measures: BLEU, TER
• Arabic-English results are case sensitive; German-English results are case insensitive.

9 Results
• Heuristics: IN, OD, ALL
  • Standard phrase extraction: word-alignment training followed by heuristic phrase extraction over the word alignment.
• FA standard: IN-FA, OD-FA, ALL-FA
  • Standard FA phrase training, where the same training set is used both for initial phrase table generation and for the FA procedure.
• FA adaptation: OD-FA0-IN, ALL-FA-IN
  • FA-based adaptation phrase training, where the initial table is generated from some general data and FA training is performed on the IN data to achieve adaptation.

10 Results - measures
• BLEU (Bilingual Evaluation Understudy)
• Candidate: the the the the the the the.
• Reference 1: The cat is on the mat.
• Reference 2: There is a cat on the mat.
• Standard unigram precision: 7/7
• Modified unigram precision: 2/7
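The two precision figures in this example can be reproduced with a short sketch. It assumes lowercasing and a simple letters-only tokenizer, and it covers only the unigram precisions, not full BLEU (no higher-order n-grams, no brevity penalty); the function names are illustrative.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase, keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def unigram_precisions(candidate, references):
    """Return (matched, clipped, total) unigram counts for the candidate."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = [Counter(tokenize(r)) for r in references]
    total = sum(cand_counts.values())
    # Standard precision: every candidate word found in any reference counts.
    matched = sum(c for w, c in cand_counts.items()
                  if any(w in rc for rc in ref_counts))
    # Modified precision: clip each word's count by its maximum count
    # in any single reference.
    clipped = sum(min(c, max(rc[w] for rc in ref_counts))
                  for w, c in cand_counts.items())
    return matched, clipped, total

matched, clipped, total = unigram_precisions(
    "the the the the the the the.",
    ["The cat is on the mat.", "There is a cat on the mat."],
)
print(f"standard: {matched}/{total}, modified: {clipped}/{total}")
```

Clipping is what brings the score down: "the" occurs at most twice in a single reference, so only two of the seven candidate tokens are credited.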

11 Results - measures
• TER (Translation Edit Rate)
• REF: SAUDI ARABIA denied THIS WEEK information published in the AMERICAN new york times
• HYP: THIS WEEK THE SAUDIS denied information published in the new york times
• TER = 4/13: 4 edits (1 shift, 2 substitutions, and 1 insertion) over 13 reference words
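The arithmetic behind the score is sketched below. The sketch takes the edit counts from the slide as given and only divides them by the reference length; it does not search for the minimal edit sequence (including shifts) as a real TER implementation would, and the function name is illustrative.

```python
def ter(num_edits, ref_length):
    """Translation edit rate: number of edits divided by reference length."""
    return num_edits / ref_length

ref = ("SAUDI ARABIA denied THIS WEEK information published "
       "in the AMERICAN new york times")
# Edit counts as listed on the slide, not computed here.
edits = {"shift": 1, "substitution": 2, "insertion": 1}

score = ter(sum(edits.values()), len(ref.split()))
print(f"TER = {sum(edits.values())}/{len(ref.split())} = {score:.3f}")
```

Lower is better for TER, which is why the conclusion reports reductions in TER alongside gains in BLEU.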

13 Mixture Modeling
• Linear interpolation of IN with OD, and of IN with OD-FA0-IN; the interpolation weights are uniform (0.5).
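A minimal sketch of uniform linear interpolation between two phrase tables follows. It assumes each table is a dictionary from (source, target) pairs to probabilities and that a pair missing from one table contributes probability 0 there; the slide does not say how missing pairs are handled, and the table contents are made up.

```python
def interpolate(table_a, table_b, weight=0.5):
    """p_mix = weight * p_a + (1 - weight) * p_b for every phrase pair.
    Pairs absent from one table are treated as probability 0 (assumption)."""
    pairs = set(table_a) | set(table_b)
    return {
        pair: weight * table_a.get(pair, 0.0)
              + (1 - weight) * table_b.get(pair, 0.0)
        for pair in pairs
    }

# Hypothetical in-domain (IN) and other-domain (OD) phrase tables.
in_table = {("a", "x"): 0.8, ("a", "y"): 0.2}
od_table = {("a", "x"): 0.4, ("a", "z"): 0.6}

mixture = interpolate(in_table, od_table, weight=0.5)
print(mixture)
```

Because both inputs are distributions over translations of the same source phrase, the interpolated scores for that source phrase still sum to 1.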

15 Conclusion
• A phrase training procedure for adaptation using the FA method is proposed.
• Performance improved on both the Arabic-to-English and German-to-English TED lecture translation tasks: BLEU increased by 0.6% on the development set, and TER decreased by 0.8% and 0.6% on the test and eval sets, respectively.
• Finally, a comparison using mixture models shows that the adapted OD table performs better than the unadapted OD table.

