NRC Report Conclusion Tu Zhaopeng 2009-09-08

NIST06  The Portage System  For Chinese large-track entry, used simple, but carefully- tuned, phrase-based system:  Pre-process source text  Viterbi decoding using loglinear model  Nbest rescoring using fancier loglinear model  Post-process raw translation

NIST06  Pre-processing:  Convert to GB2312, removing traditional characters with no GB2312 representation  Segment using LDC segmenter  Translate numbers and dates using rules  Strip non-ASCII OOV’s

NIST06  Post-processing  Truecase using 4-gram HMM (via SRILM disambig) trained on parallel corpus  Detokenization heuristics

NIST06  Rescoring  Rescoring based on 5k-best lists, using Powell’s algorithm to find max-BLEU weights  Features (22)  All 12 decoder features  Character length  IBM2 scores in both directions  IBM1-based “missing word” feature (compare score of best translation for each word to best known)  Posterior probabilities calculated from nbest list for: sentence length, phrases, words, unigrams, and bigrams.

NIST06  Search Parameters

NIST08  Towards Tighter Integration of Rule-based and Statistical MT in Serial System Combination  Rule-based  Systran  Phrase-based  Portage

NIST08  Annotation of Systran output, five different chunk types:  named entities, numbers, dates  unknown words or unlikely sequences of short words  ‘strong’ rules : very reliable chunks, e.g., rules based on a long distance syntactic relationship, or a long multiword expression

NIST09  Serial system combination

NIST09  NRC system trained on SY/EN parallel corpus:  use SYSTRAN to translate ZH half of parallel ZH/EN training corpus, discarding UN, HKH/L corpora for eciency ! 3M sentence pairs  preprocess SY: strip markup, tokenize, lowercase  standard phrase-based training

NIST09  Two strategies that didn't work:  Exploit SY/EN surface similarity: boost HMM ttable scores of similar forms, prior to phrase extraction ! no improvement  Use SY case information: adopt SY case for aligned EN words|no improvement compared to baseline independent truecaser

NIST09  Common features:  phrase table based on symmetrized HMM word alignments (4 features: lex+rf, fwd+bkw)  5g mixture LM from parallel corpus (Foster & Kuhn, WMT07)  6g LM from GW  word count and distortion

NIST09
• Useful
  • rescoring with IBM- and nbest-based features (Ueffing and Ney, CL07; Chen et al, IWSLT05): +0.3 BLEU
  • greedy feature pruning for rescoring: +0.3 BLEU
  • truecasing with "title trick": +0.3 BLEU
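The slides do not say how the greedy feature pruning was carried out. One common reading is backward elimination: repeatedly drop the feature whose removal gives the best development-set score, and stop when every removal hurts. The sketch below assumes that reading; `evaluate` is a hypothetical caller-supplied function (for example, one that re-tunes the weights on the given feature subset and returns dev BLEU), replaced here by a toy scoring function.

```python
from typing import Callable, FrozenSet, Iterable


def greedy_prune(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Backward elimination: drop features while the score does not decrease."""
    current = frozenset(features)
    best = evaluate(current)
    while len(current) > 1:
        # Score every subset obtained by removing exactly one feature.
        candidates = [(evaluate(current - {f}), f) for f in sorted(current)]
        score, drop = max(candidates)
        if score < best:
            break  # every single removal hurts: stop pruning
        current, best = current - {drop}, score
    return current


# Toy stand-in for dev-set BLEU: useful features help, others slightly hurt.
USEFUL = {"lm", "tm"}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    return len(subset & USEFUL) - 0.01 * len(subset - USEFUL)


print(sorted(greedy_prune(["lm", "tm", "noise1", "noise2"], toy_evaluate)))
```

Ties are resolved in favour of removal here, which prefers the smaller feature set when scores are equal; that choice is an assumption of this sketch, not something stated in the slides.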
