Published by Arline Alice Goodwin. Modified over 9 years ago.
1
NRC Report Conclusion Tu Zhaopeng 2009-09-08
2
NIST06 The Portage System
Chinese large-track entry used a simple but carefully tuned phrase-based system:
- Pre-process the source text
- Viterbi decoding using a loglinear model
- N-best rescoring using a fancier loglinear model
- Post-process the raw translation
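As a rough illustration of what "loglinear model" means in both the decoding and rescoring steps, the sketch below scores each hypothesis as a weighted sum of feature values and keeps the highest-scoring one. The feature names, weights, and hypotheses are invented for the example and are not Portage's actual feature set.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, Dict[str, float]]


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of (log-domain) feature values for one hypothesis."""
    return sum(weights[name] * value for name, value in features.items())


def pick_best(hyps: List[Hypothesis], weights: Dict[str, float]) -> str:
    """Return the translation whose loglinear score is highest."""
    return max(hyps, key=lambda h: loglinear_score(h[1], weights))[0]


# Hypothetical weights and feature values, for illustration only.
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.2}
hyps = [
    ("the cat sat", {"lm": -4.0, "tm": -2.0, "word_penalty": 3.0}),
    ("cat the sat", {"lm": -9.0, "tm": -2.0, "word_penalty": 3.0}),
]
best = pick_best(hyps, weights)
```

Rescoring uses the same scoring form, only with more features and weights tuned separately on the n-best lists.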
3
NIST06 Pre-processing:
- Convert to GB2312, removing traditional characters with no GB2312 representation
- Segment using the LDC segmenter
- Translate numbers and dates using rules
- Strip non-ASCII OOVs
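The last step can be pictured with a small sketch. It assumes a known source vocabulary (here a hypothetical `vocab` set) and assumes that ASCII out-of-vocabulary tokens such as Latin-script names or digits are kept so they can pass through to the output; the slide does not say how Portage actually implements this.

```python
from typing import Iterable, List, Set


def strip_non_ascii_oovs(tokens: Iterable[str], vocab: Set[str]) -> List[str]:
    """Drop tokens that are out of vocabulary AND contain non-ASCII characters.

    In-vocabulary tokens are always kept; ASCII OOVs are kept as pass-through.
    """
    return [t for t in tokens if t in vocab or t.isascii()]


# Hypothetical vocabulary and segmented sentence, for illustration only.
vocab = {"我", "喜欢"}
tokens = ["我", "喜欢", "IBM", "龘", "2006"]
kept = strip_non_ascii_oovs(tokens, vocab)
```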
4
NIST06 Post-processing:
- Truecase using a 4-gram HMM (via SRILM disambig) trained on the parallel corpus
- Detokenization heuristics
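The slide does not list the detokenization heuristics themselves. A minimal sketch of the kind of rules such a step typically applies to English output, with rules chosen here purely for illustration, might look like this:

```python
import re
from typing import List


def detokenize(tokens: List[str]) -> str:
    """Join tokens and undo common tokenizer spacing (illustrative rules only)."""
    text = " ".join(tokens)
    # Remove the space before closing punctuation and closing brackets.
    text = re.sub(r"\s+([.,;:!?%)\]])", r"\1", text)
    # Remove the space after opening brackets and currency signs.
    text = re.sub(r"([(\[$])\s+", r"\1", text)
    # Reattach English clitics such as 's and n't to the preceding word.
    text = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    return text


example = detokenize(["He", "said", ",", "(", "it", "'s", "fine", ")", "."])
```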
5
NIST06 Rescoring
Rescoring is based on 5k-best lists, using Powell's algorithm to find max-BLEU weights.
Features (22):
- All 12 decoder features
- Character length
- IBM2 scores in both directions
- IBM1-based "missing word" feature (compare the score of the best translation for each word to the best known)
- Posterior probabilities calculated from the n-best list for: sentence length, phrases, words, unigrams, and bigrams
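To make the n-best posterior features more concrete, here is a sketch for the simplest case, sentence length. It assumes each n-best entry carries a log-domain model score and normalizes the exponentiated scores over the list; the `scale` parameter and the data are assumptions for the example, not values from the slides.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def length_posteriors(
    nbest: List[Tuple[List[str], float]], scale: float = 1.0
) -> Dict[int, float]:
    """Posterior probability of each sentence length, estimated from an n-best list.

    nbest holds (tokens, log_score) pairs. Scores are shifted by the maximum
    before exponentiation for numerical stability.
    """
    top = max(score for _, score in nbest)
    unnorm = [math.exp(scale * (score - top)) for _, score in nbest]
    total = sum(unnorm)
    posterior: Dict[int, float] = defaultdict(float)
    for (tokens, _), weight in zip(nbest, unnorm):
        posterior[len(tokens)] += weight / total
    return dict(posterior)


# Three equally scored hypotheses: two of length 2, one of length 3.
nbest = [(["a", "b"], 0.0), (["c", "d"], 0.0), (["e", "f", "g"], 0.0)]
post = length_posteriors(nbest)
```

The feature value for a given hypothesis would then be the posterior of its own length. Posteriors for phrases, words, unigrams, and bigrams follow the same pattern with a different key.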
6
NIST06 Search Parameters
7
NIST08 Towards Tighter Integration of Rule-based and Statistical MT in Serial System Combination
- Rule-based: Systran
- Phrase-based: Portage
8
NIST08 Annotation of Systran output with five different chunk types:
- named entities, numbers, dates
- unknown words or unlikely sequences of short words
- 'strong' rules: very reliable chunks, e.g., rules based on a long-distance syntactic relationship, or a long multiword expression
9
NIST09 Serial system combination
10
NIST09 NRC system trained on SY/EN parallel corpus:
- use SYSTRAN to translate the ZH half of the parallel ZH/EN training corpus, discarding UN, HKH/L corpora for efficiency → 3M sentence pairs
- preprocess SY: strip markup, tokenize, lowercase
- standard phrase-based training
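The SY preprocessing step (strip markup, tokenize, lowercase) could be sketched as below. The slides do not describe SYSTRAN's markup format, so this example assumes simple angle-bracket tags and a basic word/punctuation tokenizer.

```python
import re


def preprocess_sy(line: str) -> str:
    """Strip tag-style markup, split punctuation from words, and lowercase."""
    # Assumption: markup appears as <...> tags; replace each with a space.
    no_markup = re.sub(r"<[^>]+>", " ", line)
    # Words become one token each; every other non-space character is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", no_markup)
    return " ".join(tokens).lower()


cleaned = preprocess_sy("<b>Hello</b>, World!")
```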
11
NIST09 Two strategies that didn't work:
- Exploit SY/EN surface similarity: boost HMM ttable scores of similar forms, prior to phrase extraction → no improvement
- Use SY case information: adopt SY case for aligned EN words; no improvement compared to the baseline independent truecaser
12
NIST09 Common features:
- phrase table based on symmetrized HMM word alignments (4 features: lex+rf, fwd+bkw)
- 5g mixture LM from parallel corpus (Foster & Kuhn, WMT07)
- 6g LM from GW
- word count and distortion
13
NIST09
14
Useful:
- rescoring with IBM- and nbest-based features (Ueffing and Ney, CL07; Chen et al., IWSLT05): +0.3 BLEU
- greedy feature pruning for rescoring: +0.3 BLEU
- truecasing with "title trick": +0.3 BLEU
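The slides do not say how the greedy feature pruning was carried out. One plausible reading is backward elimination: repeatedly drop any feature whose removal does not hurt the dev-set score. The sketch below assumes a hypothetical `evaluate` callback that would re-tune weights and return dev BLEU for a feature subset; here it is replaced by a toy function.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_prune(
    features: Sequence[str], evaluate: Callable[[List[str]], float]
) -> Tuple[List[str], float]:
    """Backward elimination: drop features while the score does not decrease."""
    current = list(features)
    best = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for feat in list(current):
            trial = [g for g in current if g != feat]
            score = evaluate(trial)
            # Ties favor the smaller feature set.
            if score >= best:
                current, best, improved = trial, score, True
                break
    return current, best


# Toy stand-in for "re-tune and measure BLEU": each feature adds a fixed amount.
contribution = {"lm": 0.2, "tm": 0.3, "noisy": -0.1}


def toy_evaluate(subset: List[str]) -> float:
    return sum(contribution[f] for f in subset)


kept, score = greedy_prune(["lm", "tm", "noisy"], toy_evaluate)
```

In practice each call to `evaluate` is expensive, since it implies re-optimizing the rescoring weights, which is why a greedy search is attractive compared with testing every subset.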