Machine Translation: MIRA and MBR. Stephan Vogel, Spring Semester 2011.

1 Machine Translation: MIRA and MBR (Stephan Vogel, Spring Semester 2011)

2 Overview (11-711 Machine Translation)
- MBR – Minimum Bayes Risk decoding
- MIRA – Margin Infused Relaxed Algorithm

3 Optimization
- We learned about tuning the MT system
- The decoder uses a set of feature functions
- Optimize towards an MT metric by adjusting the feature weights
- Different optimization approaches: Simplex, Powell's method, MERT a la Och
- Problem: these only work well for a small number (< 30) of features

4 MBR – Minimum Bayes Risk

5 MBR for Translation
- Translate source sentence f into target sentence e
- The decoder generates many alternative translations
- Select the least risky translation; the ingredients are the hypothesis space, the evidence space, the loss function, and the (approximated) true distribution
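Written out, the decision rule this slide illustrates is the standard MBR rule (reconstructed in standard notation, since only the equation's labels survive in the transcript):

```latex
\hat{e} = \operatorname*{argmin}_{e \in \mathcal{E}_h} \; \sum_{e' \in \mathcal{E}_e} L(e, e') \, P(e' \mid f)
```

where \(\mathcal{E}_h\) is the hypothesis space, \(\mathcal{E}_e\) the evidence space, \(L\) the loss function, and \(P(e' \mid f)\) the model's approximation of the true distribution.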

6 Hypothesis Space
- Hypothesis space: we can only select a translation that we have generated
- The decoder prunes most of the possible hypotheses
- We typically generate n-best translations from the search graph
- The search graph contains many more paths (the count grows exponentially with the sentence length J and the reordering window k)
- Ueffing et al. 2002, Generation of Word Graphs in Statistical Machine Translation, describe generating an output lattice from the search graph
- Typically an n-best list is used (e.g. in the Moses package)
- Tromble et al. (2008) describe lattice MBR
Note about terminology:
- I used 'translation lattice' for the lattice that includes all phrase translations for the source sentence
- Others use 'translation lattice' for the output word graph

7 The Loss Function
- The loss function gives the 'cost' of generating a wrong translation
- Kumar & Byrne (2004) study different loss functions:
  - Lexical: compare on the word level only, e.g. WER, PER, 1-BLEU
  - Target-language parse tree, e.g. tree edit distance between parse trees
  - Bilingual parse tree: uses information from word strings, alignments, and parse trees in both languages
- Ehling, Zens & Ney (2007) use BLEU
- Any automatic MT evaluation metric, or an appropriate approximation, can be used
- Some metrics, like BLEU, are defined at the test-set level
  - Need a sentence-level approximation
  - May require some 'smoothing', e.g. simple +1 count smoothing in BLEU
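The +1 smoothing mentioned above can be sketched as a sentence-level BLEU function. This is an illustrative sketch, not Moses' actual calculate_score(); the function and variable names are mine:

```cpp
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Sentence-level BLEU with add-one smoothing of the n-gram precisions,
// so a sentence with no 4-gram match still gets a non-zero score.
double smoothedSentenceBleu(const std::vector<std::string>& hyp,
                            const std::vector<std::string>& ref,
                            int maxN = 4) {
  if (hyp.empty()) return 0.0;
  double logPrecSum = 0.0;
  for (int n = 1; n <= maxN; ++n) {
    // Count reference n-grams (clipping: each match consumes one count).
    std::map<std::string, int> refCounts;
    for (size_t i = 0; i + n <= ref.size(); ++i) {
      std::ostringstream g;
      for (int k = 0; k < n; ++k) g << ref[i + k] << ' ';
      ++refCounts[g.str()];
    }
    int match = 0, total = 0;
    for (size_t i = 0; i + n <= hyp.size(); ++i) {
      std::ostringstream g;
      for (int k = 0; k < n; ++k) g << hyp[i + k] << ' ';
      ++total;
      auto it = refCounts.find(g.str());
      if (it != refCounts.end() && it->second > 0) { ++match; --it->second; }
    }
    // +1 smoothing keeps each precision strictly positive.
    logPrecSum += std::log((match + 1.0) / (total + 1.0));
  }
  // Brevity penalty for hypotheses shorter than the reference.
  double bp = hyp.size() >= ref.size()
                  ? 1.0
                  : std::exp(1.0 - static_cast<double>(ref.size()) / hyp.size());
  return bp * std::exp(logPrecSum / maxN);
}
```

An identical hypothesis and reference score 1.0 under this smoothing; completely disjoint sentences score low but above zero, which is the point of the smoothing.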

8 Probability Distribution
- We don't have the true distribution
- Approximate it with the model distribution
- Use a scaling factor α to smooth the distribution: α < 1 flattens, α > 1 sharpens
- α needs to be tuned (simple line search)
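The scaled model distribution can be sketched as a softmax over the model scores with exponent α (a minimal sketch; the function name is mine, and scores are assumed to be log-probabilities from the decoder):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Normalized posterior p_i ∝ exp(alpha * score_i) over a non-empty
// score vector. alpha < 1 flattens the distribution, alpha > 1 sharpens it.
std::vector<double> scaledPosterior(const std::vector<double>& scores,
                                    double alpha) {
  std::vector<double> p(scores.size());
  // Subtract the max score before exponentiating, for numerical stability.
  double mx = scores[0];
  for (double s : scores) mx = std::max(mx, s);
  double z = 0.0;
  for (size_t i = 0; i < scores.size(); ++i) {
    p[i] = std::exp(alpha * (scores[i] - mx));
    z += p[i];
  }
  for (double& v : p) v /= z;
  return p;
}
```

With α = 0 the posterior becomes uniform; increasing α concentrates more mass on the top-scoring hypothesis, which is why a flat distribution (small α) lets MBR pick something other than the MAP hypothesis.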

9 Evidence Space
- Summation over 'all' translations of the source sentence f
- Can sum over more translations than those from which we select:
  - n-best list: e.g. use the top 10k as evidence, but only the top 1k to select the new 1-best
  - or the entire lattice

10 MBR on n-best List

  // Accumulate the marginal over the n-best list
  for (iter = nBestList.begin(); iter != nBestList.end(); ++iter) {
    joint_prob = …;  // model probability of this hypothesis
    marginal += joint_prob;
  }

  /* Main MBR computation done here */
  for (unsigned int i = 0; i < nBestList.GetSize(); i++) {
    weightedLossCumul = 0;
    for (unsigned int j = 0; j < nBestList.GetSize(); j++) {
      if (i != j) {
        bleu = calculate_score(translations, j, i, ngram_stats);
        weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
        weightedLossCumul += weightedLoss;
        if (weightedLossCumul > minMBRLoss)
          break;  // already worse than the current best, stop early
      }
    }
    if (weightedLossCumul < minMBRLoss) {
      minMBRLoss = weightedLossCumul;
      minMBRLossIdx = i;
    }
  }
  /* Return the sentence that minimises Bayes risk under 1 - BLEU loss */
  return translations[minMBRLossIdx];

11 MBR on n-best List (from Moses: scripts/training/mbr/mbr.cpp)

  void process(int sent, const vector & sents) {  // template argument lost in the transcript
    for (int i = 0; i < sents.size(); i++) {
      // Calculate marginal and cache the posteriors
      joint_prob = calculate_probability(sents[i]->features, weights, SCALE);
      marginal += joint_prob;
      …
    }
    …
    /* Main MBR computation done here */
    for (int i = 0; i < sents.size(); i++) {
      weightedLossCumul = 0;
      for (int j = 0; j < sents.size(); j++) {
        if (i != j) {
          bleu = calculate_score(sents, j, i, ngram_stats);
          weightedLoss = (1 - bleu) * (joint_prob_vec[j] / marginal);
          weightedLossCumul += weightedLoss;
          if (weightedLossCumul > minMBRLoss)
            break;
        }
      }
      if (weightedLossCumul < minMBRLoss) {
        minMBRLoss = weightedLossCumul;
        minMBRLossIdx = i;
      }
    }
  }

12 MBR on n-best List
- Runtime is O(N_h * N_e)
  - with N_h the number of (top-ranking) hypotheses considered for selection
  - and N_e the number of hypotheses summed over
- Typically, the full n-best list serves as both hypothesis and evidence space
- Runtime is then quadratic in the n-best list size

13 MBR on Lattice
- A bit more complicated :-)
- Cannot enumerate all paths; need a local loss (gain) function
- Tromble et al. show how this can be done:
  - Assume the gain function can be written as a sum of local gain functions, i.e. gains for individual n-grams
  - Calculate the local gain function in terms of n-gram posteriors
  - This reduces the summation over exponentially many paths to a summation over the number of n-grams, which is polynomial in the worst case
- Different approximations to test-set BLEU
- The new parameters need to be tuned on a (second) development set
- Two-pass decoding:
  - Pass 1: standard decoding with lattice generation
  - Pass 2: MBR decoding over the lattice
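The local gain function of Tromble et al. (2008) is, roughly, linear in n-gram counts (reconstructed from the paper, not from the slide):

```latex
G(e, e') = \theta_0 \, |e'| + \sum_{w} \theta_{w} \, \#_w(e') \, \delta_w(e)
```

Here \(\#_w(e')\) counts occurrences of n-gram \(w\) in hypothesis \(e'\), \(\delta_w(e)\) indicates whether \(w\) occurs in \(e\), and the \(\theta\) are tuned constants. Taking the expectation over the lattice replaces \(\delta_w(e)\) with the n-gram posterior \(p(w \mid F)\), so the decoder maximizes \(\theta_0 |e'| + \sum_w \theta_w \, \#_w(e') \, p(w \mid F)\), which only requires one posterior per n-gram in the lattice.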

14 MBR on Lattice - Results

               Ar-En  Ch-En  En-Ch
  MAP           43.7   27.9   41.4
  N-best MBR    43.9   28.3   42.0
  Lattice MBR   44.9   28.5   42.6

- Reported are BLEU scores on the NIST 2008 test set
- MBR on the lattice outperforms MBR on the n-best list, which outperforms MAP decoding

15 Hypothesis Space and Evidence Space

  Hyp Space   Evid Space   Ar-En  Ch-En  En-Ch
  Lattice     Lattice       44.9   28.5   42.6
  1000-best   Lattice       44.6   28.5   42.6
  Lattice     1000-best     44.1   28.0   42.1
  1000-best   1000-best     44.2   28.1   42.2

- A larger evidence space is more important than a larger hypothesis space
- Notice: this experiment used a different BLEU approximation, giving higher scores

16 Tuning the MBR Decoder
- The tuning parameter α is important
- Flattening the distribution (α < 1) makes it easier to select a hypothesis other than the MAP hypothesis

17 MBR for System Combination
- We have seen system combination based on combined n-best lists
  - e.g. Silja's hypothesis selection system
  - essentially n-best list rescoring on the combined n-best list
- MBR works on n-best translations -> can be used to combine systems
- Example: de Gispert et al. 2008, MBR Combination of Translation Hypotheses from Alternative Morphological Decomposition
  - Preprocessing with the MADA and Sakhr taggers
  - Build 2 translation systems
  - MBR combination

               mt02-mt05  DevTest  mt08
  MADA-based      53.3      52.7   43.7
  +MBR            53.7      53.3   44.0
  SAKHR-based     52.7      52.8   43.3
  +MBR            53.2         -   43.8
  MBR-combi       54.6         -   45.6

  (the DevTest values for the last two rows are missing in the transcript)

18 MBR for System Combination
- In-house experiments combining 200-best lists of 3 decoders
- Results on a newswire test set
- MBR improvements for individual systems: ~0.4-0.5 BLEU, 0.1-0.4 TER
- MBR-combi improvement over the best single system: 0.9 BLEU, 0.6 TER

  System       TER    BLEU
  PSMT1        60.79  28.82
  PSMT2        60.76  27.09
  SAMT         60.27  28.98
  PSMT1-MBR    60.51  29.36
  PSMT2-MBR    60.57  27.61
  SAMT-MBR     60.11  29.40
  MBR-Combi    59.53  30.23

19 MIRA – Margin Infused Relaxed Algorithm

20 MIRA
- Online large-margin discriminative training
  - Online: update after each training example
  - Large margin: move training examples away from the 'gray' area
  - Discriminative: compare against alternatives
    - also means that it is supervised
- Originally described by Crammer and Singer, 2003
- Applied to statistical MT by Watanabe et al., 2007
- Also used by Chiang et al., 2009

21 Training Algorithm
- Generate n-best translations
- Update the oracle list
- Update the feature weights
- Return the averaged feature weights

22 Weight Update
- Slack parameter C >= 0; larger C means larger updates to the weight vector
- L(...) is the loss function, e.g. loss in BLEU
- argmin means that we want to change the weights only as much as needed
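For a single (oracle, hypothesis) pair the update has a closed form: if the score margin w·(φ* − φ') falls short of the loss, move w along φ* − φ' just enough to close the gap, with step size capped by C. This is a one-constraint sketch; the full algorithm (Crammer & Singer 2003; Watanabe et al. 2007) solves a quadratic program over many constraints at once, and the function name here is mine:

```cpp
#include <algorithm>
#include <vector>

// One passive-aggressive MIRA-style update.  oracleFeats/hypFeats are
// the feature vectors of the oracle and a rival hypothesis, loss is
// e.g. their BLEU difference, and C caps the step size.
void miraUpdate(std::vector<double>& w,
                const std::vector<double>& oracleFeats,
                const std::vector<double>& hypFeats,
                double loss, double C) {
  double margin = 0.0, normSq = 0.0;
  std::vector<double> diff(w.size());
  for (size_t i = 0; i < w.size(); ++i) {
    diff[i] = oracleFeats[i] - hypFeats[i];
    margin += w[i] * diff[i];   // current score margin w · (φ* − φ')
    normSq += diff[i] * diff[i];
  }
  // argmin: change the weights only as much as needed, so do nothing
  // if the margin already covers the loss.
  if (margin >= loss || normSq == 0.0) return;
  double eta = std::min(C, (loss - margin) / normSq);
  for (size_t i = 0; i < w.size(); ++i) w[i] += eta * diff[i];
}
```

After one update the margin exactly covers the loss (unless C clipped the step), so applying the same update again is a no-op, which is the "only as much as needed" behavior the slide describes.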

23 MIRA
[Figure: hypotheses plotted by model score vs. metric score, with the margin and loss indicated]

24 MIRA
[Figure: hypotheses plotted by model score vs. metric score]

25 Results (Chiang 2009)
- GALE 2008 Chinese-English

  System  Training  Features  BLEU
  Hiero   MERT            11  36.1
  Hiero   MIRA        10,990  37.6
  Syntax  MERT            25  39.5
  Syntax  MIRA           285  40.6

26 Summary
- MBR decoding
  - Select a less risky hypothesis from the n-best list (or lattice)
- MIRA
  - Optimize feature weights for the decoder
  - Works with a very large number of features

