METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation Alon Lavie and Abhaya Agarwal Language Technologies Institute Carnegie Mellon University
METEOR Originally developed in 2005 as an automatic metric designed for higher correlation with human judgments at the sentence level Main ingredients: Extended Matching between translation and reference Unigram Precision, Recall parameterized F-measure Reordering Penalty Parameters can be tuned to optimize correlation with human judgments Our previous work established improved correlation with human judgments (compared with BLEU and other metrics) Not biased against “non-statistical” MT systems Only metric that correctly ranked NIST MT Eval-06 Arabic systems Was used as primary metric in DARPA TRANSTAC and ET-07 Evaluations, one of several metrics in NIST MT Eval, IWSLT, WMT Main innovation in latest version (used for WMT-08): Ranking-based Parameter optimization June 19, 2008METEOR and M-BLEU2
METEOR – Flexible Word Matching Words in reference translation and MT hypothesis are matched using a series of modules with increasingly loose criteria of matching Exact match Porter Stemmer Word Net based Synonymy A word-to-word alignment for the sentence pair is computed using these word level matchings NP-hard in general, uses fast approximate search June 19, 20083METEOR and M-BLEU
Alignment Example The Sri Lanka prime minister criticizes the leader of the country President of Sri Lanka criticized by the country's prime minister June 19, 20084METEOR and M-BLEU
Alignment Example The Sri Lanka prime minister criticizes the leader of the country President of Sri Lanka criticized by the country's prime minister June 19, 20085METEOR and M-BLEU
Alignment Example The Sri Lanka prime minister criticizes the leader of the country President of Sri Lanka criticized by the country's prime minister June 19, 20086METEOR and M-BLEU
Alignment Example The Sri Lanka prime minister criticizes the leader of the country President of Sri Lanka criticized by the country's prime minister June 19, 20087METEOR and M-BLEU
METEOR : Score Computation Weighted combination of the unigram precision and recall A fragmentation penalty to address fluency A “chunk” is a monotonic sequence of aligned words Final Score June 19, 20088METEOR and M-BLEU
METEOR Parameter Tuning The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgements. Since the ranges of the parameters are bounded, we perform an exhaustive search Current official release of METEOR was tuned to obtain good correlations with adequacy and fluency human judgements For WMT-08, we re-tuned to optimize correlation with human ranking data released from last year's WMT shared task June 19, 20089METEOR and M-BLEU
Computing Ranking Correlation Convert binary judgements into full rankings Reject equal judgements Build a directed graph with nodes representing individual hypothesis and edges representing binary judgements Topologically sort the graph For one source sentence with N hypotheses Average across all source sentences June 19, METEOR and M-BLEU
Results 3-fold Cross validation results on the WMT 2007 data (Average Spearman Correlation) Results on the WMT 2008 data( % of correct binary judgments) June 19, METEOR and M-BLEU
Flexible Matching for BLEU,TER The flexible matching in METEOR can be used to extend any metric that is based on word overlap between translation and reference(s) Compute the alignment between reference and hypothesis using the METEOR matcher Create a “targeted” reference by substituting words in the reference with their matched equivalences from the translation hypothesis. Compute any metric (BLEU, TER, etc.) with the new targeted references June 19, METEOR and M-BLEU
M-BLEU : Results Average BLEU and M-BLEU scores in WMT 2008 data No consistent gains seen in correlations at the segment level on WMT 2007 data Similar mixed patterns seen in WMT 2008 data as well. (as reported in [Callison-Burch et al 2008]) June 19, METEOR and M-BLEU
Discussion and Future Work METEOR has consistently demonstrated improved levels of correlation with human judgements in multiple evaluations in recent years Simple and relatively fast to compute Some minor issues for using it in MERT being resolved Use it to evaluate your system, even if you tune to BLEU! Performance of MT metrics varies quite a lot across languages, genres and years Partially due to no good methodology for evaluation of these metrics More sophisticated paraphrase detection methods (multi-word correspondences, such as compounds in German) would be useful for languages June 19, METEOR and M-BLEU
Thank You ! Questions ? June 19, METEOR and M-BLEU