
METEOR: Metric for Evaluation of Translation with Explicit Ordering
An Improved Automatic Metric for MT Evaluation
Alon Lavie
Joint work with: Satanjeev Banerjee, Kenji Sagae, Shyamsundar Jayaraman
Language Technologies Institute, Carnegie Mellon University
MT Group Lunch, January 18, 2005

Outline
- Similarity-based metrics for MT evaluation
- Weaknesses in precision-based MT metrics (BLEU, NIST)
- Simple unigram-based MT evaluation metrics
- METEOR
- Evaluation methodology
- Experimental evaluation
- Recent work
- Future directions

Automatic Metrics for MT Evaluation
Idea: compare the output of an MT system to a "reference" good (usually human) translation: how close is the MT output to the reference translation?
Advantages:
- Fast and cheap, minimal human labor, no need for bilingual speakers
- Can be used on an ongoing basis during system development to test changes
Disadvantages:
- Current metrics are very crude and do not distinguish well between subtle differences between systems
- Individual sentence scores are not very reliable; aggregate scores on a large test set are required
Automatic metrics for MT evaluation are a very active area of current research

Automatic Metrics for MT Evaluation
Example:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
Possible metric components:
- Precision: correct words / total words in MT output
- Recall: correct words / total words in reference
- Combination of P and R (e.g. F1 = 2PR/(P+R))
- Levenshtein edit distance: number of insertions, deletions and substitutions required to transform the MT output into the reference
Important issue: exact word matches are too harsh; synonyms and inflections ("Iraq's" vs. "Iraqi", "give" vs. "handed over") receive no credit
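The precision and recall components above can be made concrete with a small sketch (Python, not code from the talk); whitespace tokenization, lowercasing and multiset counting are simplifying assumptions made here for illustration.

from collections import Counter

def unigram_precision_recall(mt_output, reference):
    """Unigram precision and recall of an MT output against one reference.
    Each reference word is credited at most as many times as it occurs
    in the reference (multiset intersection)."""
    mt_counts = Counter(mt_output.lower().split())
    ref_counts = Counter(reference.lower().split())
    matches = sum((mt_counts & ref_counts).values())
    return matches / sum(mt_counts.values()), matches / sum(ref_counts.values())

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
print(unigram_precision_recall(mt, ref))
# (0.5, 0.2857...): 4/8 and 4/14 with exact matching only; stemming or
# synonym matching would additionally credit "Iraq's"/"Iraqi".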

Similarity-based MT Evaluation Metrics
- Assess the "quality" of an MT system by comparing its output with human-produced "reference" translations
- Premise: the more similar (in meaning) the translation is to the reference, the better
- Goal: an algorithm that is capable of accurately approximating this similarity
- Wide range of metrics, mostly focusing on word-level correspondences:
  - Edit-distance metrics: Levenshtein, WER, PIWER, ...
  - Ngram-based metrics: Precision, Recall, F1-measure, BLEU, NIST, GTM, ...
- Main issue: exact word matching is a very crude estimate of sentence-level similarity in meaning

Desirable Automatic Metric
- High levels of correlation with quantified human notions of translation quality
- Sensitive to small differences in MT quality between systems and versions of systems
- Consistent: the same MT system on similar texts should produce similar scores
- Reliable: MT systems that score similarly will perform similarly
- General: applicable to a wide range of domains and scenarios
- Fast and lightweight: easy to run

The BLEU Metric
Proposed by IBM [Papineni et al., 2002]
Main ideas:
- Exact matches of words
- Match against a set of reference translations for greater variety of expressions
- Account for adequacy by looking at word precision
- Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
- No recall (difficult to define with multiple references)
- To compensate for recall: introduce a "Brevity Penalty"
- Final score is a weighted geometric average of the n-gram scores
- Calculate an aggregate score over a large test set

The BLEU Metric
Example:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
BLEU n-gram precisions:
- 1-gram precision: 4/8
- 2-gram precision: 1/7
- 3-gram precision: 0/6
- 4-gram precision: 0/5
BLEU score = 0 (weighted geometric average)
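A sketch of how these n-gram precisions are obtained (an illustration, not the official BLEU implementation); clipping of repeated n-grams, introduced on the next slide, is omitted here, and tokenization is plain whitespace splitting.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(mt_output, reference, n):
    """(matched, total) n-gram counts of the MT output against one reference."""
    mt_ngrams = ngrams(mt_output.lower().split(), n)
    ref_ngrams = ngrams(reference.lower().split(), n)
    matched = sum(1 for g in mt_ngrams if g in ref_ngrams)
    return matched, len(mt_ngrams)

ref = "the Iraqi weapons are to be handed over to the army within two weeks"
mt = "in two weeks Iraq's weapons will give army"
for n in range(1, 5):
    matched, total = ngram_precision(mt, ref, n)
    print(f"{n}-gram precision: {matched}/{total}")
# 1-gram 4/8, 2-gram 1/7, 3-gram 0/6, 4-gram 0/5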

The BLEU Metric
Clipping precision counts:
- Reference 1: "the Iraqi weapons are to be handed over to the army within two weeks"
- Reference 2: "the Iraqi weapons will be surrendered to the army in two weeks"
- MT output: "the the the the"
- The precision count for "the" is clipped at two: the maximum count of the word in any single reference
- The modified unigram score is therefore 2/4 (not 4/4)
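A sketch of the clipped (modified) unigram precision described above, again a simplified illustration rather than reference BLEU code; counts are clipped at the maximum count of the word in any single reference.

from collections import Counter

def clipped_unigram_precision(mt_output, references):
    """Unigram precision with counts clipped at the max count per reference."""
    mt_counts = Counter(mt_output.lower().split())
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word]) for word, count in mt_counts.items())
    return clipped / sum(mt_counts.values())

refs = ["the Iraqi weapons are to be handed over to the army within two weeks",
        "the Iraqi weapons will be surrendered to the army in two weeks"]
print(clipped_unigram_precision("the the the the", refs))  # 2/4 = 0.5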

The BLEU Metric
Brevity Penalty:
- Reference 1: "the Iraqi weapons are to be handed over to the army within two weeks"
- Reference 2: "the Iraqi weapons will be surrendered to the army in two weeks"
- MT output: "the the"
- Precision scores: unigram 2/2, bigram 1/1, so BLEU = 1.0
- The MT output is much too short, which boosts precision, and BLEU has no recall to catch this
- An exponential Brevity Penalty reduces the score; it is calculated on the aggregate test-set length (not per sentence)
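The brevity penalty itself can be sketched as the standard exponential form, computed over aggregate test-set lengths; how the effective reference length is chosen when there are multiple references is left out of this illustration.

import math

def brevity_penalty(candidate_length, reference_length):
    """Exponential brevity penalty over aggregate (test-set) lengths."""
    if candidate_length >= reference_length:
        return 1.0
    return math.exp(1.0 - reference_length / candidate_length)

# A two-word output measured against a 12-word reference is penalized heavily:
print(brevity_penalty(2, 12))  # exp(1 - 6) ≈ 0.0067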

Weaknesses in BLEU (and NIST)
- BLEU matches word n-grams of the MT translation against multiple reference translations simultaneously, making it a precision-based metric
  - Is this better than matching against each reference translation separately and selecting the best match?
- BLEU compensates for recall by factoring in a "Brevity Penalty" (BP)
  - Is the BP adequate in compensating for the lack of recall?
- BLEU's n-gram matching requires exact word matches
  - Can stemming and synonyms improve the similarity measure and improve correlation with human scores?
- All matched words weigh equally in BLEU
  - Can a scheme for weighing word contributions improve correlation with human scores?
- BLEU's higher-order n-grams account for fluency and grammaticality, and the n-gram scores are geometrically averaged
  - Geometric n-gram averaging is volatile to "zero" scores: can we account for fluency/grammaticality by other means?

Roadmap to a Desirable Metric
Establishing a metric with much improved correlation with human judgment scores at the sentence level will go a long way towards our overall goals.
Our approach:
- Explicitly align the words in the MT translation with their corresponding matches in the reference translation, allowing for exact matches, stemmed word matches, and synonym and semantically-related word matches
- Combine unigram Precision and Recall to account for similarity in "content" (translation adequacy)
- Weigh the contribution of matched words based on a measure related to their importance
- Estimate translation fluency/grammaticality based on an explicit measure related to word order, fragmentation and/or the average length of matched n-grams

Unigram-based Metrics
- Unigram Precision: fraction of words in the MT output that appear in the reference
- Unigram Recall: fraction of words in the reference translation that appear in the MT output
- F1 = P*R / (0.5*(P+R)), i.e. 2PR/(P+R)
- Fmean = P*R / (0.9*P + 0.1*R)
- With and without word stemming
- Match against each reference separately and select the best match for each sentence
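Written out as code (a sketch, not the experimental implementation), with the exact-match P and R of the running example (4/8 and 4/14) plugged in:

def f1(p, r):
    """Balanced harmonic mean of precision and recall: 2PR/(P+R)."""
    return (p * r) / (0.5 * (p + r)) if p + r > 0 else 0.0

def fmean(p, r):
    """Harmonic mean weighted 9:1 towards recall: PR/(0.9P + 0.1R)."""
    return (p * r) / (0.9 * p + 0.1 * r) if p + r > 0 else 0.0

print(f1(4 / 8, 4 / 14))     # ≈ 0.364
print(fmean(4 / 8, 4 / 14))  # ≈ 0.299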

The METEOR Metric
New metric under development at CMU. Main new ideas:
- Reintroduce Recall and combine it with Precision as score components
- Look only at unigram Precision and Recall
- Align the MT output with each reference individually and take the score of the best pairing
- Matching takes into account word inflection variations (via stemming)
- Address fluency via a direct penalty: how fragmented is the matching of the MT output with the reference?

The METEOR Metric
- The matcher explicitly aligns matched words between the MT output and the reference
- Multiple stages: exact matches, stemmed matches, (synonym matches)
- The matcher returns a fragment count, used to calculate the average fragmentation (frag)
- The METEOR score is calculated as a discounted Fmean score:
  - Discounting factor: DF = 0.5 * (frag**3)
  - Final score: Fmean * (1 - DF)

METEOR Metric
Effect of Discounting Factor (plot not included in the transcript)

The METEOR Metric
Example:
- Reference: "the Iraqi weapons are to be handed over to the army within two weeks"
- MT output: "in two weeks Iraq's weapons will give army"
Matching:
- Ref: Iraqi weapons army two weeks
- MT: two weeks Iraq's weapons army
Score computation:
- P = 5/8 = 0.625
- R = 5/14 = 0.357
- Fmean = 10*P*R/(9*P+R) = 0.3731
- Fragmentation: 3 fragments over 5 matched words = (3-1)/(5-1) = 0.50
- Discounting factor: DF = 0.5 * (frag**3) = 0.0625
- Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
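A sketch that reproduces the arithmetic of this example; the alignment itself (5 matched unigrams in 3 contiguous chunks) is taken as given, and the fragmentation is computed exactly as defined above, frag = (chunks - 1)/(matches - 1).

def meteor_score(p, r, num_chunks, num_matches):
    """METEOR as defined on these slides: Fmean discounted by fragmentation."""
    fmean = 10.0 * p * r / (9.0 * p + r)
    frag = (num_chunks - 1) / (num_matches - 1) if num_matches > 1 else 0.0
    discount = 0.5 * frag ** 3
    return fmean * (1.0 - discount)

print(meteor_score(5 / 8, 5 / 14, 3, 5))  # ≈ 0.3498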

BLEU vs. METEOR
How do we know if a metric is better?
- Better correlation with human judgments of MT output
- Reduced score variability on MT outputs that are ranked equivalent by humans
- Higher and less variable scores when scoring human translations against the reference translations

Evaluation Methodology
- Correlation of metric scores with human scores at the system level
  - Human scores are adequacy + fluency [2-10]
  - Pearson correlation coefficients
  - Confidence ranges for the correlation coefficients
- Correlation of score differentials between all pairs of systems [Coughlin 2003]
  - Assumes a linear relationship between the score differentials
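For the system-level correlations, each system contributes one (metric score, human score) pair; below is a minimal sketch of the Pearson coefficient used in the result tables, with made-up numbers purely for illustration (not the TIDES data).

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for seven systems:
metric_scores = [0.21, 0.25, 0.28, 0.30, 0.33, 0.35, 0.38]
human_scores = [5.1, 5.6, 5.9, 6.4, 6.6, 7.0, 7.3]
print(pearson(metric_scores, human_scores))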

Evaluation Setup
Data: DARPA/TIDES 2002 and 2003 Chinese-to-English MT evaluation data
- 2002 data: ~900 sentences, 4 reference translations, 7 systems
- 2003 data: 6 systems
Metrics compared: BLEU, NIST, P, R, F1, Fmean, GTM, B&H, METEOR

Evaluation Results: 2002 System-level Correlations

Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.461                 ±0.058
BLEU-stemmed         0.528                 ±0.061
NIST                 0.603                 ±0.049
NIST-stemmed         0.740                 ±0.043
Precision            0.175                 ±0.052
Precision-stemmed    0.257                 ±0.065
Recall               0.615                 ±0.042
Recall-stemmed       0.757
F1                   0.425                 ±0.047
F1-stemmed           0.564
Fmean                0.585
Fmean-stemmed        0.733                 ±0.044
GTM                  0.77
GTM-stemmed          0.68
METEOR               0.7191

Evaluation Results: 2003 System-level Correlations

Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.817                 ±0.021
BLEU-stemmed         0.843                 ±0.018
NIST                 0.892                 ±0.013
NIST-stemmed         0.915                 ±0.010
Precision            0.683                 ±0.041
Precision-stemmed    0.752
Recall               0.961                 ±0.011
Recall-stemmed       0.940                 ±0.014
F1                   0.909                 ±0.025
F1-stemmed           0.948
Fmean                0.959                 ±0.012
Fmean-stemmed        0.952
GTM                  0.79
GTM-stemmed          0.89
METEOR               0.964

Evaluation Results: 2002 Pairwise Correlations

Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.498                 ±0.054
BLEU-stemmed         0.559                 ±0.058
NIST                 0.679                 ±0.042
NIST-stemmed         0.774                 ±0.041
Precision            0.298                 ±0.051
Precision-stemmed    0.325                 ±0.064
Recall               0.743                 ±0.032
Recall-stemmed       0.845                 ±0.029
F1                   0.549
F1-stemmed           0.643                 ±0.046
Fmean                0.711                 ±0.033
Fmean-stemmed        0.818
GTM                  0.71
GTM-stemmed          0.69
METEOR               0.8115

Evaluation Results: 2003 Pairwise Correlations

Metric               Pearson Coefficient   Confidence Interval
BLEU                 0.758                 ±0.027
BLEU-stemmed         0.793                 ±0.025
NIST                 0.886                 ±0.017
NIST-stemmed         0.924                 ±0.013
Precision            0.573                 ±0.053
Precision-stemmed    0.666                 ±0.058
Recall               0.954                 ±0.014
Recall-stemmed       0.923                 ±0.018
F1                   0.881                 ±0.024
F1-stemmed           0.950
Fmean                                      ±0.015
Fmean-stemmed        0.940
GTM                  0.78
GTM-stemmed          0.86
METEOR               0.965

METEOR vs. BLEU: Sentence-level Scores (CMU SMT System, TIDES 2003 Data)
(scatter plot not included in the transcript)

METEOR vs. BLEU: Histogram of Scores of Reference Translations (2003 Data)
- BLEU: Mean = 0.3727, STD = 0.2138
- METEOR: Mean = 0.6504, STD = 0.1310

Further Issues
- Words are not created equal: some are more important for effective translation
- More effective matching with synonyms and inflected forms:
  - Stemming
  - Use of a synonym knowledge-base (WordNet)
- How to incorporate such information within the metric?
  - Train weights for word matches
  - Different weights for "content" and "function" words (see the sketch below)
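One possible realization of the content/function-word idea is sketched below; the stop-list and the 1.0/0.2 weights are invented for this illustration, not values from the METEOR experiments.

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "be", "will"}

def weighted_mass(tokens, content_weight=1.0, function_weight=0.2):
    """Total credit of a token list, giving function words less weight."""
    return sum(function_weight if t.lower() in FUNCTION_WORDS else content_weight
               for t in tokens)

def weighted_precision_recall(matched_tokens, mt_tokens, ref_tokens):
    """Precision/recall where each matched word contributes its weight."""
    m = weighted_mass(matched_tokens)
    return m / weighted_mass(mt_tokens), m / weighted_mass(ref_tokens)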

Some Recent Work [Bano]
Further experiments with advanced features:
- With and without stemming
- With and without WordNet synonyms
- With and without a small "stop-list"

New Evaluation Results: 2002 System-level Correlations

Metric                       Pearson Coefficient
METEOR                       0.6016
METEOR+Pstem                 0.7177
METEOR+WNstem                0.7275
METEOR+Pst+syn               0.8070
METEOR+WNst+syn              0.7966
METEOR+stop                  0.7597
METEOR+Pst+stop              0.8638
METEOR+WNst+stop             0.8783
METEOR+Pst+syn+stop          0.9138
METEOR+WNst+syn+stop         0.9126
F1+Pstem+syn+stop            0.8872
Fmean+WNst+syn+stop          0.8843
Fmean+Pst+syn+stop           0.8799
F1+WNstem+syn+stop           0.8787

New Evaluation Results: 2003 System-level Correlations

Metric                       Pearson Coefficient
METEOR                       0.9175
METEOR+Pstem                 0.9621
METEOR+WNstem                0.9705
METEOR+Pst+syn               0.9852
METEOR+WNst+syn              0.9821
METEOR+stop                  0.9286
METEOR+Pst+stop              0.9682
METEOR+WNst+stop             0.9764
METEOR+Pst+syn+stop
METEOR+WNst+syn+stop         0.9802
Fmean+WNstem                 0.9924
Fmean+WNstem+stop            0.9914
Fmean+Pstem                  0.9905
Fmean+WNstem+syn

Current and Future Directions
- Word weighing schemes
- Word similarity beyond synonyms
- Optimizing the fragmentation-based discount factor
- Alternative metrics for capturing fluency and grammaticality