METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation. Alon Lavie and Abhaya Agarwal, Language Technologies Institute, Carnegie Mellon University.

Presentation transcript:

METEOR-Ranking & M-BLEU: Flexible Matching & Parameter Tuning for MT Evaluation
Alon Lavie and Abhaya Agarwal
Language Technologies Institute, Carnegie Mellon University

METEOR

Originally developed in 2005 as an automatic metric designed for higher correlation with human judgments at the sentence level.

Main ingredients:
- Extended matching between translation and reference
- Unigram precision and recall, combined into a parameterized F-measure
- Reordering penalty
- Parameters can be tuned to optimize correlation with human judgments

Our previous work established improved correlation with human judgments (compared with BLEU and other metrics):
- Not biased against "non-statistical" MT systems
- The only metric that correctly ranked the NIST MT Eval-06 Arabic systems
- Used as the primary metric in the DARPA TRANSTAC and ET-07 evaluations, and as one of several metrics in NIST MT Eval, IWSLT, and WMT

Main innovation in the latest version (used for WMT-08): ranking-based parameter optimization.

METEOR – Flexible Word Matching

Words in the reference translation and the MT hypothesis are matched using a series of modules with increasingly loose matching criteria:
- Exact match
- Porter stemmer
- WordNet-based synonymy

A word-to-word alignment for the sentence pair is computed using these word-level matches:
- NP-hard in general; uses a fast approximate search
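As a minimal sketch of this matching cascade (not the actual METEOR matcher, which also constrains each word to at most one link and searches for the alignment with the fewest crossings), assuming NLTK's Porter stemmer and WordNet data are available:

```python
# Requires: pip install nltk; python -c "import nltk; nltk.download('wordnet')"
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet

stemmer = PorterStemmer()

def synonyms(word):
    """All WordNet lemmas sharing a synset with `word`."""
    return {lemma.name().lower() for syn in wordnet.synsets(word) for lemma in syn.lemmas()}

def match_module(hyp_word, ref_word):
    """Return the first module in the exact -> stem -> synonym cascade that
    matches the two words, or None if no module matches."""
    h, r = hyp_word.lower(), ref_word.lower()
    if h == r:
        return "exact"
    if stemmer.stem(h) == stemmer.stem(r):
        return "stem"
    if h in synonyms(r) or r in synonyms(h):
        return "synonym"
    return None
```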

Alignment Example

The Sri Lanka prime minister criticizes the leader of the country
President of Sri Lanka criticized by the country's prime minister
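The alignment links are drawn as a graphic in the original slides. Below is one plausible alignment for this pair, written out as word pairs; the links and module labels are illustrative, and tokenization details (e.g. splitting "country's") are glossed over.

```python
# One plausible METEOR alignment for the example pair (illustrative only):
# each link is (word in sentence 1, word in sentence 2, matching module).
alignment = [
    ("Sri", "Sri", "exact"),
    ("Lanka", "Lanka", "exact"),
    ("prime", "prime", "exact"),
    ("minister", "minister", "exact"),
    ("criticizes", "criticized", "stem"),
    ("of", "of", "exact"),
    ("the", "the", "exact"),
    ("country", "country's", "stem"),
]
```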

METEOR: Score Computation

- Weighted combination of the unigram precision and recall
- A fragmentation penalty to address fluency; a "chunk" is a monotonic sequence of aligned words
- Final score: the F-measure scaled down by the fragmentation penalty (a sketch follows below)
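The formulas themselves are not reproduced in this transcript. As a sketch, the parameterized form from the METEOR papers (Lavie & Agarwal, 2007) can be written as follows; alpha, beta, and gamma are the three free parameters referred to on the next slide.

```python
def meteor_score(matches, hyp_len, ref_len, chunks, alpha, beta, gamma):
    """Parameterized METEOR score (sketch following Lavie & Agarwal, 2007).

    matches -- number of aligned unigrams
    chunks  -- number of monotonic runs ("chunks") of aligned words
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Parameterized harmonic mean; alpha controls the precision/recall balance
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty: fewer, longer chunks => lower penalty
    penalty = gamma * (chunks / matches) ** beta
    return fmean * (1 - penalty)
```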

METEOR Parameter Tuning

- The 3 free parameters in the metric are tuned to obtain maximum correlation with human judgments
- Since the ranges of the parameters are bounded, we perform an exhaustive search (a sketch follows below)
- The current official release of METEOR was tuned to obtain good correlations with adequacy and fluency human judgments
- For WMT-08, we re-tuned to optimize correlation with the human ranking data released from last year's WMT shared task
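A minimal sketch of the exhaustive search; correlation_fn is a hypothetical helper that re-scores the tuning set with the given parameters and returns its correlation with the human judgments, and the grid granularity and parameter bounds are illustrative.

```python
import itertools
import numpy as np

def tune_parameters(correlation_fn):
    """Exhaustive grid search over the three bounded METEOR parameters."""
    alphas = np.arange(0.0, 1.01, 0.05)
    betas = np.arange(0.0, 3.01, 0.10)   # assumed upper bound for the penalty exponent
    gammas = np.arange(0.0, 1.01, 0.05)
    best_params, best_corr = None, float("-inf")
    for a, b, g in itertools.product(alphas, betas, gammas):
        corr = correlation_fn(a, b, g)
        if corr > best_corr:
            best_params, best_corr = (a, b, g), corr
    return best_params, best_corr
```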

Computing Ranking Correlation

Convert binary judgments into full rankings:
- Reject equal (tied) judgments
- Build a directed graph with nodes representing individual hypotheses and edges representing binary judgments
- Topologically sort the graph

Compute the rank correlation for one source sentence with N hypotheses, then average across all source sentences (see the sketch below).
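A minimal sketch of this procedure, assuming networkx and scipy are available; hypothesis identifiers are arbitrary, and the handling of cycles or missing comparisons in the judgment graph is omitted.

```python
import networkx as nx
from scipy.stats import spearmanr

def human_ranking(pairwise_judgments):
    """Turn binary "A beats B" judgments for one source sentence into a full ranking.

    pairwise_judgments: iterable of (better_hyp, worse_hyp) pairs, with ties
    already rejected. A topological sort of the judgment graph yields a ranking
    consistent with every retained judgment.
    """
    graph = nx.DiGraph(pairwise_judgments)
    order = list(nx.topological_sort(graph))
    return {hyp: rank for rank, hyp in enumerate(order)}  # rank 0 = best

def segment_correlation(metric_scores, pairwise_judgments):
    """Spearman correlation between metric scores and the induced human ranking."""
    ranking = human_ranking(pairwise_judgments)
    hyps = list(ranking)
    human_ranks = [ranking[h] for h in hyps]
    metric_values = [-metric_scores[h] for h in hyps]  # negate: higher score = better rank
    corr, _ = spearmanr(human_ranks, metric_values)
    return corr
```

Per the slide, this per-sentence correlation is then averaged across all source sentences.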

Results

- 3-fold cross-validation results on the WMT 2007 data (average Spearman correlation)
- Results on the WMT 2008 data (% of correct binary judgments)

Flexible Matching for BLEU, TER

The flexible matching in METEOR can be used to extend any metric that is based on word overlap between the translation and the reference(s):
- Compute the alignment between reference and hypothesis using the METEOR matcher
- Create a "targeted" reference by substituting words in the reference with their matched equivalents from the translation hypothesis (a sketch follows below)
- Compute any metric (BLEU, TER, etc.) against the new targeted references
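A minimal sketch of the "targeted reference" construction, assuming the METEOR matcher's output is available as (hypothesis index, reference index) pairs; function and variable names are illustrative.

```python
def targeted_reference(ref_tokens, hyp_tokens, alignment):
    """Replace each aligned reference word with its matched hypothesis word,
    so that a surface-level metric (BLEU, TER, ...) computed against this
    "targeted" reference also credits stem and synonym matches.

    alignment: iterable of (hyp_index, ref_index) pairs from the METEOR matcher.
    """
    targeted = list(ref_tokens)
    for hyp_i, ref_i in alignment:
        targeted[ref_i] = hyp_tokens[hyp_i]
    return targeted

# Example (hypothetical alignment):
# targeted_reference("the cat sat".split(), "a kitty sat".split(), [(1, 1), (2, 2)])
# -> ["the", "kitty", "sat"]
```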

M-BLEU: Results

- Average BLEU and M-BLEU scores on the WMT 2008 data
- No consistent gains in segment-level correlations on the WMT 2007 data
- Similar mixed patterns on the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])

Discussion and Future Work

METEOR has consistently demonstrated improved levels of correlation with human judgments in multiple evaluations in recent years:
- Simple and relatively fast to compute
- Some minor issues with using it in MERT are being resolved
- Use it to evaluate your system, even if you tune to BLEU!

The performance of MT metrics varies quite a lot across languages, genres, and years:
- Partially due to the lack of a good methodology for evaluating the metrics themselves

More sophisticated paraphrase-detection methods (multi-word correspondences, such as compounds in German) would be useful for such languages.

Thank You! Questions?