Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation
Chin-Yew Lin & Franz Josef Och (presented by Bilmes)
(or: Orange, a Method for Automatically Evaluating Automatic Evaluation Metrics for Machine Translation)
Summary
1. Introduces ORANGE, a way to automatically evaluate automatic evaluation methods.
2. Introduces three new ways to automatically evaluate MT systems: ROUGE-L, ROUGE-W, and ROUGE-S.
3. Uses ORANGE to evaluate many different evaluation methods, and finds that their new one, ROUGE-S4, is the best evaluator.
Reminder: Adequacy & Fluency
Adequacy refers to the degree to which the translation communicates the information present in the original. Roughly, a translation using the same words (1-grams) as the reference tends to satisfy adequacy.
Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language. Roughly, longer n-gram matches between a translation and a reference tend to indicate better fluency.
Reminder: BLEU
unigram precision = (number of candidate unigrams that appear in a reference translation) / (candidate translation length)
modified unigram precision = clipped(number of candidate unigrams in reference translations) / (candidate translation length)
- clipping caps each unigram's count at its maximum count in any single reference translation
modified n-gram precision: the same thing, for n-grams
On blocks of text: a brevity penalty is needed, since short candidates can otherwise score high n-gram precision: BP = 1 if c > r, else exp(1 - r/c), where c is the candidate length and r the reference length.
Finally: BLEU = BP * exp( (1/N) * sum_{n=1..N} log p_n )
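The steps above can be sketched in code. This is a minimal illustration, not the official BLEU implementation; the function names are my own, and it assumes uniform n-gram weights and pre-tokenized word lists.

```python
from collections import Counter
import math

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in any single reference."""
    cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        for g, c in ref_counts.items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified 1..max_n-gram precisions, times brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]  # closest reference length
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Note how clipping handles degenerate candidates: "the the the the the the the" against a reference containing "the" twice gets modified unigram precision 2/7, not 7/7.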
Still Other Reminders
Pearson's product-moment correlation coefficient:
- just the normal correlation coefficient; for zero-mean variables, r^2 = (E[XY])^2 / (E[X^2] E[Y^2])
Spearman's rank order correlation coefficient:
- the same thing computed on ranks; equivalently rho = 1 - 6 * sum_i D_i^2 / (n (n^2 - 1)), where D_i = rank_A(i) - rank_B(i)
Bootstrap method to compute confidence intervals:
- resample with replacement from the data N times, compute the statistic on each resample, and report val +/- 2*se(val) as a 95% confidence interval.
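The bootstrap recipe on this slide can be sketched as follows. This is an illustrative implementation under the slide's normal-approximation convention (val ± 2·se); the function name and defaults are my own.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=1000, seed=0):
    """95% CI via the normal approximation: stat(data) +/- 2 * bootstrap standard error."""
    rng = random.Random(seed)
    resample_stats = []
    for _ in range(n_resamples):
        # resample with replacement, same size as the original data
        sample = [rng.choice(data) for _ in data]
        resample_stats.append(stat(sample))
    se = statistics.stdev(resample_stats)  # bootstrap estimate of the standard error
    center = stat(data)
    return center - 2 * se, center + 2 * se
```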
On to the paper: lots of ways to evaluate MT quality
- BLEU
- RED
- WER: length-normalized edit distance
- PER: position-independent word error rate (bag-of-words approach)
- GTM: general text matcher, based on a balance of recall, precision, & their F-measure combination (we should do this paper)
This paper now introduces three more such metrics: ROUGE-L, ROUGE-W, and ROUGE-S (which we shall define).
Correlation coefficients & 95% CIs of 8 MT systems in NIST03 Chinese-English, using various MT evaluation methods
Problem: we need a way to automatically evaluate these automatic evaluation methods, since we don't know which one is best, which one to use, or how and when to choose. We want to break out of the region of insignificant difference.
Question: what about the meta-regress: do we need a way to automatically evaluate automatic evaluations of automatic evaluation methods?
Anyway, the goal of this paper (other than introducing new automatic evaluation methods) is to introduce ORANGE: Oracle Ranking for Gisting Evaluation, the first automatic evaluation of automatic MT evaluation methods.
ORANGE
Intuitively: use a translation's "rank" as scored by the MT evaluation metric (good translations should have high rank, poor ones low rank); reference translations should have higher rank.
Key quantity: the average rank of the reference translations within the combined machine + reference translation list. ORANGE = average rank / N in the N-best list.
1. The bank was visited by me yesterday.
2. I went to the bank yesterday.
3. Yesterday, I went to the bank.
4. Yesterday, the bank had the opportunity to be visited by me, and in fact this did indeed occur.
5. There was once this bank that at least as of yesterday existed, and so did I, and a funny thing happened ...
Here the reference translations were ranked 2 and 3 in the list, so the average rank = 2.5. The smaller, the better.
ORANGE
The way they calculate ORANGE in this work:
ORANGE = (1/S) * sum_{i=1..S} Rank(Oracle_i) / N
where Oracle_i = the reference translations of source sentence i, N = size of the N-best list, S = number of sentences in the corpus, and Rank(Oracle_i) = the average rank of source sentence i's reference translations in N-best list i.
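The computation above can be sketched as follows. This is a minimal sketch under my reading of the slide (references are merged into the N-best list, ranked by the metric, and their average rank is normalized by the combined list size); the function names and the toy `metric` are my own, not the paper's code.

```python
def orange_score(nbest_lists, references_per_sentence, metric):
    """ORANGE: average normalized rank of the reference translations inside each
    combined (machine N-best + references) list. metric(hyp) scores a translation,
    higher = better; a lower ORANGE score means a better evaluation metric."""
    total_norm_rank = 0.0
    for nbest, refs in zip(nbest_lists, references_per_sentence):
        combined = nbest + refs
        # rank 1 = best-scoring translation under the metric
        ranked = sorted(combined, key=metric, reverse=True)
        ref_ranks = [ranked.index(r) + 1 for r in refs]
        avg_ref_rank = sum(ref_ranks) / len(ref_ranks)
        total_norm_rank += avg_ref_rank / len(combined)
    return total_norm_rank / len(nbest_lists)
```

For example, with a toy metric that prefers translations of length 3, a 3-word reference mixed into a 2-hypothesis N-best list lands at rank 1 of 3, giving ORANGE = 1/3.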
Three new metrics ROUGE-L ROUGE-W ROUGE-S
LCS: Longest Common Subsequences
Computing: Longest Common Subsequences
Key thing: this does not require consecutive matches in the strings.
Example: X = "police killed the gunman", Y = "police kill the gunman"; LCS(X,Y) = 3 (the common subsequence "police the gunman").
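The standard dynamic program for LCS length, shown as a short sketch (the function name is my own):

```python
def lcs(x, y):
    """Length of the longest common subsequence of x and y.
    Matches need not be consecutive, only in order."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop one word
    return dp[m][n]
```

Running it on the slide's example, `lcs("police killed the gunman".split(), "police kill the gunman".split())` returns 3.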
ROUGE-L
Basically, an "F-measure" (or combination) of two normalized LCS scores:
R_lcs = LCS(X,Y)/m,  P_lcs = LCS(X,Y)/n,  ROUGE-L = (1+beta^2) R_lcs P_lcs / (R_lcs + beta^2 P_lcs)
where m and n are the lengths of reference X and candidate Y.
1. Again, no consecutive matches necessary.
2. Automatically includes the longest in-sequence common n-gram.
ROUGE-L example: one reference and two candidates; ROUGE-L = 3/5 for the first candidate and ROUGE-L = 1/2 for the second.
ROUGE-L (cont.)
3. Problem: ROUGE-L counts only the words of the main in-sequence subsequence; other LCSs and shorter common subsequences are not counted.
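Putting the pieces together, ROUGE-L can be sketched as below. This is an illustrative implementation assuming beta = 1 by default; the function name is my own.

```python
def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L: F-measure of LCS-based recall and precision.
    R = LCS/len(reference), P = LCS/len(candidate)."""
    m, n = len(reference), len(candidate)
    # LCS length via dynamic programming
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    l = dp[m][n]
    if l == 0:
        return 0.0
    r, p = l / m, l / n
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

On the running example, candidate "police kill the gunman" against reference "police killed the gunman" gives R = P = 3/4, so ROUGE-L = 0.75.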
Computing: ROUGE-W
Weight the LCS computation so that consecutive matches are rewarded more than non-consecutive matches: a weighted LCS (WLCS) credits a consecutive run of k matches with f(k), where f satisfies f(x+y) > f(x) + f(y), e.g., f(k) = k^alpha with alpha > 1; the final score applies f^{-1} to normalize.
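A sketch of the weighted-LCS dynamic program, assuming f(k) = k^alpha with alpha = 2 as the weighting function (the alpha value and function names here are my own illustration choices, not necessarily the paper's settings):

```python
def wlcs(x, y, alpha=2.0):
    """Weighted LCS: a consecutive run of k matches contributes f(k) = k**alpha,
    so longer consecutive runs are rewarded more than scattered matches."""
    f = lambda k: k**alpha
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # weighted LCS score so far
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of current consecutive run
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)  # extend the run
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
                w[i][j] = 0  # run is broken
    return c[m][n]

def rouge_w(candidate, reference, alpha=2.0, beta=1.0):
    """ROUGE-W: F-measure of WLCS, normalized through f inverse (k**(1/alpha))."""
    score = wlcs(reference, candidate, alpha)
    r = (score / len(reference)**alpha) ** (1 / alpha)
    p = (score / len(candidate)**alpha) ** (1 / alpha)
    if r + p == 0:
        return 0.0
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

The weighting is what separates it from plain LCS: four consecutive matches score f(4) = 16, while four scattered single matches score only 4 * f(1) = 4.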
ROUGE-S
Another "F-measure", but here using skip-bigram co-occurrence statistics (i.e., non-consecutive but same-order word pairs). The goal is to measure skip-bigram overlap. We use the function SKIP2(X,Y) to count the skip-bigrams common to X and Y:
R_skip2 = SKIP2(X,Y)/C(m,2),  P_skip2 = SKIP2(X,Y)/C(n,2),  ROUGE-S = (1+beta^2) R P / (R + beta^2 P)
ROUGE-S
Using the SKIP2() function:
- No consecutive matches required, but word order is still respected.
- Counts *all* in-order matching word pairs (LCS counts only the longest common subsequence).
- Can impose a limit on the maximum skip distance: ROUGE-Sn has maximum skip distance n (e.g., ROUGE-S4).
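The skip-bigram machinery can be sketched as below. This is an illustrative implementation with my own function names; I interpret "skip distance" as the number of words allowed between the pair, which may differ slightly from the paper's exact convention.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(words, max_skip=None):
    """All in-order word pairs; optionally limit the gap between the two words."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(words), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs.append((a, b))
    return pairs

def rouge_s(candidate, reference, max_skip=None, beta=1.0):
    """ROUGE-S: F-measure of skip-bigram overlap (SKIP2 = common skip-bigrams)."""
    cand = Counter(skip_bigrams(candidate, max_skip))
    ref = Counter(skip_bigrams(reference, max_skip))
    skip2 = sum((cand & ref).values())  # clipped common skip-bigram count
    if skip2 == 0:
        return 0.0
    p = skip2 / sum(cand.values())
    r = skip2 / sum(ref.values())
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```

On the running example, "police kill the gunman" vs. "police killed the gunman" share 3 of 6 skip-bigrams each (police-the, police-gunman, the-gunman), so ROUGE-S = 0.5.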
Setup
- ISI's AlTemp SMT system
- 2002 NIST Chinese-English evaluation corpus
- 872 source sentences, 4 reference translations each
- 1024-best lists used
Evaluating BLEU with ORANGE
Smoothed BLEU: add one to both the matched count and the total n-gram count for all n > 1, so a candidate with no higher-order n-gram matches still gets a nonzero sentence-level score.
Evaluating BLEU with the correlation coefficient
Evaluating ROUGE-L/W with ORANGE
Evaluating ROUGE-S with ORANGE
Summary of metrics with ORANGE