Sensitivity of automated MT evaluation metrics on higher quality MT output
BLEU vs. task-based evaluation methods
Bogdan Babych, Anthony Hartley
Centre for Translation Studies, University of Leeds, UK

Slide 1: Overview
- Classification of automated MT evaluation models
  – proximity-based vs. task-based vs. hybrid
- Some limitations of MT evaluation methods
- Sensitivity of automated evaluation metrics
  – declining sensitivity as a limit
- Experiment: measuring sensitivity in different areas of the adequacy scale
  – BLEU vs. NE recognition with GATE
- Discussion: can we explain/predict the limits?

Slide 2: Automated MT evaluation
- Metrics compute numerical scores that characterise certain aspects of machine translation quality
- Accuracy is verified by the degree of agreement with human judgements
  – possible only under certain restrictions:
    by text type, genre, target language
    by granularity of units (sentence, text, corpus)
    by system characteristics (SMT / RBMT)
- Used under other conditions, accuracy is not assured
- Important to explore the limits of each metric

Slide 3: Classification of MT evaluation models
- Reference proximity methods (BLEU, edit distance)
  – measure the distance between MT output and a "gold standard" translation (see the sketch after this slide)
  – "…the closer the machine translation is to a professional human translation, the better it is" (Papineni et al., 2002)
- Task-based methods (X-score, IE from MT, …)
  – measure the performance of some system which uses the degraded MT output: no reference needed
  – "…can someone using the translation carry out the instructions as well as someone using the original?" (Hutchins & Somers, 1992: 163)
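As a minimal illustration of the reference-proximity family, the sketch below computes corpus-level BLEU with NLTK. The sample sentences and the choice of toolkit are assumptions for illustration only, not the authors' actual evaluation pipeline.

```python
# Hedged sketch: reference-proximity scoring with BLEU, assuming NLTK is installed.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of references per hypothesis; references and hypotheses are tokenised.
references = [
    [["the", "chief", "of", "the", "egyptian", "diplomatic", "corps"]],
]
hypotheses = [
    ["the", "chief", "of", "the", "egyptian", "diplomacy"],
]

# Smoothing avoids zero scores when a higher-order n-gram has no match,
# which matters for short segments like this one.
bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print(f"Corpus BLEU: {bleu:.3f}")
```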

Slide 4: Task-based evaluation
- Metrics rely on the assumptions that:
  – MT errors more frequently destroy the contextual conditions which trigger rules
  – MT errors rarely create spurious contextual conditions
- Language redundancy: it is easier to destroy than to create
- E.g., (Rajman and Hartley, 2001):
  X-score = (#RELSUBJ + #RELSUBJPASS – #PADJ – #ADVADJ)  (sketch below)
  – sentential-level (+) vs. local (–) dependencies
  – contextual difficulties for automatic tools are roughly proportional to relative "quality" (the amount of MT "degradation")
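A minimal sketch of the X-score computation, assuming the four dependency-relation counts have already been obtained from a parser run over the MT output; the function name and the input format are illustrative, not part of the original metric definition.

```python
# Hedged sketch: X-score from dependency-relation counts (Rajman and Hartley, 2001).
# The counts would normally come from an automatic parser; here they are
# supplied directly for illustration.

def x_score(counts: dict) -> int:
    """Sentential-level relations add to the score, local ones subtract."""
    return (counts.get("RELSUBJ", 0)
            + counts.get("RELSUBJPASS", 0)
            - counts.get("PADJ", 0)
            - counts.get("ADVADJ", 0))

# Example counts for one translated text (illustrative numbers only).
print(x_score({"RELSUBJ": 12, "RELSUBJPASS": 3, "PADJ": 5, "ADVADJ": 2}))  # -> 8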

Slide 5: Task-based evaluation with NE recognition
- NER system (ANNIE): the number of extracted Organisation names gives an indication of adequacy (see the sketch below)
  – ORI: … le chef de la diplomatie égyptienne
  – HT: the Chief of the Egyptian Diplomatic Corps
  – MT-Systran: the chief of the Egyptian diplomacy
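A hedged sketch of the NE-based signal: count Organisation entities in a reference translation and in the MT output and compare. spaCy stands in here for GATE/ANNIE, which the slides actually refer to, so the pipeline name and entity labels are assumptions.

```python
# Hedged sketch: compare Organisation-entity counts in reference vs. MT output.
# spaCy is a stand-in for GATE/ANNIE; the model name is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")

def org_count(text: str) -> int:
    """Number of Organisation-type named entities the NER pipeline extracts."""
    return sum(1 for ent in nlp(text).ents if ent.label_ == "ORG")

ht = "the Chief of the Egyptian Diplomatic Corps"
mt = "the chief of the Egyptian diplomacy"

# Fewer organisation names surviving in the MT output suggests lower adequacy.
print(org_count(ht), org_count(mt))
```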

Slide 6: Task-based evaluation – number of NEs extracted from MT (figure not reproduced in the transcript)

Slide 7: Some limits of automated MT evaluation metrics
- Automated metrics are useful if applied properly
  – e.g., BLEU works for monitoring a system's progress, but not for comparing different systems: it does not reliably compare systems built with different architectures (SMT, RBMT, …) (Callison-Burch, Osborne and Koehn, 2006)
  – low correlation with human scores at the text/sentence level: a corpus of roughly 7,000 words is the minimum for acceptable correlation, and the metric is not very useful for error analysis

Slide 8: Limits of evaluation metrics – beyond correlation
- High correlation with human judgements is not enough
  – end users often need to predict human scores from the computed automated scores (is the MT acceptable?)
  – this requires regression parameters: the slope and intercept of the fitted line (see the sketch below)
- Regression parameters for BLEU (and its weighted extension WNM)
  – differ for each target language and domain / text type / genre
  – so BLEU needs re-calibration for each TL/domain combination (Babych, Hartley and Elliott, 2005)
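A minimal sketch of the calibration step described above, assuming paired per-text BLEU and human adequacy scores are available; the score arrays are placeholders, and scipy's linear regression stands in for whatever fitting procedure was actually used.

```python
# Hedged sketch: fit slope & intercept so automated scores can be mapped onto
# the human adequacy scale. All numbers are illustrative placeholders.
import numpy as np
from scipy.stats import linregress

bleu_scores = np.array([0.18, 0.22, 0.25, 0.31, 0.36, 0.41])  # per-text BLEU
human_adequacy = np.array([2.1, 2.4, 2.6, 3.0, 3.3, 3.7])     # per-text human scores

fit = linregress(bleu_scores, human_adequacy)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r={fit.rvalue:.2f}")

# Predict a human score for a new BLEU value; the calibration only holds for
# the target-language / domain combination it was fitted on.
new_bleu = 0.28
print("predicted adequacy:", fit.slope * new_bleu + fit.intercept)
```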

Slide 9: Sensitivity of automated evaluation metrics
- Two dimensions are not distinguished by the scores:
  – A. there are stronger and weaker systems
  – B. there are easier and more difficult texts / sentences
- A desired feature of automated metrics (in dimension B):
  – to distinguish correctly the quality of different sections translated by the same MT system
- Sensitivity is the ability of a metric to predict human scores for different sections of the evaluation corpus
  – easier sections receive higher human scores
  – can the metric also consistently rate them higher?

Slide 10: Sensitivity of automated metrics – research problems
- Are dimensions A and B independent? Or does sensitivity (dimension B) depend on the overall quality of an MT system (dimension A)?
  – i.e., does sensitivity change in different areas of the quality scale?
- Ideally, automated metrics should have homogeneous sensitivity across the entire human quality scale
  – for any automatic metric we would like to minimise such dependence

Slide 11: Varying sensitivity as a possible limit of automated metrics
- If sensitivity declines in a certain area of the scale, automated scores become less meaningful / reliable there
  – for comparing easy / difficult segments generated by the same MT system
  – but also for distinguishing between systems in that area (the metric is agnostic about the source of the difference)
- [diagram: scale of human scores from 0 through 0.5 to 1; comparison is more reliable at the lower end and less reliable at the higher end]

Slide 12: Experiment set-up – dependency between sensitivity and quality
- Stage 1: computing an approximated sensitivity for each system
  – BLEU scores for each text are correlated with human scores for the same text
- Stage 2: observing the dependency between sensitivity and the systems' quality
  – sensitivity scores for each system (from Stage 1) are correlated with the average human scores for the system
- The experiment is repeated for two types of automated metrics
  – reference proximity-based (BLEU)
  – task-based (GATE NE recognition)

Slide 13: Stage 1 – measuring the sensitivity of automated metrics
- Task: to cover different areas of the adequacy scale
  – we use a range of systems with different human adequacy scores
  – DARPA-94 corpus: 4 systems (1 SMT, 3 RBMT) + 1 human translation, 100 texts with human scores
- For each system, sensitivity is approximated as
  – the r-correlation between BLEU / GATE scores and human scores over the 100 texts

Slide 14: Stage 2 – capturing dependencies between a system's quality and sensitivity
- Sensitivity may depend on the overall quality of the system – is there such a tendency?
- System-level correlation between (see the sketch below)
  – sensitivity (the text-level correlation figure for each system)
  – and its average human scores
- A strong correlation is not desirable here:
  – e.g., a strong negative correlation means the automated metric loses sensitivity for better systems
  – a weak correlation means the metric's sensitivity does not depend on the systems' quality
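A hedged sketch of the two-stage computation, assuming per-text metric scores and human adequacy scores for each system are available as arrays; the random data and the system names are placeholders, not the DARPA-94 scores.

```python
# Hedged sketch of the two-stage sensitivity experiment.
# Stage 1: per-system sensitivity = correlation of per-text metric scores
#          with per-text human scores.
# Stage 2: correlate those sensitivities with each system's average human
#          score to see whether sensitivity depends on overall quality.
# All numbers below are illustrative placeholders, not the DARPA-94 data.
import numpy as np
from scipy.stats import pearsonr

systems = {
    name: {"metric": np.random.rand(100), "human": np.random.rand(100)}
    for name in ["system_a", "system_b", "system_c", "system_d", "human_tr"]
}

# Stage 1: text-level correlation per system (approximated sensitivity).
sensitivity = {n: pearsonr(s["metric"], s["human"])[0] for n, s in systems.items()}
avg_human = {n: float(np.mean(s["human"])) for n, s in systems.items()}

# Stage 2: system-level correlation between sensitivity and average quality.
names = list(systems)
r, p = pearsonr([sensitivity[n] for n in names], [avg_human[n] for n in names])
print(f"sensitivity vs. quality: r={r:.2f}")
# A strong negative r would mean the metric loses sensitivity for better systems.
```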

Slide 15: Compact description of the experiment set-up
- A formula (not reproduced in the transcript) describes the order of the experimental stages
  – computation or data + arguments in brackets (in the numerator and denominator)
  – capital letters = independent variables
  – lower-case letters = fixed parameters

Slide 16: Results
- System-level r-correlation is lower for NE-GATE: BLEU outperforms GATE
  – but correlation is not the only characteristic feature of a metric …

Slide 17: Results
- The sensitivity of BLEU is much more dependent on MT quality
  – BLEU is less sensitive for higher-quality systems

Slide 18: Results (contd.) (figure not reproduced in the transcript)

Slide 19: Discussion
- Reference proximity metrics use structural models
  – insensitive to errors at higher linguistic levels (better MT)
  – optimal correlation only for certain error types
- Task-based metrics use functional models
  – potentially can capture degradation at any level
  – e.g., better capture legitimate variation
- [diagram: scale of linguistic levels from morphosyntactic through lexical to textual cohesion/coherence, long-distance dependencies and textual function; performance-based metrics cover the higher levels, while reference-proximity metrics lose sensitivity for higher-level errors]

Slide 20: Conclusions and future work
- Sensitivity can be one of the limitations of automated MT evaluation metrics:
  – it influences the reliability of predictions at a given quality level
- Functional models which work at the textual level
  – can reduce the dependence of a metric's sensitivity on the systems' quality
- Way forward: developing task-based metrics using more adequate functional models
  – e.g., non-local information (models for textual coherence and cohesion, …)