BLEU, Its Variants & Its Critics Arthur Chan Prepared for Advanced MT Seminar.


This Talk
- Original BLEU score (Papineni 2002)
  - Motivation
  - Procedure
- NIST: a major BLEU variant
- Critics of BLEU
  - From alternate evaluation metrics
    - METEOR (Lavie 2004, Banerjee 2005)
  - From analysis of BLEU (Culy 2002)
- METEOR will be covered by Alon (next talk)

Bilingual Evaluation Understudy (BLEU)

Motivation of Automatic Evaluation in MT
- Human evaluations of MT weigh many aspects, such as
  - Adequacy
  - Fidelity
  - Fluency
- Human evaluation is expensive
- Human evaluation can take a long time
  - Meanwhile, systems change daily
- Good automatic evaluation could save human effort

BLEU – Why is it Important?
- Some reasons:
  - It was proposed by IBM
    - IBM has a long history of proposing evaluation standards
  - It was verified and improved by NIST
    - So a variant of it is used in evaluations
  - It is widely used
    - It appears throughout the MT literature after 2001
  - It is quite useful
    - It gives good feedback on the adequacy and fluency of translation results
  - It is not perfect
    - It is a subject of criticism (and the criticism makes some sense in this case)
    - It is a subject of extension

BLEU – Its Motivation
- Central idea: “The closer a machine translation is to a professional human translation, the better it is.”
- Implication
  - An evaluation metric can itself be evaluated
    - If it correlates with human evaluation, it is a useful metric
- BLEU was proposed as an aid: a quick substitute for human judges when needed

BLEU – What is it? A Big Picture
- Requires multiple good reference translations
- Depends on modified n-gram precision (or co-occurrence)
  - Co-occurrence: an n-gram in the translated sentence matches an n-gram in any of the reference sentences
- N-gram co-occurrence is computed per corpus
- n takes several values, and the precisions are combined as a weighted sum of their logarithms (a weighted geometric mean)
- Translations that are too short are penalized (brevity penalty)
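The last two points of the big picture can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the modified n-gram precisions have already been computed, and the function names `brevity_penalty` and `combine_bleu` are hypothetical. The brevity penalty follows the definition in Papineni et al.: 1 when the candidate is longer than the reference length, and exp(1 - r/c) otherwise.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates that are shorter than the reference length."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


def combine_bleu(precisions, candidate_len, reference_len, weights=None):
    """Combine n-gram precisions with a weighted geometric mean and the brevity penalty.

    `precisions` holds one modified precision per n-gram order (assumed precomputed).
    Uniform weights are used when none are given.
    """
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    # The logarithm is undefined at zero, so any zero precision gives a zero score.
    if min(precisions) <= 0.0:
        return 0.0
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)


# A candidate half as long as the reference is scaled by exp(-1).
print(brevity_penalty(5, 10))
print(combine_bleu([0.5, 0.5], 10, 10))
```

Because the precisions are multiplied together (in log space), one very low n-gram precision drags the whole score down, which is why a single zero precision yields a score of zero in this sketch.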

BLEU – N-gram Precision: a Motivating Example
- Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
- Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
- Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
- Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
- Reference 3: It is the practical guide for the army always to heed directions of the party.

BLEU – Modified N-gram Precision
- Issues with plain n-gram precision
  - It gives a very high score to over-generated n-grams
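The over-generation problem and its fix can be illustrated with a short Python sketch. It uses the well-known example from Papineni et al., where a candidate made only of the word "the" gets a perfect plain unigram precision. Modified precision clips each candidate n-gram count at the maximum count observed in any single reference. The helper names `ngrams` and `modified_precision` are hypothetical, and tokenization is assumed to be lowercase whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so repeated n-grams are not over-rewarded.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
# Plain precision would be 7/7; clipping at the reference maximum of 2 gives 2/7.
print(modified_precision(candidate, references, 1))
```

This sketch scores a single sentence. For a corpus-level score, the clipped counts and the total candidate counts would be summed over all sentences before dividing.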

BLEU – Brevity Penalty

BLEU – The “Trouble” with Recall

BLEU – Recall and Brevity Penalty

BLEU – Paradigm of Evaluation

BLEU – Evaluation of the Metric

BLEU – The Human Evaluation

BLEU – BLEU vs Human Evaluation

NIST – As a BLEU Variant

Usage of BLEU on Character-based Languages

Critics of BLEU – From Analysis of BLEU

Critics of BLEU – A Glance at Metrics Beyond BLEU

Critics of BLEU – Summary of BLEU’s Issues

Discussion - Should BLEU be the Standard Metric of MT?

References
- Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL
- George Doddington, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
- Etienne Denoual, Yves Lepage, BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters.
- Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman, The Significance of Recall in Automatic Metrics for MT Evaluation.
- Christopher Culy, Susanne Z. Riehemann, The Limits of N-Gram Translation Evaluation Metrics.
- Satanjeev Banerjee, Alon Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.