BLEU, Its Variants & Its Critics Arthur Chan Prepared for Advanced MT Seminar.


1 BLEU, Its Variants & Its Critics Arthur Chan Prepared for Advanced MT Seminar

2 This Talk
• Original BLEU scores (Papineni 2002)
  - Motivation
  - Procedure
• NIST: as a major BLEU variant
• Critics of BLEU
  - From alternate evaluation metrics
    • METEOR (Lavie 2004, Banerjee 2005)
  - From analysis of BLEU (Culy 2002)
• METEOR will be covered by Alon (next talk)

3 Bilingual Evaluation Understudy (BLEU)

4 Motivation of Automatic Evaluation in MT
• Human evaluations of MT weigh many aspects, such as
  - Adequacy
  - Fidelity
  - Fluency
• Human evaluations are expensive
• Human evaluation can take a long time
  - while systems need to change daily
• Good automatic evaluation could save human effort

5 BLEU – Why is it Important?
• Some reasons:
  - It was proposed by IBM
    • IBM has a long history of proposing evaluation standards
  - Verified and improved by NIST
    • So its variant is used in evaluations
  - Widely used
    • Appears everywhere in the MT literature after 2001
  - It is quite useful
    • It gives good feedback on the adequacy and fluency of translation results
  - It is not perfect
    • It is a subject of criticism (the critics make some sense in this case)
    • It is a subject of extension

6 BLEU – Its Motivation
• Central idea: “The closer a machine translation is to a professional human translation, the better it is.”
• Implication
  - An evaluation metric can itself be evaluated
    • If it correlates with human evaluation, it is a useful metric
• BLEU was proposed as an aid: a quick substitute for human judges when needed

7 BLEU – What is it? A Big Picture
• Requires multiple good reference translations
• Depends on modified n-gram precision (or co-occurrence)
  - Co-occurrence: an n-gram of the translated sentence also appears in any of the reference sentences
• N-gram co-occurrence is computed per corpus
• n can take several values, and a weighted sum of the log precisions is computed
• Overly brief translations are penalized
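For reference, the pieces listed on this slide combine as follows in Papineni et al. (2002). This is a summary of that paper's definition, not of the slide itself: the symbols c (total candidate length), r (effective reference length), N (largest n-gram order, 4 in the paper's baseline) and w_n (weights, uniform 1/N in the baseline) are taken from the paper.

```latex
% Modified (clipped) n-gram precision over the whole test corpus
p_n = \frac{\sum_{C \in \mathrm{Candidates}} \sum_{g \in \text{n-grams}(C)} \mathrm{Count}_{\mathrm{clip}}(g)}
           {\sum_{C' \in \mathrm{Candidates}} \sum_{g' \in \text{n-grams}(C')} \mathrm{Count}(g')}

% Brevity penalty: only candidates shorter than the references are penalized
\mathrm{BP} =
\begin{cases}
  1           & \text{if } c > r \\
  e^{1 - r/c} & \text{if } c \le r
\end{cases}

% Final score: brevity penalty times the weighted geometric mean of the p_n
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

The weighted sum is taken over log precisions, so the combined score is a geometric rather than an arithmetic mean of the p_n.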

8 BLEU – N-gram Precision: a Motivating Example
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
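The intuition behind this example can be checked with a minimal sketch of plain (unclipped) n-gram precision: the fraction of candidate n-grams that occur in at least one reference. The sentences are transcribed as they appear in Papineni et al. (2002). The function names and the tokenization (lowercasing, dropping punctuation) are assumptions made for illustration, not part of the slides.

```python
import re
from fractions import Fraction


def ngrams(sentence, n):
    """Lowercased word n-grams; punctuation is dropped (a simplifying assumption)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def plain_precision(candidate, references, n):
    """Fraction of candidate n-grams that occur in at least one reference."""
    seen = set()
    for ref in references:
        seen.update(ngrams(ref, n))
    cand = ngrams(candidate, n)
    hits = sum(1 for gram in cand if gram in seen)
    return Fraction(hits, len(cand))


candidate_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
candidate_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")
references = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions "
    "of the party.",
]

for name, cand in (("Candidate 1", candidate_1), ("Candidate 2", candidate_2)):
    scores = [str(plain_precision(cand, references, n)) for n in (1, 2)]
    print(name, "unigram/bigram precision:", scores)
```

Candidate 1 shares far more unigrams and bigrams with the references than Candidate 2, which is the ranking a human reader would also give.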

9 BLEU – Modified N-gram Precision
• Issue with plain n-gram precision
  - It gives a very good score to over-generated n-grams
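The over-generation problem and its fix can be sketched as follows. The candidate and references are the "the the the ..." example from Papineni et al. (2002), not from this slide. The function name, the `clip` flag and the tokenization are hypothetical choices for illustration. Clipping caps each candidate word's count at the largest number of times it appears in any single reference.

```python
import re
from collections import Counter
from fractions import Fraction


def tokens(sentence):
    """Lowercased words; punctuation is dropped (a simplifying assumption)."""
    return re.findall(r"[a-z]+", sentence.lower())


def unigram_precision(candidate, references, clip):
    """Unigram precision, with or without clipping of repeated words."""
    cand_counts = Counter(tokens(candidate))
    ref_counts = [Counter(tokens(ref)) for ref in references]
    hits = 0
    for word, count in cand_counts.items():
        # Largest count of this word in any single reference.
        max_ref = max(rc[word] for rc in ref_counts)
        if clip:
            hits += min(count, max_ref)   # modified precision: clip the count
        elif max_ref > 0:
            hits += count                 # plain precision: every repeat counts
    return Fraction(hits, sum(cand_counts.values()))


candidate = "the the the the the the the"
references = ["The cat is on the mat.", "There is a cat on the mat."]

print("plain:   ", unigram_precision(candidate, references, clip=False))
print("modified:", unigram_precision(candidate, references, clip=True))
```

Without clipping, the degenerate candidate gets a perfect score. With clipping, only two of its seven words are credited, because "the" occurs at most twice in a single reference.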

10 BLEU – Brevity Penalty

11 BLEU – The “Trouble” with Recall

12 BLEU – Recall and Brevity Penalty

13 BLEU – Paradigm of Evaluation

14 BLEU – Evaluation of the Metric

15 BLEU – The Human Evaluation

16 BLEU – BLEU vs Human Evaluation

17 NIST – As a BLEU Variant

18 Usage of BLEU on Character-based Languages

19 Critics of BLEU – From Analysis of BLEU

20 Critics of BLEU – A Glance at Metrics Beyond BLEU

21 Critics of BLEU – Summary of BLEU’s Issues

22 Discussion - Should BLEU be the Standard Metric of MT?

23 References
• Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL-02. 2002
• George Doddington, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics.
• Etienne Denoual, Yves Lepage, BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters.
• Alon Lavie, Kenji Sagae, Shyamsundar Jayaraman, The Significance of Recall in Automatic Metrics for MT Evaluation.
• Christopher Culy, Susanne Z. Riehemann, The Limits of N-Gram Translation Evaluation Metrics.
• Satanjeev Banerjee, Alon Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.

