Automatic Evaluation
Philipp Koehn
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology


Automatic Evaluation
● Why automatic evaluation metrics?
  – Manual evaluation is too slow
  – Evaluation on large test sets reveals minor improvements
  – Automatic tuning to improve machine translation performance
● History
  – Word Error Rate
  – BLEU since 2002
● BLEU in short: Overlap with reference translations

BLEU in Action
Reference translation: the gunman was shot to death by the police.
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
What is the best translation?

BLEU in Action
The candidate translations above are color-coded by how they match the reference translation:
green = 4-gram match (good!)
cyan = 3-gram match
blue = 2-gram match
purple = 1-gram match
red = word not matched (bad!)

Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an e-mail from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an e-mail from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.

DARPA MT Evaluation Corpus
11 Human Translations of 100 Chinese News Articles
● At least 12 people were killed in the battle last week.
● Last week's fight took at least 12 lives.
● The fighting last week killed at least 12.
● The battle of last week killed at least 12 persons.
● At least 12 people lost their lives in last week's fighting.
● At least 12 persons died in the fighting last week.
● At least 12 died in the battle last week.
● At least 12 people were killed in the fighting last week.
● During last week's fighting, at least 12 people died.
● Last week at least twelve people died in the fighting.
● Last week's fighting took the lives of twelve people.

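BLEU in Theory
● How many n-grams in the output match n-grams in the reference?
● Usually 1-grams to 4-grams
● Length penalty (brevity penalty, BP) to ensure that the output is of similar length
● $\mathrm{BLEU} = \mathrm{BP} \cdot \exp(w_1 \log p_1 + \dots + w_4 \log p_4)$
● $p_n = \text{correct } n\text{-grams} \,/\, \text{count of } n\text{-grams in output}$
● $\mathrm{BP} = \min\bigl(1, \exp(1 - \text{length}_{\text{reference}} / \text{length}_{\text{output}})\bigr)$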
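The computation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation, and it makes several assumptions that the slide does not state: tokens are split on whitespace, there is a single reference, the weights are uniform (w_n = 1/4), matching n-grams are clipped to their count in the reference, and the brevity penalty takes the standard form min(1, exp(1 - reference length / output length)). The function names are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output, reference, max_n=4):
    """BLEU for one output against one reference, with uniform weights 1/max_n."""
    out, ref = output.split(), reference.split()
    if not out:
        return 0.0
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        total = sum(out_counts.values())
        if correct == 0:
            return 0.0  # log(0) is undefined; the geometric mean collapses to zero
        log_precisions += math.log(correct / total) / max_n
    # Brevity penalty: only outputs shorter than the reference are penalised.
    bp = min(1.0, math.exp(1 - len(ref) / len(out)))
    return bp * math.exp(log_precisions)


reference = "the gunman was shot to death by the police."
print(bleu("the gunman was shot to death by the police.", reference))
print(bleu("the gunman was shot dead by the police.", reference))
```

An output identical to the reference scores 1.0, and the near-miss "shot dead" output scores strictly between 0 and 1. Real evaluations pool the n-gram counts over a whole test set rather than scoring one sentence at a time.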

BLEU Tends to Predict Human Judgments
[Figure: slide from G. Doddington (NIST), using a variant of BLEU]

Developing with BLEU
● Track improvements and quit dead ends early

Optimize Systems for BLEU
[Diagram: Foreign input → Translation System (automatic, trainable) → English MT output → Translation Quality Evaluator (automatic), which compares the output with English reference translations (sample “right answers”) → BLEU score]
● Learning algorithm for directly reducing translation error → big improvements in quality
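The idea of tuning a system directly against BLEU can be illustrated with a toy example. The sketch below is a plain grid search over a single model weight, which is far simpler than the learning algorithms used in practice. Everything in it is hypothetical: the n-best lists, the two feature scores per hypothesis, the linear scoring rule, and the function names. To keep the example short, BLEU is computed with 1-grams and 2-grams only and with counts pooled over the test set.

```python
import math
from collections import Counter


def corpus_bleu(outputs, references, max_n=2):
    """Corpus-level BLEU with uniform weights; n-gram counts are pooled over sentences."""
    out_len = sum(len(o.split()) for o in outputs)
    ref_len = sum(len(r.split()) for r in references)
    log_sum = 0.0
    for n in range(1, max_n + 1):
        correct = total = 0
        for out, ref in zip(outputs, references):
            o, r = out.split(), ref.split()
            oc = Counter(tuple(o[i:i + n]) for i in range(len(o) - n + 1))
            rc = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            correct += sum(min(c, rc[g]) for g, c in oc.items())
            total += sum(oc.values())
        if correct == 0:
            return 0.0
        log_sum += math.log(correct / total) / max_n
    return min(1.0, math.exp(1 - ref_len / out_len)) * math.exp(log_sum)


def tune_weight(nbest_lists, references, grid):
    """Try each weight, pick the top hypothesis per sentence, keep the weight with best BLEU."""
    best_w, best_score = None, -1.0
    for w in grid:
        # Hypothetical model score: first feature plus w times second feature.
        chosen = [max(nbest, key=lambda h: h[1] + w * h[2])[0] for nbest in nbest_lists]
        score = corpus_bleu(chosen, references)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Made-up n-best lists: (translation, feature 1, feature 2) per hypothesis.
nbest_lists = [
    [("the gunman was police kill .", -1.0, -6.0),
     ("the gunman was shot dead by the police .", -2.0, -3.0)],
    [("at least 12 people killed in fight last week .", -1.5, -5.0),
     ("at least 12 people were killed in the battle last week .", -2.5, -2.0)],
]
references = [
    "the gunman was shot to death by the police .",
    "at least 12 people were killed in the fighting last week .",
]

print(tune_weight(nbest_lists, references, [0.0, 0.5, 1.0]))
```

With a weight of 0.0 the model prefers the first hypothesis in each list. Raising the weight makes it switch to the second hypotheses, which overlap more with the references, so the search settles on a non-zero weight.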

Criticisms of BLEU
● Not sensitive to global syntactic structure
● Some words are more important than others (“not” vs. “the”)
● Score by itself is not very meaningful (is 0.34 good?)
... but does this matter?
... can it be fixed?

Is BLEU perfect?
● A very useful tool at this point
● Some caveats
  – Only makes sense for large test sets (thousands of sentences)
  – BLEU does not work for single sentences
● Problems with BLEU have to be demonstrated by lack of correlation with human judgments
  Nobody cares about anecdotal criticism
● Can BLEU be improved?
  There is a lot of work in MT Evaluation...
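The caveat about single sentences can be made concrete with the gunman example from the earlier slide. The sketch below counts matching n-grams for "police killed the gunman.", a reasonable translation that shares no 3-gram or 4-gram with the reference, so its sentence-level BLEU is zero. It then pools the counts with a second output, as a test-set-level score would. The tokenization rule and function names are assumptions for this illustration, and reusing one reference for both outputs is artificial.

```python
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace, treating each period as its own token."""
    return sentence.lower().replace(".", " .").split()


def match_counts(output, reference, n):
    """Return (clipped matching n-grams, total n-grams) for one sentence pair."""
    out, ref = tokenize(output), tokenize(reference)
    out_counts = Counter(tuple(out[i:i + n]) for i in range(len(out) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return correct, sum(out_counts.values())


reference = "the gunman was shot to death by the police."
outputs = ["police killed the gunman.", "the gunman was shot dead by the police."]

# Sentence level: the first output has no matching 3-grams or 4-grams.
single = [match_counts(outputs[0], reference, n) for n in range(1, 5)]

# Test-set level: counts are summed over all sentences before dividing.
pooled = []
for n in range(1, 5):
    pairs = [match_counts(o, reference, n) for o in outputs]
    pooled.append((sum(c for c, _ in pairs), sum(t for _, t in pairs)))

print(single)
print(pooled)
```

Because the pooled 3-gram and 4-gram counts are non-zero, the geometric mean no longer collapses to zero. This is one reason the score only becomes informative on large test sets.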