Automatic Evaluation Philipp Koehn Computer Science and Artificial Intelligence Lab Massachusetts Institute of Technology
Automatic Evaluation
● Why automatic evaluation metrics?
– Manual evaluation is too slow
– Evaluation on large test sets reveals minor improvements
– Automatic tuning to improve machine translation performance
● History
– Word Error Rate
– BLEU since 2002
● BLEU in short: Overlap with reference translations
BLEU in Action
Reference translation: the gunman was shot to death by the police.
#1 the gunman was police kill.
#2 wounded police jaya of
#3 the gunman was shot dead by the police.
#4 the gunman arrested by police kill.
#5 the gunmen were killed.
#6 the gunman was shot to death by the police.
#7 gunmen were killed by police
#8 al by the police.
#9 the ringer is killed by the police.
#10 police killed the gunman.
What is the best translation?
BLEU in Action: color coding of the candidate translations
green = 4-gram match (good!)
cyan = 3-gram match
blue = 2-gram match
purple = 1-gram match
red = word not matched (bad!)
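The color coding boils down to asking how many of a candidate's n-grams also occur in the reference. A minimal sketch of that count for three of the candidates, assuming plain whitespace tokenization with the final period as its own token; the helper names `ngrams` and `matched_ngrams` are made up for illustration and are not part of any BLEU toolkit:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matched_ngrams(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference (Counter intersection keeps the minimum count).
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    return sum((cand & ref).values())


# Assumption: the period is split off as a separate token.
reference = "the gunman was shot to death by the police ."
candidates = {
    3: "the gunman was shot dead by the police .",
    6: "the gunman was shot to death by the police .",
    10: "police killed the gunman .",
}

for number, text in candidates.items():
    counts = [matched_ngrams(text, reference, n) for n in range(1, 5)]
    print(f"#{number}: matched 1- to 4-grams = {counts}")
```

Candidate #6 matches at every order, #3 loses the n-grams that contain "dead", and #10 has no match longer than a 2-gram even though a human reader would accept it.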
Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places.
Reference translation 3: The US International Airport of Guam and its office has received an e-mail from a self-claimed Arabian millionaire named Laden, which threatens to launch a biochemical attack on such public places as airport. Guam authority has been on alert.
Reference translation 4: US Guam International Airport and its office received an e-mail from Mr. Bin Laden and other rich businessman from Saudi Arabia. They said there would be biochemistry air raid to Guam Airport and other public places. Guam needs to be in high precaution about this matter.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail, which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack, [?] highly alerts after the maintenance.
DARPA MT Evaluation Corpus
11 Human Translations of 100 Chinese News Articles
At least 12 people were killed in the battle last week.
Last week's fight took at least 12 lives.
The fighting last week killed at least 12.
The battle of last week killed at least 12 persons.
At least 12 people lost their lives in last week's fighting.
At least 12 persons died in the fighting last week.
At least 12 died in the battle last week.
At least 12 people were killed in the fighting last week.
During last week's fighting, at least 12 people died.
Last week at least twelve people died in the fighting.
Last week's fighting took the lives of twelve people.
BLEU in Theory
● How many n-grams in the output match n-grams in the reference?
● Usually 1-grams to 4-grams
● Length penalty (brevity penalty, BP) to ensure that the output is of similar length to the reference
● BLEU = BP * exp(w1 * log p1 + w2 * log p2 + w3 * log p3 + w4 * log p4)
● pn = correct n-grams / count n-grams in output
● BP = min(1, exp(1 - length_reference / length_output))
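The pieces of this slide can be put together in a few lines of code. What follows is a rough sketch, not a reference implementation, and it rests on several assumptions: whitespace tokenization, one reference per output, uniform weights wn = 1/4, matches clipped to the reference count as in the original BLEU definition, and the brevity penalty in its standard form exp(1 - length_reference / length_output) capped at 1. The function names `ngram_counts` and `bleu` are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(outputs, references, max_n=4):
    """Corpus-level BLEU, one reference per output, uniform weights.

    Assumptions of this sketch: whitespace tokenization, no smoothing,
    matches clipped to the reference count.
    """
    matched = [0] * max_n   # correct n-grams, summed over the test set
    total = [0] * max_n     # n-grams in the output, summed over the test set
    output_length = reference_length = 0

    for output, reference in zip(outputs, references):
        out_tokens, ref_tokens = output.split(), reference.split()
        output_length += len(out_tokens)
        reference_length += len(ref_tokens)
        for n in range(1, max_n + 1):
            out_ngrams = ngram_counts(out_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            matched[n - 1] += sum((out_ngrams & ref_ngrams).values())
            total[n - 1] += sum(out_ngrams.values())

    # Without smoothing, a single zero precision makes the whole score zero.
    if output_length == 0 or min(matched) == 0:
        return 0.0

    # Uniform weights: wn = 1 / max_n for every n-gram order.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - reference_length / output_length))
    return brevity_penalty * math.exp(log_precision)


reference = "the gunman was shot to death by the police ."
print(bleu(["the gunman was shot to death by the police ."], [reference]))
print(bleu(["the gunman was shot dead by the police ."], [reference]))
print(bleu(["police killed the gunman ."], [reference]))
```

In this sketch the last call returns 0.0, because "police killed the gunman ." shares no 3-gram with the reference. Counts are summed over the whole test set before the precisions are formed, so with many sentences a zero count at some order becomes unlikely.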
BLEU Tends to Predict Human Judgments
[Figure: slide from G. Doddington (NIST), using a variant of BLEU]
Developing with BLEU Track improvements – quit dead ends early
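Optimize Systems for BLEU
● Foreign input goes into the Translation System (Automatic, Trainable), which produces English MT Output
● The Translation Quality Evaluator (Automatic) compares the MT output with English Reference Translations (sample “right answers”) and produces a BLEU score
● A learning algorithm for directly reducing translation error uses this score to adjust the translation system
● Result: big improvements in quality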
Criticisms of BLEU
● Not sensitive to global syntactic structure
● Some words are more important than others (“not” vs. “the”)
● Score by itself is not very meaningful (is 0.34 good?)
... but does this matter?
... can it be fixed?
Is BLEU perfect?
● A very useful tool at this point
● Some caveats
– Only makes sense for large test sets (1000s of sentences)
– BLEU does not work for single sentences
● Problems with BLEU have to be demonstrated by lack of correlation with human judgments
– Nobody cares about anecdotal criticism
● Can BLEU be improved? There is a lot of work in MT Evaluation...