Adaptable Automatic Evaluation Metrics for Machine Translation. Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati.

Outline
- BLEU and ROUGE metric families
- BLANC - a family of adaptable metrics
  - All common skip n-grams
  - Local n-gram model
  - Overall model
- Experiments and results
- Conclusions
- Future work
- References

Automatic Evaluation Metrics
- Manual human judgments
- Edit distance (WER)
- Word overlap (PER)
- Metrics based on n-grams
  - n-gram precision (BLEU)
  - weighted n-grams (NIST)
  - longest common subsequence (Rouge-L)
  - skip 2-grams (pairs of ordered words - Rouge-S)
- Integrate additional knowledge (synonyms, stemming) (METEOR)
[Figure: timeline of metrics; axes: time vs. translation quality(candidate | reference)]

Automatic Evaluation Metrics
- Manual human judgments
- Machine translation (MT) evaluation metrics
  - Manually created estimators of quality
  - Improvements often shown on the same data
  - Rigid notion of quality
  - Based on existing judgment guidelines
- Goal: a trainable evaluation metric
[Figure: timeline of metrics; axes: time vs. translation quality(candidate | reference)]

Goal: Trainable MT Metric
- Build on the features used by established metrics (BLEU, ROUGE)
- Extendable - additional features/processing
- Correlate well with human judgments
- Trainable models
  - Different notions of "translation quality", e.g. computer consumption vs. human consumption
  - Different features will be more important for different languages and domains

The WER Metric
- Transform the reference (human) translation R into the candidate (machine) translation C using the Levenshtein (edit) distance
- Word Error Rate = (# of word insertions, deletions, and substitutions) / (# words in R)
Example:
  R: the students asked the professor
  C: the students talk professor
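
A minimal sketch of this computation in Python, assuming whitespace tokenization; the function name wer is illustrative, and the example sentences are the ones from the slide.

```python
def wer(reference: str, candidate: str) -> float:
    """Word Error Rate: word-level edit distance (insertions, deletions,
    substitutions) between R and C, divided by the number of words in R."""
    r, c = reference.split(), candidate.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(c) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            sub = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(c)] / len(r)

# Slide example: one substitution ("asked" -> "talk") and one deletion ("the"),
# so WER = 2 / 5 = 0.4.
print(wer("the students asked the professor", "the students talk professor"))
```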

The PER Metric
- Word overlap between the candidate (machine) translation C and the reference (human) translation R (bag of words)
- Position-independent Error Rate = ( Σ_{w in C} | count of w in R - count of w in C | ) / (# words in R)
Example:
  R: the students asked the professor
  C: the students talk professor
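
A small sketch following the formula on this slide (summing over the words in C), again assuming whitespace tokenization; the function name per is illustrative.

```python
from collections import Counter

def per(reference: str, candidate: str) -> float:
    """Position-independent Error Rate: bag-of-words count mismatch between
    the candidate and the reference, normalized by the reference length."""
    r_counts = Counter(reference.split())
    c_counts = Counter(candidate.split())
    mismatch = sum(abs(r_counts[w] - c_counts[w]) for w in c_counts)
    return mismatch / sum(r_counts.values())

# Slide example: "the" appears once instead of twice and "talk" is spurious,
# so the mismatch is 2 and PER = 2 / 5 = 0.4.
print(per("the students asked the professor", "the students talk professor"))
```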

The BLEU Metric
- Contiguous n-gram overlap between the reference (human) translation R and the candidate (machine) translation C
- Modified n-gram precisions
  - 1-gram precision = 3/4
  - 2-gram precision = 1/3
  - ...
- BLEU = ( Π_{i=1..n} P_i-gram )^(1/n) * (brevity penalty)
Example:
  R: the students asked the professor
  C: the students talk professor
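
A sentence-level, single-reference sketch of modified n-gram precision and the combination above; real BLEU is computed at the corpus level over multiple references, so the helper names and the brevity-penalty handling here are illustrative.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    its count in the reference before computing precision."""
    r_counts = Counter(ngrams(reference.split(), n))
    c_counts = Counter(ngrams(candidate.split(), n))
    clipped = sum(min(count, r_counts[g]) for g, count in c_counts.items())
    return clipped / max(sum(c_counts.values()), 1)

def bleu(reference, candidate, max_n=4):
    """Geometric mean of modified precisions times a brevity penalty."""
    r_len, c_len = len(reference.split()), len(candidate.split())
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:          # geometric mean is zero if any order has no match
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

R = "the students asked the professor"
C = "the students talk professor"
print(modified_precision(R, C, 1))    # 3/4, as on the slide
print(modified_precision(R, C, 2))    # 1/3
print(bleu(R, C, max_n=2))            # restricted to bigrams: this toy pair has no common trigram
```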

The BLEU Metric
- BLEU is the most established evaluation metric in MT
- Basic feature: contiguous n-grams of all sizes
- Computes modified precision
- Uses a simple formula to combine all precision scores
- Bigram precision is "as important" as unigram precision
- Brevity penalty - quasi recall

The Rouge-L Metric
- Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R
- Precision = LCS(C,R) / (# words in C)
- Recall = LCS(C,R) / (# words in R)
- Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P+R)
Example:
  R: the students asked the professor
  C: the students talk professor
  LCS = 3 ("the students ... professor")
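
A short sketch of Rouge-L using the standard LCS dynamic program; the function names are illustrative.

```python
def lcs_length(r_tokens, c_tokens):
    """Length of the longest common subsequence: words may be skipped,
    but their relative order must be preserved."""
    dp = [[0] * (len(c_tokens) + 1) for _ in range(len(r_tokens) + 1)]
    for i in range(1, len(r_tokens) + 1):
        for j in range(1, len(c_tokens) + 1):
            if r_tokens[i - 1] == c_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # harmonic mean

# Slide example: LCS = "the students ... professor" = 3, so P = 3/4, R = 3/5.
print(rouge_l("the students asked the professor", "the students talk professor"))
```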

The Rouge-S Metric
- Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R
Example:
  R: the students asked the professor
  C: the students talk professor
  Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }
  Skip2(C,R) = 3: { "the students", "the professor", "students professor" }

The Rouge-S Metric
- Skip 2-gram overlap of the candidate (machine) translation C and the reference (human) translation R
- Precision = Skip2(C,R) / (|C| choose 2)
- Recall = Skip2(C,R) / (|R| choose 2)
- Rouge-S = harmonic mean(Precision, Recall)
Example:
  R: the students asked the professor
  C: the students talk professor
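
A sketch of Rouge-S that enumerates skip bigrams with multiplicity and uses the choose-2 denominators from the slide; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb

def skip_bigram_counts(tokens):
    """All ordered word pairs, with any gap allowed between the two words."""
    return Counter(combinations(tokens, 2))

def rouge_s(reference, candidate):
    r, c = reference.split(), candidate.split()
    # Counter intersection takes the minimum count of each shared skip bigram.
    overlap = sum((skip_bigram_counts(r) & skip_bigram_counts(c)).values())
    precision = overlap / comb(len(c), 2)
    recall = overlap / comb(len(r), 2)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Slide example: Skip2(C) = 6, Skip2(C,R) = 3, so P = 3/6 and R = 3/10.
print(rouge_s("the students asked the professor", "the students talk professor"))
```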

The ROUGE Metrics
- Rouge-L
  - Basic feature: longest common subsequence (LCS) - the size of the longest common skip n-gram
  - Weighted LCS
- Rouge-S
  - Basic feature: skip bigrams
  - Skip bigram gap size is irrelevant
  - Limited to n-grams of size 2
- Both use the harmonic mean (F1-measure) to combine precision and recall

Is BLEU Trainable?
- BLEU = ( Π_{i=1..n} P_i-gram )^(1/n) * (brevity penalty)
- Can we assign/learn the relative importance of P_2 and P_3?
- Simplest model: regression
  - Train/test on past MT output [C,R]
  - Inputs: P_1, P_2, P_3, ... and the brevity penalty
  - (P_1, P_2, P_3, ..., bp) -> HJ (human judgment) fluency score
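
A toy illustration of the "simplest model: regression" idea. The feature rows (P_1..P_4 and the brevity penalty) and the fluency scores below are made up; the point is only the shape of the fit, replacing BLEU's fixed uniform weighting with learned weights.

```python
import numpy as np

# Hypothetical training data: one row per candidate/reference pair, with
# features (P1, P2, P3, P4, brevity penalty) and a human fluency judgment.
X = np.array([
    [0.75, 0.33, 0.10, 0.05, 0.95],
    [0.80, 0.40, 0.20, 0.10, 1.00],
    [0.60, 0.25, 0.08, 0.02, 0.90],
    [0.90, 0.55, 0.30, 0.15, 1.00],
    [0.70, 0.30, 0.12, 0.04, 0.85],
    [0.85, 0.45, 0.22, 0.09, 0.98],
])
y = np.array([2.5, 3.0, 2.0, 4.0, 2.2, 3.5])   # human fluency scores

# Least-squares fit with an intercept column appended to the features.
A = np.hstack([X, np.ones((len(X), 1))])
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
predicted = A @ weights                          # learned metric scores
print(weights, predicted)
```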

Is Rouge Trainable?
- Simple regression on
  - the size of the longest common skip n-gram
  - the number of common skip 2-grams
- Second-order parameters (dependencies) - the model is no longer linear in its inputs
  - Window size ws (for computational reasons)
  - F-measure parameter β_F (replacing the brevity penalty)
- Potential models
  - Iterative methods
  - Hill climbing?
- Non-linear model: (bp, |LCS|, Skip2, β_F, ws) -> HJ fluency score

The BLANC Metric Family
- Generalization of established evaluation metrics: the n-gram features used by BLEU and ROUGE
- Trainable parameters
  - Skip n-gram contiguity in C
  - Relative importance of n (i.e. bigrams vs. trigrams)
  - Precision-recall balance
- Adaptability to different translation quality criteria, languages, domains
- Allows additional processing/features (e.g. METEOR matching)

All Common Skip N-grams
Example:
  C: the one pure student brought the necessary condiments
  R: the new student brought the food
Matched word pairs (reference position, candidate position): the(0,0), the(0,5), the(4,0), the(4,5), student(2,3), brought(3,4)
Common skip n-gram counts: # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1

All Common Skip N-grams (continued)
[Figure: each matched skip n-gram, e.g. (the(0,0), student(2,3)), receives a score; the scores are aggregated per size into score(1-grams), score(2-grams), score(3-grams), score(4-grams)]

All Common Skip N-grams
- Algorithms literature: all common subsequences
- Listing vs. counting subsequences
- Interested in counting the # of common subsequences of size 1, 2, 3, ...
- Replace counting with a sum of scores over all n-grams of the same size
  - Score(w_1...w_i, w_i+1...w_n) = Score(w_1...w_i) * Score(w_i+1...w_n)
- BLANC_i(C,R) = f(common i-grams of C,R)
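
One way to count common skip n-grams of each size without listing them is a dynamic program over matched word pairs. The sketch below is an assumed implementation of that counting idea, not the authors' code; on the slide's example it finds 6 common skip 2-grams, 4 skip 3-grams, and 1 skip 4-gram.

```python
def common_skip_ngram_counts(candidate, reference, max_n=4):
    """Count matched skip n-gram occurrences of each size between C and R.
    end[k][(i, j)] = number of common subsequences of length k whose last
    word is the match (candidate[i], reference[j])."""
    c, r = candidate.split(), reference.split()
    matches = [(i, j) for i in range(len(c)) for j in range(len(r)) if c[i] == r[j]]
    end = {1: {m: 1 for m in matches}}
    for k in range(2, max_n + 1):
        end[k] = {}
        for (i, j) in matches:
            # Extend every shorter common skip n-gram that ends strictly
            # before this match in both the candidate and the reference.
            total = sum(v for (pi, pj), v in end[k - 1].items() if pi < i and pj < j)
            if total:
                end[k][(i, j)] = total
    return {k: sum(end[k].values()) for k in end}

C = "the one pure student brought the necessary condiments"
R = "the new student brought the food"
print(common_skip_ngram_counts(C, R))   # sizes 2, 3, 4 give 6, 4, 1 for this pair
```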

Modeling Gap Size Importance
Skip 3-grams with different gap sizes:
  ... the ____ ____ ____ ____ student ____ ____ has ...
  ... the ____ student has ...
  ... the student has ...

Modeling Gap Size Importance
- Model the importance of skip n-gram gap size as an exponential function with one parameter (α)
- Special cases
  - Gap size doesn't matter (Rouge-S): α = 0
  - No gaps are allowed (BLEU): α = large number
C: ... the __ __ __ __ student __ __ has ...

Modeling Candidate-Reference Gap Difference
Skip 3-gram match:
  C1: ... the ____ ____ ____ ____ student ____ ____ has ...
  R:  ... the ____ student has ...
  C2: ... the student has ...

Modeling Candidate-Reference Gap Difference
- Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter (β)
- Special cases
  - Gap size differences do not matter: β = 0
  - Skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2
  - Largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS
C: ... the __ __ __ __ student __ __ has ...
R: ... the __ student has ...

Skip N-gram Model
- Incorporate simple scores into an exponential model
  - Skip n-gram gap size
  - Candidate-reference gap size difference
- Possible to incorporate higher-level features
  - Partial skip n-gram matching (e.g. synonyms, stemming): "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
  - From word classing to syntax, e.g. score("students __ __ professor") vs. score("the __ __ of")
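
A sketch of how such an exponential score for one matched skip n-gram might look, using the two features named on the slides: gap size inside the candidate and the candidate-reference gap difference. The parameter names alpha and beta, their values, and the exact feature definitions are assumptions for illustration, not BLANC's actual parameterization.

```python
import math

def skip_ngram_score(c_positions, r_positions, alpha=0.1, beta=0.1):
    """Score one matched skip n-gram from its word positions in the candidate
    and in the reference. Gaps inside the candidate match are penalized by
    alpha; disagreement between candidate and reference gaps by beta."""
    c_gaps = [j - i - 1 for i, j in zip(c_positions, c_positions[1:])]
    r_gaps = [j - i - 1 for i, j in zip(r_positions, r_positions[1:])]
    gap_penalty = sum(c_gaps)
    diff_penalty = sum(abs(cg - rg) for cg, rg in zip(c_gaps, r_gaps))
    return math.exp(-alpha * gap_penalty - beta * diff_penalty)

# Slide example: "the ... student ... has" with gaps of 4 and 2 words in C,
# matched against gaps of 1 and 0 words in R.
print(skip_ngram_score([0, 5, 8], [0, 2, 3], alpha=0.1, beta=0.1))
# alpha = 0 recovers Rouge-S-style counting (gap size ignored);
# a very large alpha only rewards contiguous n-grams, as in BLEU.
```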

BLANC Overview
[Pipeline: Candidates + References -> find all common skip n-grams -> compute skip n-gram pair features exp(-Σ_i λ_i f_i(sn)) -> combine all common skip n-gram scores (global parameters: precision/recall, f(skip n-gram size)) -> compute correlation coefficient (Pearson, Spearman) against the chosen criterion (adequacy, fluency, f(adequacy, fluency), other) -> trained metric]

Incorporating Global Features
- Compute BLANC precision and recall for each n-gram size i
- Global exponential model based on
  - n-gram size: BLANC_i(C,R), i = 1..n
  - F-measure parameter β_F for each size i
  - Average reference segment size
  - Other scores (i.e. BLEU, ROUGE-L, ROUGE-S), ...
- Train for the average human judgment vs. train for the best overall correlation (as the error function)
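
A rough sketch of combining per-size precision and recall with an F-measure and a trainable weight per n-gram size. The function names, the made-up weights, and the log-linear combination are assumptions meant only to make the structure concrete; the actual global model in BLANC may differ.

```python
import math

def f_measure(precision, recall, beta=1.0):
    """F_beta: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def blanc_global(per_size_pr, size_weights, beta=1.0):
    """Combine per-n-gram-size scores with a log-linear model over sizes;
    the size weights and beta would be the trainable global parameters."""
    score = 0.0
    for n, (p, r) in per_size_pr.items():
        score += size_weights.get(n, 0.0) * math.log(max(f_measure(p, r, beta), 1e-9))
    return math.exp(score)

# Hypothetical per-size (precision, recall) values and learned size weights:
per_size = {1: (0.75, 0.60), 2: (0.33, 0.30), 3: (0.10, 0.08)}
weights = {1: 0.5, 2: 0.3, 3: 0.2}
print(blanc_global(per_size, weights, beta=1.2))
```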

Experiment Setup
- TIDES evaluation data: Arabic -> English, 2003 and 2004
- Training and test sentences separated by year
- Optimized: n-gram contiguity, difference in gap size (C vs. R), balance between precision and recall
- Correlation measured with the Pearson correlation coefficient
- Compared BLANC to BLEU and ROUGE
- Trained BLANC for
  - fluency vs. adequacy
  - system level vs. sentence level
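
The Pearson correlation between metric scores and human judgments can be computed directly; the numbers below are made up for illustration.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between metric scores and human judgments."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical metric scores and human adequacy judgments for five systems:
print(pearson([0.31, 0.28, 0.35, 0.22, 0.30], [3.1, 2.9, 3.6, 2.4, 3.0]))
```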

Tides 2003 Arabic Evaluation
Pearson [-1,1] correlation with human judgments at system level and sentence level, for adequacy and fluency, comparing BLEU, NIST, Rouge-L, Rouge-S, and BLANC.
[Table: correlation values not preserved in this transcript]

Tides 2004 Arabic Evaluation
Pearson [-1,1] correlation with human judgments at system level and sentence level, for adequacy and fluency, comparing BLEU, NIST, Rouge-L, Rouge-S, and BLANC.
[Table: correlation values not preserved in this transcript]

Advantages of BLANC
- Consistently good performance
- Candidate evaluation is fast
- Adaptable: fluency and adequacy, languages, domains
- Helps train MT systems for specific tasks, e.g. information extraction, information retrieval
- Model complexity
- Can be optimized for specific MT system performance levels

Disadvantages of BLANC
- Training data vs. number of parameters
- Model complexity
- Guarantees of the training process

Conclusions
- Move towards learning evaluation metrics
  - Quality criteria - e.g. fluency, adequacy
  - Correlation coefficients - e.g. Pearson, Spearman
  - Languages - e.g. English, Arabic, Chinese
- BLANC - a family of trainable evaluation metrics
  - Consistently performs well at evaluating machine translation output

Future Work
- Recently obtained a two-year NSF grant
- Try different models and improve the training mechanism for BLANC
  - Is a local exponential model the best choice?
  - Is a global exponential model the best choice?
  - Explore different training methods
- Integrate additional features
- Apply BLANC to other tasks (summarization)

References
- Leusch, Ueffing, Vilar and Ney, "Preprocessing and Normalization for Automatic Evaluation of Machine Translation", IEEMTS Workshop, ACL 2005
- Lin and Och, "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics", ACL 2004
- Lita, Lavie and Rogati, "BLANC: Learning Evaluation Metrics for MT", HLT-EMNLP 2005
- Papineni, Roukos, Ward and Zhu, "BLEU: A Method for Automatic Evaluation of Machine Translation", IBM Report 2002
- Akiba, Imamura and Sumita, "Using Multiple Edit Distances to Automatically Rank Machine Translation Output", MT Summit VIII 2001
- Su, Wu and Chang, "A New Quantitative Quality Measure for a Machine Translation System", COLING 1992

Thank you

Acronyms, acronyms ...
- Official: Broad Learning Adaptation for Numeric Criteria
- Inspiration: white light contains light of all frequencies
- Fun: Building on Legacy Acronym Naming Conventions
- Bleu, Rouge, Orange, Pourpre ... Blanc?