Measuring Confidence Intervals for MT Evaluation Metrics
Ying Zhang (Joy), Stephan Vogel
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Slide 2: Outline
Automatic Machine Translation Evaluation
– BLEU
– Modified BLEU
– NIST MTEval
Confidence Intervals Based on the Bootstrap Percentile
– Algorithm
– Comparing two MT systems
– Implementation
Discussions
– How much testing data is needed?
– How many reference translations are needed?
– How many bootstrap samples are needed?

Slide 3: Automatic Machine Translation Evaluation
Subjective MT evaluation
– Fluency and adequacy scored by human judges
– Very expensive in time and money
Objective automatic MT evaluation
– Inspired by the Word Error Rate metric used in ASR research
– Measures the "closeness" between the MT hypothesis and human reference translations
– Precision: n-gram precision
– Recall: against the best-matched reference, approximated by the brevity penalty
– Cheap and fast
– Highly correlated with subjective evaluations
– MT research has greatly benefited from automatic evaluation
– Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM

Slide 4: BLEU Metric
Proposed by IBM's SMT group (Papineni et al., 2002)
Widely used in MT evaluations
– DARPA TIDES MT evaluation
– IWSLT evaluation
– TC-STAR
BLEU metric:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big)

– p_n: modified n-gram precision; the score is the geometric mean of p_1, p_2, ..., p_N
– BP: brevity penalty, BP = 1 if c > r, and e^{1 - r/c} otherwise
– Usually N = 4 and w_n = 1/N
– c: length of the MT hypothesis; r: effective reference length

Slide 5: BLEU Metric Example
MT hypothesis: the gunman was shot dead by police .
– Reference 1: The gunman was shot to death by the police .
– Reference 2: The gunman was shot to death by the police .
– Reference 3: Police killed the gunman .
– Reference 4: The gunman was shot dead by the police .
Precision: p_1 = 1.0 (8/8), p_2 = 0.86 (6/7), p_3 = 0.67 (4/6), p_4 = 0.6 (3/5)
Brevity penalty: c = 8, r = 9, BP = e^{1 - 9/8} ≈ 0.883
Final score: BP × (p_1 p_2 p_3 p_4)^{1/4} ≈ 0.675
Usually the n-gram precisions and BP are calculated at the test-set level
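These numbers can be reproduced with a short script. The following is a minimal sketch, not the authors' implementation; the tokenization (lowercased, period as a token) and the shorter-length tie-breaking for the effective reference length are assumptions:

    import math
    from collections import Counter

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(hyp, refs, max_n=4):
        log_p_sum = 0.0
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            # Clip each hypothesis n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for ref in refs:
                for g, c in ngram_counts(ref, n).items():
                    max_ref[g] = max(max_ref[g], c)
            matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            if matched == 0:
                return 0.0
            log_p_sum += math.log(matched / sum(hyp_counts.values()))
        c = len(hyp)
        # Effective reference length: the reference length closest to the hypothesis length.
        r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
        bp = 1.0 if c > r else math.exp(1 - r / c)
        return bp * math.exp(log_p_sum / max_n)

    hyp = "the gunman was shot dead by police .".split()
    refs = [s.lower().split() for s in [
        "The gunman was shot to death by the police .",
        "The gunman was shot to death by the police .",
        "Police killed the gunman .",
        "The gunman was shot dead by the police .",
    ]]
    print(round(bleu(hyp, refs), 3))  # -> 0.675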

Slide 6: Modified BLEU Metric
BLEU weights long n-grams heavily because of the geometric mean
Modified BLEU metric (Zhang, 2004)
– Arithmetic mean of the n-gram precisions
– More balanced contribution from the different n-gram orders
[Table comparing p_1 ... p_4 and BLEU for two MT systems; the values were lost in transcription]
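To see the difference between the two means, apply them to the n-gram precisions from the slide 5 example (a toy illustration, not the two systems from the lost table):

    import math

    p = [1.0, 6/7, 4/6, 3/5]  # n-gram precisions from the slide 5 example
    geometric = math.exp(sum(math.log(x) for x in p) / len(p))  # BLEU-style mean
    arithmetic = sum(p) / len(p)                                # M-BLEU-style mean
    print(round(geometric, 3), round(arithmetic, 3))  # -> 0.765 0.781

    # Lowering only the 4-gram precision hurts the geometric mean much more:
    p[3] = 0.3
    geometric = math.exp(sum(math.log(x) for x in p) / len(p))
    arithmetic = sum(p) / len(p)
    print(round(geometric, 3), round(arithmetic, 3))  # -> 0.643 0.706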

Slide 7: NIST MTEval Metric
Motivation
– "Weight more heavily those n-grams that are more informative" (NIST, 2002)
– Uses an arithmetic mean of the information-weighted n-gram scores
Pros: more sensitive than BLEU
Cons:
– The information gain for 2-grams and up is not meaningful: 80% of the score comes from unigram matches, and most matched 5-grams have an information gain of 0!
– The score increases as the test set size increases
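For reference, the information weight NIST assigns to an n-gram (this is the published NIST 2002 definition; the formula itself was not preserved in the transcript) is

    \mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}

computed over the reference corpus. An n-gram that occurs every time its (n-1)-gram prefix occurs carries zero information, which is why most matched 5-grams contribute nothing to the score.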

Slide 8: Questions Regarding MT Evaluation Metrics
Do they rank MT systems in the same way as human judges?
– IBM showed a strong correlation between BLEU and human judgments
How reliable are the automatic evaluation scores?
How sensitive is a metric?
– Sensitivity: the metric should be able to distinguish between systems of similar performance
Is the metric consistent?
– Consistency: the difference between systems should not be affected by the selection of testing/reference data
How many reference translations are needed?
How much testing data is sufficient for evaluation?
If we can measure the confidence interval of the evaluation scores, we can answer all of the above questions

Slide 9: Outline (recap; next section: Confidence Intervals Based on the Bootstrap Percentile)

Slide 10: Measuring the Confidence Intervals
One BLEU/M-BLEU/NIST score per test set: how accurate is this score?
To measure a confidence interval, a population is required, but building a test set with multiple human reference translations is expensive
Solution: bootstrapping (Efron & Tibshirani, 1986)
– Introduced by Efron in 1979 as a computer-based method for estimating the standard errors of a statistical estimate
– Resampling: creating an artificial population by sampling with replacement
– Proposed by Franz Och (2003) to measure confidence intervals for automatic MT evaluation metrics

Slide 11: A Schematic of the Bootstrapping Process
[Figure: resampled test sets and the resulting distribution of scores around the original score Score_0; image lost in transcription]

Slide 12: An Efficient Implementation
Translate and evaluate 2,000 test sets? No way!
Instead, resample the n-gram precision information for the sentences
– Most MT systems are context independent at the sentence level
– MT evaluation metrics are based on information collected for each test sentence
– E.g., for BLEU/M-BLEU and NIST, store per sentence the closest reference length and, for each n-gram order, the matched and total counts [the numeric example was lost in transcription]
– Similar for human judgments and other MT metrics
An approximation is used for the NIST information gain
Scripts available at:
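A sketch of the per-sentence record this implies (the field names are assumptions, not the authors' script): each sentence contributes a handful of integers, and the test-set score is computed from their sums, so a bootstrap replicate only has to re-sum integers instead of re-matching n-grams.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SentStats:
        closest_ref_len: int   # the "RefLen: ClosestRefLen" field from the slide
        hyp_len: int
        matched: List[int]     # clipped n-gram matches for n = 1..4
        total: List[int]       # n-grams in the hypothesis for n = 1..4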

Slide 13: Algorithm
Original test suite T_0 with N segments and R reference translations
Represent the i-th segment of T_0 as an n-tuple T_0[i] of its per-sentence statistics

    for (b = 1; b <= B; b++) {
        for (i = 1; i <= N; i++) {
            s = random(1, N);      // draw a segment index with replacement
            T_b[i] = T_0[s];
        }
        calculate BLEU/M-BLEU/NIST for T_b;
    }
    sort the B BLEU/M-BLEU/NIST scores;
    output the scores ranked at the 2.5th and 97.5th percentiles;
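A runnable Python version of this loop, as a sketch: it assumes the per-sentence statistics are stored as the SentStats records sketched under slide 12, and uses BLEU as the metric; M-BLEU and NIST would only change the scoring function.

    import math
    import random

    def bleu_from_stats(sample, max_n=4):
        # Corpus BLEU from summed per-sentence statistics (SentStats records).
        matched = [sum(s.matched[n] for s in sample) for n in range(max_n)]
        total = [sum(s.total[n] for s in sample) for n in range(max_n)]
        c = sum(s.hyp_len for s in sample)          # hypothesis length
        r = sum(s.closest_ref_len for s in sample)  # effective reference length
        if min(matched) == 0:
            return 0.0
        log_p = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
        bp = 1.0 if c > r else math.exp(1 - r / c)
        return bp * math.exp(log_p)

    def percentile_interval(stats, B=2000, alpha=0.05):
        # Percentile bootstrap: resample whole sentences with replacement B times.
        n = len(stats)
        scores = sorted(
            bleu_from_stats([stats[random.randrange(n)] for _ in range(n)])
            for _ in range(B)
        )
        return scores[int(B * alpha / 2)], scores[int(B * (1 - alpha / 2)) - 1]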

Slide 14: Confidence Intervals
7 Chinese-English MT systems from the June 2002 TIDES evaluation
Observations:
– Relative confidence interval: NIST < M-BLEU < BLEU
– NIST scores have more discriminative power than BLEU
– The strong impact of long n-grams makes the BLEU score less stable (or: introduces more noise)

Slide 15: Are Two MT Systems Different?
Comparing two MT systems' performance
– Using the same method as for a single system
– E.g., Diff(Sys1 - Sys2): median and interval [values lost in transcription]
– If the confidence interval overlaps with 0, the two systems are not significantly different
M-BLEU and NIST have more discriminative power than BLEU
The automatic metrics have fairly high correlations with the human ranking
Human judges like system E (a syntactic system) more than system B (a statistical system), but the automatic metrics do not
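A sketch of this comparison (assuming both systems were scored on the same test segments, so one resample of segment indices per replicate is applied to both systems' SentStats lists; bleu_from_stats is the scorer sketched under slide 13):

    import random

    def paired_difference_interval(stats_a, stats_b, B=2000, alpha=0.05):
        # Bootstrap interval for score(A) - score(B) over shared test segments.
        # If the interval contains 0, the difference is not significant at level alpha.
        assert len(stats_a) == len(stats_b)
        n = len(stats_a)
        diffs = []
        for _ in range(B):
            idx = [random.randrange(n) for _ in range(n)]  # one resample for both systems
            diffs.append(bleu_from_stats([stats_a[i] for i in idx])
                         - bleu_from_stats([stats_b[i] for i in idx]))
        diffs.sort()
        return diffs[int(B * alpha / 2)], diffs[int(B * (1 - alpha / 2)) - 1]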

Slide 16: Outline (recap; next section: Discussions: how much testing data is needed, how many reference translations are needed, how many bootstrap samples are needed, and non-parametric intervals vs. normal/t intervals)

Slide 17: How Much Testing Data Is Needed?
[Figure: scores and confidence intervals as a function of test set size; image lost in transcription]

Slide 18: How Much Testing Data Is Needed? (cont.)
NIST scores increase steadily with growing test set size
The distance between the scores of the different systems remains stable when using 40% or more of the test set
The confidence intervals become narrower for larger test sets
Rule of thumb: doubling the testing data size narrows the confidence interval by about 30% (theoretically justified; see the note below)
(* System A, bootstrap size B = 2000)
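The rule of thumb is the usual standard-error scaling (a standard result, stated here for completeness): the interval width shrinks with the square root of the test set size,

    \text{width} \propto \frac{1}{\sqrt{N}} \qquad\Rightarrow\qquad \frac{\text{width}(2N)}{\text{width}(N)} = \frac{1}{\sqrt{2}} \approx 0.71,

i.e. roughly 30% narrower when N doubles.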

Slide 19: Effects of Using Multiple References
A single reference from one translator may favor some systems
Increasing the number of references narrows the relative confidence interval

Slide 20: How Many Reference Translations Are Sufficient?
Confidence intervals become narrower with more reference translations
– Relative interval widths: 100% (1 ref), 80-90% (2 refs), 70-80% (3 refs), 60-70% (4 refs)
One additional reference translation compensates for 10-15% of the testing data
(* System A, bootstrap size B = 2000)

Slide 21: Do We Really Need Multiple References?
Parallel multiple references vs. a single reference from multiple translators*
– Reduced bias from different translators
– Yields the same confidence interval/reliability as parallel multiple references
– Costs only half the effort of building a parallel multiple-reference set
* Originally proposed in IBM's BLEU report

Slide 22: Single Reference from Multiple Translators
– Reduced bias by mixing references from different translators
– Yields the same confidence intervals
[Figure lost in transcription]

Slide 23: Bootstrap-t Interval vs. Normal/t Interval
Normal interval (assuming the scores are normally distributed): \hat\theta \pm z^{(1-\alpha)} \cdot \hat{se}
Student's t-interval (when n is small): \hat\theta \pm t^{(1-\alpha)}_{n-1} \cdot \hat{se}
Bootstrap-t interval:
– For each bootstrap sample, calculate Z^*(b) = (\hat\theta^*(b) - \hat\theta) / \hat{se}^*(b)
– The \alpha-th percentile is estimated by the value \hat{t}^{(\alpha)} such that \#\{ Z^*(b) \le \hat{t}^{(\alpha)} \} / B = \alpha
– The bootstrap-t interval is (\hat\theta - \hat{t}^{(1-\alpha)} \cdot \hat{se}, \ \hat\theta - \hat{t}^{(\alpha)} \cdot \hat{se})
– E.g., if B = 1000, the 50th largest and the 950th largest Z^* values estimate \hat{t}^{(0.05)} and \hat{t}^{(0.95)}

Slide 24: Bootstrap-t Interval vs. Normal/t Interval (cont.)
The bootstrap-t interval assumes no distribution, but
– it can give erratic results
– it can be heavily influenced by a few outlying data points
When B is large, the bootstrap sample scores are quite close to a normal distribution
Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B = 500):
– STDEV = 0.27 for the bootstrap-t interval
– STDEV = 0.14 for the normal/Student's t interval

Slide 25: The Number of Bootstrap Replications B
The ideal bootstrap estimate of the confidence interval takes B to infinity
Computational time increases linearly with B
The greater B, the smaller the standard deviation of the estimated confidence intervals; e.g., for BLEU's relative confidence interval:
– STDEV = 0.60 when B = 100; STDEV = 0.27 when B = 500
Two rules of thumb:
– Even a small B, say B = 100, is usually informative
– B > 1000 gives quite satisfactory results

Slide 26: Conclusions
Used the bootstrapping method to measure confidence intervals for MT evaluation metrics
Used confidence intervals to study the characteristics of an MT evaluation metric
– Correlation with human judgments
– Sensitivity
– Consistency
Modified BLEU is a better metric than BLEU
A single reference from multiple translators is as good as parallel multiple references and costs only half the effort

Slide 27: References
Efron, B. and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1.
F. Och: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL 2003, Sapporo, Japan.
M. Bisani and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP 2004, Montreal, Canada, Vol. 1.
G. Leusch, N. Ueffing and H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
I. Dan Melamed, Ryan Green and Joseph P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
King, M., Popescu-Belis, A. and Hovy, E.: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th Machine Translation Summit, New Orleans, LA, USA.
S. Nießen, F. J. Och, G. Leusch and H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics'.
Papineni, K., Roukos, S., et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.
Ying Zhang, Stephan Vogel and Alex Waibel: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.

Slide 28: Questions and Comments?

Slide 29: N-gram Contributions to NIST Score
[Figure lost in transcription]