1
Measuring Confidence Intervals for MT Evaluation Metrics Ying Zhang (Joy) Stephan Vogel Language Technologies Institute School of Computer Science Carnegie Mellon University
2
Oct 2004 TMI, Baltimore, MD Ying Zhang, Stephan Vogel LTI, Carnegie Mellon University 2 Outline Automatic Machine Translation Evaluation –BLEU –Modified BLEU –NIST MTEval Confidence Intervals based on Bootstrap Percentile –Algorithm –Comparing two MT systems –Implementation Discussions –How much testing data is needed? –How many reference translations are needed? –How many bootstrap samples are needed?
3
Automatic Machine Translation Evaluation
Subjective MT evaluations
–Fluency and adequacy scored by human judges
–Very expensive in time and money
Objective automatic MT evaluations
–Inspired by the Word Error Rate metric used in ASR research
–Measure the “closeness” between the MT hypothesis and human reference translations
–Precision: n-gram precision
–Recall: against the best-matched reference, approximated by the brevity penalty
–Cheap and fast
–Highly correlated with subjective evaluations
–MT research has greatly benefited from automatic evaluations
–Typical metrics: IBM BLEU, CMU M-BLEU, CMU METEOR, NIST MTeval, NYU GTM
4
BLEU Metrics
Proposed by IBM’s SMT group (Papineni et al., 2002)
Widely used in MT evaluations
–DARPA TIDES MT evaluation
–IWSLT evaluation
–TC-Star
BLEU metric: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
–$p_n$: modified n-gram precision
–Geometric mean of $p_1, p_2, \ldots, p_N$
–BP: brevity penalty, $\mathrm{BP} = 1$ if $c > r$, otherwise $\mathrm{BP} = e^{1 - r/c}$
–Usually $N = 4$ and $w_n = 1/N$
–$c$: length of the MT hypothesis; $r$: effective reference length
5
BLEU Metric
Example:
–MT Hypothesis: the gunman was shot dead by police.
–Reference 1: The gunman was shot to death by the police.
–Reference 2: The gunman was shot to death by the police.
–Reference 3: Police killed the gunman.
–Reference 4: The gunman was shot dead by the police.
Precision: $p_1 = 1.0$ (8/8), $p_2 = 0.86$ (6/7), $p_3 = 0.67$ (4/6), $p_4 = 0.6$ (3/5)
Brevity penalty: $c = 8$, $r = 9$, $\mathrm{BP} = e^{1 - 9/8} = 0.8825$
Final score: $\mathrm{BLEU} = 0.8825 \cdot \left(\tfrac{8}{8} \cdot \tfrac{6}{7} \cdot \tfrac{4}{6} \cdot \tfrac{3}{5}\right)^{1/4} \approx 0.675$
Usually the n-gram precision and BP are calculated at the test-set level
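The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch that takes the matched/total n-gram counts and the lengths c and r from the slide as given; the function name `bleu_from_counts` is illustrative and not part of any toolkit.

```python
import math

def bleu_from_counts(matches, totals, c, r):
    """BLEU from matched/total n-gram counts, hypothesis length c, reference length r."""
    # Uniform weights w_n = 1/N over the N n-gram orders (geometric mean).
    n_orders = len(matches)
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / n_orders
    # Brevity penalty: 1 if the hypothesis is longer than the reference, else exp(1 - r/c).
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_mean)

# Counts from the slide: p1 = 8/8, p2 = 6/7, p3 = 4/6, p4 = 3/5, with c = 8 and r = 9.
score = bleu_from_counts([8, 6, 4, 3], [8, 7, 6, 5], c=8, r=9)
print(round(score, 4))
```

With the slide's counts the score comes out at roughly 0.675. A zero match count would make the logarithm undefined, which this sketch does not handle.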
6
Modified BLEU Metric
BLEU focuses heavily on long n-grams because of the geometric mean
Example:
       p1     p2     p3     p4     BLEU
MT1    1.0    0.21   0.11   0.06   0.19
MT2    0.35   0.32   0.28   0.26   0.30
Modified BLEU Metric (Zhang, 2004)
–Arithmetic mean of the n-gram precision
–More balanced contribution from different n-grams
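The effect of the averaging choice can be seen by applying both means to the precisions in the example. This sketch ignores the brevity penalty and any other detail of M-BLEU that the slide does not spell out; the helper names are illustrative.

```python
import math

def geometric_mean(ps):
    # BLEU-style combination: exp of the average log precision.
    return math.exp(sum(math.log(p) for p in ps) / len(ps))

def arithmetic_mean(ps):
    # M-BLEU-style combination as described on the slide.
    return sum(ps) / len(ps)

mt1 = [1.0, 0.21, 0.11, 0.06]
mt2 = [0.35, 0.32, 0.28, 0.26]
for name, ps in (("MT1", mt1), ("MT2", mt2)):
    print(name, round(geometric_mean(ps), 2), round(arithmetic_mean(ps), 3))
```

The geometric means reproduce the BLEU column (about 0.19 and 0.30), so MT1's weak long n-grams pull it well below MT2. Under the arithmetic mean MT1 scores 0.345 against 0.3025 for MT2, because its perfect unigram precision now counts as much as each of the other orders.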
7
NIST MTEval Metric
Motivation
–“Weight more heavily those n-grams that are more informative” (NIST 2002)
–Uses an arithmetic mean of the n-gram scores
Pros: more sensitive than BLEU
Cons:
–Information gain for 2-grams and up is not meaningful: 80% of the score comes from unigram matches, and most matched 5-grams have information gain 0!
–Score increases when the testing set size increases
8
Questions Regarding MT Evaluation Metrics
Do they rank the MT systems in the same way as human judges?
–IBM showed a strong correlation between BLEU and human judgments
How reliable are the automatic evaluation scores?
How sensitive is a metric?
–Sensitivity: the metric should be able to distinguish between systems of similar performance
Is the metric consistent?
–Consistency: the difference between systems is not affected by the selection of testing/reference data
How many reference translations are needed?
How much testing data is sufficient for evaluation?
If we can measure the confidence interval of the evaluation scores, we can answer the above questions
9
Outline Overview of Automatic Machine Translation Evaluation –BLEU –Modified BLEU –NIST MTEval Confidence Intervals based on Bootstrap Percentile –Algorithm –Comparing two MT systems –Implementation Discussions –How much testing data is needed? –How many reference translations are needed? –How many bootstrap samples are needed?
10
Measuring the Confidence Intervals
One BLEU/M-BLEU/NIST score per test set
How accurate is this score?
To measure the confidence interval, a population is required
Building a test set with multiple human reference translations is expensive
Solution: bootstrapping (Efron 1986)
–Introduced in 1979 as a computer-based method for estimating the standard error of a statistical estimate
–Resampling: creating an artificial population by sampling with replacement
–Proposed by Franz Och (2003) to measure the confidence intervals for automatic MT evaluation metrics
11
[Figure: A Schematic of the Bootstrapping Process]
12
An Efficient Implementation
Translate and evaluate 2,000 test sets?
–No way!
Resample the n-gram precision information for the sentences
–Most MT systems are context independent at the sentence level
–MT evaluation metrics are based on information collected for each test sentence
–E.g. for BLEU/M-BLEU and NIST:
    RefLen: 17 20 19 24
    ClosestRefLen 17
    1-gram: 15 10 89.34
    2-gram: 14 4 9.04
    3-gram: 13 3 3.65
    4-gram: 12 2 2.43
–Similar for human judgment and other MT metrics
Approximation for NIST information gain
Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
13
Algorithm
Original test suite T_0 with N segments and R reference translations
Represent the i-th segment of T_0 as an n-tuple T_0[i]
for (b = 1; b <= B; b++) {
    for (i = 1; i <= N; i++) {
        s = random(1, N);
        T_b[i] = T_0[s];
    }
    Calculate BLEU/M-BLEU/NIST for T_b
}
Sort the B BLEU/M-BLEU/NIST scores
Output the scores ranked at the 2.5th and 97.5th percentiles
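The resampling loop can be sketched in Python as follows. This is not the authors' script. It assumes each segment is stored as a dictionary of precomputed statistics (hypothesis length, closest reference length, total and matched n-gram counts for n = 1..4); the field names, the `toy_segments` generator and its synthetic counts are hypothetical and exist only to make the sketch runnable. Only BLEU is implemented.

```python
import math
import random

def corpus_bleu(segments, max_n=4):
    """Test-set-level BLEU from per-segment counts (assumed dictionary layout)."""
    c = sum(seg["hyp_len"] for seg in segments)
    r = sum(seg["ref_len"] for seg in segments)
    log_mean = 0.0
    for n in range(max_n):
        match = sum(seg["match"][n] for seg in segments)
        total = sum(seg["total"][n] for seg in segments)
        if match == 0 or total == 0:
            return 0.0
        log_mean += math.log(match / total) / max_n
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_mean)

def bootstrap_interval(segments, metric, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample segments with replacement, rescore, sort."""
    rng = random.Random(seed)
    n = len(segments)
    scores = sorted(
        metric([segments[rng.randrange(n)] for _ in range(n)]) for _ in range(B)
    )
    lower = scores[int(B * alpha / 2)]
    upper = scores[min(B - 1, int(B * (1 - alpha / 2)))]
    return lower, scores[B // 2], upper

def toy_segments(n_segments=200, seed=1):
    """Synthetic per-segment counts, for illustration only (not real MT output)."""
    rng = random.Random(seed)
    rates = [0.8, 0.5, 0.3, 0.2]
    segments = []
    for _ in range(n_segments):
        hyp_len = rng.randint(8, 30)
        total = [hyp_len - k for k in range(4)]
        match = [max(1, int(t * p * rng.uniform(0.6, 1.0))) for t, p in zip(total, rates)]
        segments.append({"hyp_len": hyp_len, "ref_len": hyp_len + rng.randint(-2, 2),
                         "total": total, "match": match})
    return segments

segments = toy_segments()
lower, median, upper = bootstrap_interval(segments, corpus_bleu, B=500)
print(f"BLEU 95% interval: [{lower:.4f}, {upper:.4f}], median {median:.4f}")
```

Because only stored counts are resampled, no segment is retranslated or rescored against the references, which is what makes thousands of replications affordable.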
14
Confidence Intervals
7 Chinese-English MT systems from the June 2002 TIDES evaluation
Observations:
–Relative confidence interval: NIST < M-BLEU < BLEU
–NIST scores have more discriminative power than BLEU
–The strong impact of long n-grams makes the BLEU score less stable (or introduces more noise)
15
Are Two MT Systems Different?
Comparing two MT systems’ performance
–Using a method similar to the single-system case
–E.g. Diff(Sys1-Sys2): Median=-1.7355 [-1.9056, -1.5453]
–If the confidence interval of the difference contains 0, the two systems are not significantly different
M-BLEU and NIST have more discriminative power than BLEU
Automatic metrics correlate highly with the human ranking
Human judges like system E (syntactic system) more than B (statistical system), but automatic metrics do not
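One way to obtain an interval for the difference is to resample the same segment indices for both systems and score the difference on each replicate. The slide only says a similar method is used, so the shared-index pairing here is an assumption, and the mean of per-segment scores stands in for a real corpus-level metric. The data are synthetic.

```python
import random

def mean_score(scores):
    # Stand-in metric: average of per-segment scores.
    return sum(scores) / len(scores)

def paired_bootstrap_diff(sys1, sys2, metric, B=2000, alpha=0.05, seed=0):
    """Percentile interval for metric(sys1) - metric(sys2) over shared resamples."""
    assert len(sys1) == len(sys2)
    rng = random.Random(seed)
    n = len(sys1)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # same segments for both systems
        diffs.append(metric([sys1[i] for i in idx]) - metric([sys2[i] for i in idx]))
    diffs.sort()
    lower = diffs[int(B * alpha / 2)]
    upper = diffs[min(B - 1, int(B * (1 - alpha / 2)))]
    return lower, diffs[B // 2], upper

def significantly_different(lower, upper):
    # The systems differ significantly when the interval excludes 0.
    return not (lower <= 0.0 <= upper)

data_rng = random.Random(42)
sys1 = [data_rng.uniform(0.2, 0.4) for _ in range(300)]
sys2 = [s + 0.05 + data_rng.uniform(-0.01, 0.01) for s in sys1]  # sys2 is better by design
lower, median, upper = paired_bootstrap_diff(sys1, sys2, mean_score, B=500)
print(lower, median, upper, significantly_different(lower, upper))
```

Pairing the resamples removes the variation that both systems share on hard or easy segments, so the interval of the difference is usually tighter than the two single-system intervals would suggest.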
16
Outline Overview of Automatic Machine Translation Evaluation –BLEU –Modified BLEU –NIST MTEval Confidence Intervals based on Bootstrap Percentile –Algorithm –Comparing two MT systems –Implementation Discussions –How much testing data is needed? –How many reference translations are needed? –How many bootstrap samples are needed? –Non-parametric interval or normal/t-intervals?
17
How much testing data is needed?
18
How much testing data is needed?
NIST scores increase steadily with growing test set size
The distance between the scores of the different systems remains stable when using 40% or more of the test set
The confidence intervals become narrower for larger test sets
Rule of thumb: doubling the testing data size narrows the confidence interval by 30% (theoretically justified)
* System A (bootstrap size B=2000)
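The slide does not give the justification for the 30% figure. The usual argument, assumed here, is that the standard error of an estimate from N independent segments shrinks as 1/sqrt(N), so the interval width scales the same way. A small check under that assumption:

```python
import math

def relative_ci_width(size_factor):
    """Interval width relative to the original, assuming width is proportional to 1/sqrt(N)."""
    return 1.0 / math.sqrt(size_factor)

# Doubling the test set: the interval shrinks to about 71% of its width.
narrowing = 1.0 - relative_ci_width(2)
print(f"{narrowing:.1%}")
```

Under the same assumption, halving the interval width requires four times as much test data.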
19
Effects of Using Multiple References
A single reference from one translator may favor some systems
Increasing the number of references narrows the relative confidence interval
20
How Many Reference Translations are Sufficient?
Confidence intervals become narrower with more reference translations
[100%] (1-ref) ~ [80~90%] (2-ref) ~ [70~80%] (3-ref) ~ [60~70%] (4-ref)
One additional reference translation compensates for 10~15% of the testing data
* System A (bootstrap size B=2000)
21
Do We Really Need Multiple References?
Parallel multiple references
Single reference from multiple translators*
–Reduced bias from different translators
–Yields the same confidence interval/reliability as parallel multiple references
–Costs only half of the effort compared to building a parallel multiple-reference set
*Originally proposed in IBM’s BLEU report
22
Single Reference from Multiple Translators
Reduced bias by mixing from different translators
Yields the same confidence intervals
23
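Bootstrap-t Interval vs. Normal/t Interval
Normal distribution / t-distribution
–Assuming that $Z = (\hat\theta - \theta)/\widehat{se} \sim N(0, 1)$, the interval is $\hat\theta \pm z^{(1-\alpha)} \cdot \widehat{se}$
Student’s t-interval (when n is small)
–Assuming that $Z \sim t_{n-1}$, the interval is $\hat\theta \pm t_{n-1}^{(1-\alpha)} \cdot \widehat{se}$
Bootstrap-t interval
–For each bootstrap sample $b$, calculate $Z^*(b) = (\hat\theta^*(b) - \hat\theta)/\widehat{se}^*(b)$
–The alpha-th percentile is estimated by the value $\hat t^{(\alpha)}$ such that $\#\{Z^*(b) \le \hat t^{(\alpha)}\}/B = \alpha$
–The bootstrap-t interval is $(\hat\theta - \hat t^{(1-\alpha)} \cdot \widehat{se},\ \hat\theta - \hat t^{(\alpha)} \cdot \widehat{se})$
–E.g. if B=1000, the 50th largest value and the 950th largest value give the bootstrap-t interval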
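The two kinds of interval can be compared on a statistic whose standard error has a closed form. The sketch below uses the sample mean of synthetic data, not an MT metric, so that the per-replicate standard error needed by the bootstrap-t method is simply s/sqrt(n); for BLEU or NIST that standard error would itself have to be estimated, which this example does not attempt. Function names are illustrative.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def normal_interval(data, alpha=0.05):
    """theta_hat +/- z^(1-alpha) * se, assuming a normally distributed estimate."""
    theta = mean(data)
    se = stdev(data) / math.sqrt(len(data))
    z = NormalDist().inv_cdf(1 - alpha)
    return theta - z * se, theta + z * se

def bootstrap_t_interval(data, B=1000, alpha=0.05, seed=0):
    """Bootstrap-t interval for the mean, using studentized bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    theta = mean(data)
    se = stdev(data) / math.sqrt(n)
    z_star = []
    for _ in range(B):
        sample = rng.choices(data, k=n)
        se_b = stdev(sample) / math.sqrt(n)
        if se_b > 0:  # skip degenerate resamples with no variation
            z_star.append((mean(sample) - theta) / se_b)
    z_star.sort()
    t_lo = z_star[int(len(z_star) * alpha)]
    t_hi = z_star[min(len(z_star) - 1, int(len(z_star) * (1 - alpha)))]
    # Note the swap: the upper percentile of Z* sets the lower end of the interval.
    return theta - t_hi * se, theta - t_lo * se

data_rng = random.Random(7)
data = [data_rng.gauss(25.0, 5.0) for _ in range(60)]
print("normal:     ", normal_interval(data))
print("bootstrap-t:", bootstrap_t_interval(data))
```

With alpha = 0.05 and B = 1000 the percentiles are read near positions 50 and 950 of the sorted replicates, matching the slide's example of a 90% interval.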
24
Bootstrap-t Interval vs. Normal/t Interval (Cont.)
Bootstrap-t intervals assume no distribution, but
–They can give erratic results
–They can be heavily influenced by a few outlying data points
When B is large, the bootstrap sample scores are close to a normal distribution
Assuming a normal distribution gives more reliable intervals, e.g. for the BLEU relative confidence interval (B=500)
–STDEV=0.27 for the bootstrap-t interval
–STDEV=0.14 for the normal/Student-t interval
25
The Number of Bootstrap Replications B
The ideal bootstrap estimate of the confidence interval takes B → ∞
Computational time increases linearly with B
The greater B, the smaller the standard deviation of the estimated confidence intervals. E.g. for BLEU’s relative confidence interval
–STDEV = 0.60 when B=100; STDEV = 0.27 when B=500
Two rules of thumb:
–Even a small B, say B=100, is usually informative
–B>1000 gives quite satisfactory results
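The dependence on B can be illustrated by repeating the whole bootstrap several times and measuring how much the resulting interval width varies from run to run. This sketch uses the mean of synthetic scores as the statistic and a percentile interval; the data, the values of B and the number of repeats are arbitrary choices for the illustration and are unrelated to the STDEV figures on the slide.

```python
import random
import statistics

def interval_width(data, B, rng, alpha=0.05):
    """Width of the percentile interval for the mean from B bootstrap replicates."""
    n = len(data)
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(B))
    return means[min(B - 1, int(B * (1 - alpha / 2)))] - means[int(B * alpha / 2)]

def width_spread(data, B, repeats=30, seed=0):
    """Standard deviation of the interval width across independent bootstrap runs."""
    rng = random.Random(seed)
    return statistics.stdev([interval_width(data, B, rng) for _ in range(repeats)])

data_rng = random.Random(3)
data = [data_rng.gauss(0.25, 0.1) for _ in range(100)]
spread_small = width_spread(data, B=50)
spread_large = width_spread(data, B=800)
print(spread_small, spread_large)
```

The run-to-run spread is clearly smaller for the larger B, while the cost of each run grows in proportion to B.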
26
Conclusions
Using the bootstrapping method to measure the confidence intervals for MT evaluation metrics
Using confidence intervals to study the characteristics of an MT evaluation metric
–Correlation with human judgments
–Sensitivity
–Consistency
Modified BLEU is a better metric than BLEU
Single reference from multiple translators is as good as parallel multiple references and costs only half the effort
27
References
Efron, B. and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
Och, F.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
Bisani, M. and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
Leusch, G., N. Ueffing and H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
Melamed, I. Dan, Ryan Green and Joseph P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
King, M., A. Popescu-Belis and E. Hovy: 2003, 'FEMTI: creating and using a framework for MT evaluation', In Proc. of the 9th Machine Translation Summit, New Orleans, LA, USA.
Nießen, S., F.J. Och, G. Leusch and H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
NIST Report: 2002, Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics, http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
Papineni, Kishore, Salim Roukos et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.
Zhang, Ying, Stephan Vogel and Alex Waibel: 2004, 'Interpreting BLEU/NIST scores: How much improvement do we need to have a better system?', In Proc. of LREC 2004, Lisbon, Portugal.
28
Questions and Comments?
29
N-gram Contributions to NIST Score