Adaptable Automatic Evaluation Metrics for Machine Translation
Lucian Vlad Lita
joint work with Alon Lavie and Monica Rogati
Outline
- BLEU and ROUGE metric families
- BLANC: a family of adaptable metrics
- All common skip n-grams
- Local n-gram model
- Overall model
- Experiments and results
- Conclusions
- Future work
- References
Automatic Evaluation Metrics
- Manual human judgments
- Edit distance (WER)
- Word overlap (PER)
- Metrics based on n-grams:
  - n-gram precision (BLEU)
  - weighted n-grams (NIST)
  - longest common subsequence (Rouge-L)
  - skip 2-grams, i.e. pairs of ordered words (Rouge-S)
- Integrating additional knowledge such as synonyms and stemming (METEOR)
[Figure: translation quality(candidate | reference) plotted against time]
Automatic Evaluation Metrics
- Manual human judgments
- Machine translation (MT) evaluation metrics:
  - Manually created estimators of quality
  - Improvements often shown on the same data
  - Rigid notion of quality, based on existing judgment guidelines
- Goal: a trainable evaluation metric
[Figure: translation quality(candidate | reference) plotted against time]
Goal: Trainable MT Metric
- Build on the features used by established metrics (BLEU, ROUGE)
- Extendable: additional features and processing
- Correlate well with human judgments
- Trainable models:
  - Different notions of "translation quality", e.g. computer consumption vs. human consumption
  - Different features will matter more for different languages and domains
The WER Metric
R: the students asked the professor
C: the students talk professor
Transform the reference (human) translation R into the candidate (machine) translation C using the Levenshtein (edit) distance.
Word Error Rate = (# of word insertions, deletions, and substitutions) / (# of words in R)
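A minimal sketch of this computation in Python (not part of the original slides): word-level Levenshtein distance normalized by the reference length.

```python
def wer(reference, candidate):
    """Word Error Rate: word-level edit distance divided by the reference length."""
    r, c = reference.split(), candidate.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(c) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            cost = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(c)] / len(r)

# Slide example: one substitution (asked -> talk) and one deletion (the) -> WER = 2/5
print(wer("the students asked the professor", "the students talk professor"))
```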
The PER Metric
R: the students asked the professor
C: the students talk professor
Word overlap between the candidate (machine) translation C and the reference (human) translation R, treating both as bags of words.
Position-independent Error Rate = ( Σ_{w in C} | count of w in R − count of w in C | ) / (# of words in R)
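A sketch of the slide's bag-of-words formula, with the small assumption that the sum runs over all word types occurring in either sentence (the published PER definition differs slightly):

```python
from collections import Counter

def per(reference, candidate):
    """Position-independent Error Rate: bag-of-words count mismatch over reference length."""
    r, c = Counter(reference.split()), Counter(candidate.split())
    mismatch = sum(abs(r[w] - c[w]) for w in set(r) | set(c))
    return mismatch / sum(r.values())

# Slide example: "the" is short by one, "asked" is missing, "talk" is spurious -> 3/5
print(per("the students asked the professor", "the students talk professor"))
```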
The BLEU Metric
R: the students asked the professor
C: the students talk professor
Contiguous n-gram overlap between the reference (human) translation R and the candidate (machine) translation C.
Modified n-gram precisions:
- 1-gram precision = 3/4
- 2-gram precision = 1/3
- ...
BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
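A sketch of single-reference, sentence-level BLEU with clipped (modified) n-gram precisions, following the formula above; note that BLEU is normally computed at the corpus level and smoothed, which this sketch omits.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4):
    """Geometric mean of clipped n-gram precisions times the brevity penalty."""
    r, c = reference.split(), candidate.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(c, n), ngram_counts(r, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # zero precision at some order; real systems apply smoothing here
        log_prec_sum += math.log(clipped / total)
    brevity = 1.0 if len(c) >= len(r) else math.exp(1 - len(r) / len(c))
    return brevity * math.exp(log_prec_sum / max_n)

# Slide example, truncated to bigrams: P1 = 3/4, P2 = 1/3
print(bleu("the students asked the professor", "the students talk professor", max_n=2))
```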
The BLEU Metric
- BLEU is the most established evaluation metric in MT
- Basic feature: contiguous n-grams of all sizes
- Computes modified precision
- Uses a simple formula to combine all precision scores: bigram precision is "as important" as unigram precision
- Brevity penalty acts as a quasi recall term
The Rouge-L Metric
R: the students asked the professor
C: the students talk professor
Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R.
LCS = 3: "the students ... professor"
Precision = LCS(C,R) / (# of words in C)
Recall = LCS(C,R) / (# of words in R)
Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P+R)
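A minimal Rouge-L sketch as described on this slide (plain LCS, no weighting):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate):
    """Harmonic mean of LCS precision (over |C|) and recall (over |R|)."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall) if lcs else 0.0

# Slide example: LCS = 3 -> P = 3/4, R = 3/5, Rouge-L = 2/3
print(rouge_l("the students asked the professor", "the students talk professor"))
```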
The Rouge-S Metric
R: the students asked the professor
C: the students talk professor
Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.
Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }
Skip2(C,R) = 3: { "the students", "the professor", "students professor" }
The Rouge-S Metric
R: the students asked the professor
C: the students talk professor
Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.
Precision = Skip2(C,R) / (|C| choose 2)
Recall = Skip2(C,R) / (|R| choose 2)
Rouge-S = harmonic mean(Precision, Recall)
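A sketch of Rouge-S with unlimited skip distance, reproducing the counts from the example:

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens):
    """Multiset of ordered word pairs with any gap size."""
    return Counter(combinations(tokens, 2))

def rouge_s(reference, candidate):
    """Harmonic mean of skip-bigram precision and recall."""
    r, c = reference.split(), candidate.split()
    overlap = sum((skip_bigrams(r) & skip_bigrams(c)).values())
    precision = overlap / (len(c) * (len(c) - 1) / 2)
    recall = overlap / (len(r) * (len(r) - 1) / 2)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

# Slide example: 3 common skip bigrams -> P = 3/6, R = 3/10
print(rouge_s("the students asked the professor", "the students talk professor"))
```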
The ROUGE Metrics
- Rouge-L
  - Basic feature: longest common subsequence (LCS), the size of the longest common skip n-gram
  - Weighted LCS variant
- Rouge-S
  - Basic feature: skip bigrams
  - Skip bigram gap size is irrelevant
  - Limited to n-grams of size 2
- Both use the harmonic mean (F1-measure) to combine precision and recall
Is BLEU Trainable?
- Can we assign/learn the relative importance of P_2 vs. P_3?
- Simplest model: regression
  - Train/test on past MT output [C, R]
  - Inputs: P_1, P_2, P_3, ... and the brevity penalty
[Diagram: (P_1, P_2, P_3, bp) → HJ fluency score]
BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
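A toy illustration of the idea, with invented numbers: fit per-order weights by least squares against human fluency judgments instead of BLEU's fixed uniform 1/n weighting.

```python
import numpy as np

# Hypothetical per-sentence features: P1..P4 (modified n-gram precisions) and brevity penalty.
X = np.array([
    [0.80, 0.55, 0.40, 0.25, 0.95],
    [0.60, 0.30, 0.15, 0.05, 1.00],
    [0.90, 0.70, 0.55, 0.40, 0.85],
    [0.50, 0.20, 0.05, 0.01, 0.70],
])
# Hypothetical human-judgment (HJ) fluency scores on a 1-5 scale for the same sentences.
y = np.array([3.8, 2.5, 4.2, 1.9])

# Linear regression with an intercept: learn the relative importance of P1, P2, P3, P4, bp.
X1 = np.hstack([X, np.ones((len(X), 1))])
weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
predicted_fluency = X1 @ weights   # the trained metric's score for each sentence
```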
Is Rouge Trainable?
- Simple regression on:
  - Size of the longest common skip n-gram
  - Number of common skip 2-grams
- Second-order parameters (dependencies): the model is no longer linear in its inputs
- Window size (for computational reasons)
- F-measure parameter F (replacing the brevity penalty)
- Potential models: iterative methods, hill climbing?
[Diagram: non-linear model (bp, |LCS|, Skip2, F, window size) → HJ fluency score]
The BLANC Metric Family
- Generalization of established evaluation metrics: the n-gram features used by BLEU and ROUGE
- Trainable parameters:
  - Skip n-gram contiguity in C
  - Relative importance of n (i.e. bigrams vs. trigrams)
  - Precision-recall balance
- Adaptable to different translation quality criteria, languages, and domains
- Allows additional processing/features (e.g. METEOR matching)
All Common Skip N-grams
C: the one pure student brought the necessary condiments
R: the new student brought the food
Matched words, written as word(position in R, position in C): the(0,0), the(0,5), the(4,0), the(4,5), student(2,3), brought(3,4)
Common skip n-gram counts: # 1-grams: 4, # 2-grams: 6, # 3-grams: 4, # 4-grams: 1
[Figure: matrix used to count the common skip n-grams of each size]
All Common Skip N-grams
C: the one pure student brought the necessary condiments
R: the new student brought the food
Replace the counts with scores: score(1-grams), score(2-grams), score(3-grams), score(4-grams), built from per-match scores such as score(the(0,0), student(2,3))
[Figure: the same matrix with counts replaced by scores, e.g. entries s_22, s_32]
All Common Skip N-grams
- Algorithms literature: all common subsequences
  - Listing vs. counting subsequences
  - We are interested in counting: # of common subsequences of size 1, 2, 3, ...
- Replace counting with a score over all n-grams of the same size:
  Score(w_1 ... w_i, w_{i+1} ... w_n) = Score(w_1 ... w_i) × Score(w_{i+1} ... w_n)
  BLANC_i(C,R) = f(common i-grams of C and R)
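A sketch of the counting step (my formulation, not necessarily the authors' algorithm): a dynamic program over matched word positions that counts, for each size k, the common skip k-grams shared by C and R; replacing the "+1" per match with a score gives the scored version from the previous slide. For 1-grams this counts every matched position pair, so it reports 6 for the example above rather than the 4 shown on the slide.

```python
from collections import defaultdict

def common_skip_ngram_counts(candidate, reference, max_n=4):
    """count[k] = number of common skip k-grams: pairs of increasing position
    sequences (one in C, one in R) whose words match position by position."""
    c, r = candidate.split(), reference.split()
    # end_at[k][(i, j)] = number of common skip k-grams whose last match is (c[i], r[j])
    end_at = defaultdict(lambda: defaultdict(int))
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            if cw != rw:
                continue
            end_at[1][(i, j)] += 1
            for k in range(2, max_n + 1):
                # extend every shorter common skip n-gram that ends strictly earlier in both
                end_at[k][(i, j)] += sum(v for (pi, pj), v in end_at[k - 1].items()
                                         if pi < i and pj < j)
    return {k: sum(end_at[k].values()) for k in range(1, max_n + 1)}

# Slide example -> {1: 6, 2: 6, 3: 4, 4: 1}
print(common_skip_ngram_counts(
    "the one pure student brought the necessary condiments",
    "the new student brought the food"))
```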
Modeling Gap Size Importance
Skip 3-grams with decreasing gap sizes:
... the ____ ____ ____ ____ student ____ ____ has ...
... the ____ student has ...
... the student has ...
Modeling Gap Size Importance
C: ... the __ __ __ __ student __ __ has ...
Model the importance of skip n-gram gap size as an exponential function with one parameter (call it α).
Special cases:
- Gap size doesn't matter (Rouge-S): α = 0
- No gaps are allowed (BLEU): α = a large number
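A small illustration of the gap-size weight; α is just my label for the slide's single parameter:

```python
import math

def gap_weight(total_gap, alpha):
    """Exponential penalty on the total gap size inside a skip n-gram."""
    return math.exp(-alpha * total_gap)

# alpha = 0 ignores gaps (Rouge-S behaviour); a large alpha kills any gapped match (BLEU behaviour).
for alpha in (0.0, 0.5, 10.0):
    print(alpha, [round(gap_weight(g, alpha), 4) for g in (0, 1, 4)])
```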
Modeling Candidate-Reference Gap Difference
A skip 3-gram match with different gap sizes in the candidate and the reference:
C1: ... the ____ ____ ____ ____ student ____ ____ has ...
R:  ... the ____ student has ...
C2: ... the student has ...
Modeling Candidate-Reference Gap Difference
C: ... the __ __ __ __ student __ __ has ...
R: ... the __ student has ...
Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter (call it β).
Special cases:
- Gap size differences do not matter: β = 0
- Skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2
- Largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS
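Combining both penalties into one per-match score (again a sketch; α and β are my labels for the two parameters):

```python
import math

def match_score(cand_gap, ref_gap, alpha, beta):
    """Score for one matched skip n-gram: penalize the candidate gap size and
    the difference between candidate and reference gap sizes."""
    return math.exp(-alpha * cand_gap - beta * abs(cand_gap - ref_gap))

# Rouge-S-like: ignore both (alpha = beta = 0); BLEU-like: forbid gaps (large alpha).
print(match_score(cand_gap=6, ref_gap=1, alpha=0.0, beta=0.0))   # 1.0
print(match_score(cand_gap=6, ref_gap=1, alpha=0.3, beta=0.3))   # heavily discounted
```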
Skip N-gram Model
- Incorporate simple scores into an exponential model:
  - Skip n-gram gap size
  - Candidate-reference gap size difference
- Possible to incorporate higher-level features:
  - Partial skip n-gram matching (e.g. synonyms, stemming): "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
  - From word classing to syntax, e.g. score("students __ __ professor") vs. score("the __ __ of")
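Putting the local pieces together, a sketch that scores one matched skip n-gram from its word positions in C and R; the feature set is illustrative, not the paper's exact model.

```python
import math

def skip_ngram_score(cand_positions, ref_positions, alpha, beta):
    """Score a matched skip n-gram given the word positions it occupies in C and R.
    Features: total gap size in the candidate, and the candidate-reference gap difference."""
    cand_gaps = [b - a - 1 for a, b in zip(cand_positions, cand_positions[1:])]
    ref_gaps = [b - a - 1 for a, b in zip(ref_positions, ref_positions[1:])]
    gap_size = sum(cand_gaps)
    gap_diff = sum(abs(cg - rg) for cg, rg in zip(cand_gaps, ref_gaps))
    return math.exp(-alpha * gap_size - beta * gap_diff)

# "the ... student ... has" matched at positions (0, 5, 8) in C and (0, 2, 3) in R
print(skip_ngram_score((0, 5, 8), (0, 2, 3), alpha=0.2, beta=0.1))
```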
BLANC Overview
[Flowchart: candidates and references → find all common skip n-grams → compute skip n-gram pair features via an exponential model over features f_i(sn) → combine all common skip n-gram scores using global parameters (precision/recall balance, f(skip n-gram size)) → compute a correlation coefficient (Pearson, Spearman) against the chosen criterion (adequacy, fluency, f(adequacy, fluency), other) → trained metric]
Incorporating Global Features
- Compute BLANC precision and recall for each n-gram size i
- Global exponential model based on:
  - N-gram size: BLANC_i(C,R), i = 1..n
  - F-measure parameter F for each size i
  - Average reference segment size
  - Other scores (e.g. BLEU, ROUGE-L, ROUGE-S) ...
- Train for the average human judgment vs. train for the best overall correlation (as the error function)
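One plausible way to assemble the per-size scores into a single number (a sketch; the exponential size weighting and the F parameter handling are assumptions, not the published model):

```python
import math

def blanc_score(per_size_precision, per_size_recall, size_weights, f_beta=1.0):
    """Combine per-n-gram-size precision/recall into one score:
    an F-measure per size, mixed with exponential weights over n-gram size."""
    total, norm = 0.0, 0.0
    for n, w in size_weights.items():
        p, r = per_size_precision[n], per_size_recall[n]
        f = (1 + f_beta ** 2) * p * r / (f_beta ** 2 * p + r) if p + r > 0 else 0.0
        weight = math.exp(w)          # exponential model over the size feature
        total += weight * f
        norm += weight
    return total / norm if norm else 0.0

# Toy numbers: unigram and bigram BLANC precision/recall, with bigrams weighted up.
print(blanc_score({1: 0.75, 2: 0.4}, {1: 0.6, 2: 0.3}, size_weights={1: 0.0, 2: 0.5}))
```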
Experiment Setup
- TIDES evaluation data: Arabic-English, 2003 and 2004
- Training and test sentences separated by year
- Optimized parameters: n-gram contiguity, difference in gap size (C vs. R), balance between precision and recall
- Correlation measured with the Pearson correlation coefficient
- Compared BLANC to BLEU and ROUGE
- Trained BLANC for fluency vs. adequacy, at system level vs. sentence level
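The correlation step itself is simple; for example, system-level Pearson correlation between metric scores and human adequacy judgments (all numbers invented):

```python
import numpy as np

# Hypothetical per-system averages: metric scores and human adequacy judgments.
metric_scores = np.array([0.21, 0.34, 0.29, 0.41, 0.38])
human_adequacy = np.array([2.1, 3.0, 2.6, 3.6, 3.3])

pearson = np.corrcoef(metric_scores, human_adequacy)[0, 1]
print(round(pearson, 3))   # close to 1.0 means the metric ranks systems like humans do
```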
TIDES 2003 Arabic Evaluation
Pearson [-1,1] correlation with human judgments at system level and sentence level

Method  | System Adequacy | System Fluency | Sentence Adequacy | Sentence Fluency
BLEU    | 0.950 | 0.934 | 0.382 | 0.286
NIST    | 0.962 | 0.939 | 0.439 | 0.304
Rouge-L | 0.974 | 0.926 | 0.440 | 0.328
Rouge-S | 0.949 | 0.935 | 0.360 | 0.328
BLANC   | 0.988 | 0.979 | 0.492 | 0.391
TIDES 2004 Arabic Evaluation
Pearson [-1,1] correlation with human judgments at system level and sentence level

Method  | System Adequacy | System Fluency | Sentence Adequacy | Sentence Fluency
BLEU    | 0.978 | 0.994 | 0.446 | 0.337
NIST    | 0.987 | 0.952 | 0.529 | 0.358
Rouge-L | 0.981 | 0.985 | 0.538 | 0.412
Rouge-S | 0.937 | 0.980 | 0.367 | 0.408
BLANC   | 0.982 | 0.994 | 0.565 | 0.438
Advantages of BLANC
- Consistently good performance
- Candidate evaluation is fast
- Adaptable: fluency and adequacy; languages, domains
- Helps train MT systems for specific tasks, e.g. information extraction, information retrieval
- Model complexity can be optimized for specific MT system performance levels
Disadvantages of BLANC
- Training data vs. number of parameters
- Model complexity
- Guarantees of the training process
Conclusions
- Move towards learning evaluation metrics across:
  - Quality criteria, e.g. fluency, adequacy
  - Correlation coefficients, e.g. Pearson, Spearman
  - Languages, e.g. English, Arabic, Chinese
- BLANC: a family of trainable evaluation metrics
- Consistently performs well on evaluating machine translation output
Future Work
- Recently obtained a two-year NSF grant
- Try different models and improve the training mechanism for BLANC
  - Is a local exponential model the best choice?
  - Is a global exponential model the best choice?
- Explore different training methods
- Integrate additional features
- Apply BLANC to other tasks (summarization)
References
- Leusch, Ueffing, Vilar and Ney, "Preprocessing and Normalization for Automatic Evaluation of Machine Translation", IEEMTS Workshop, ACL 2005
- Lin and Och, "Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics", ACL 2004
- Lita, Lavie and Rogati, "BLANC: Learning Evaluation Metrics for MT", HLT-EMNLP 2005
- Papineni, Roukos, Ward and Zhu, "BLEU: A Method for Automatic Evaluation of Machine Translation", IBM Report 2002
- Akiba, Imamura and Sumita, "Using Multiple Edit Distances to Automatically Rank Machine Translation Output", MT Summit VIII, 2001
- Su, Wu and Chang, "A New Quantitative Quality Measure for a Machine Translation System", COLING 1992
Thank you
Acronyms, acronyms ...
- Official: Broad Learning Adaptation for Numeric Criteria
- Inspiration: white light contains light of all frequencies
- Fun: Building on Legacy Acronym Naming Conventions (Bleu, Rouge, Orange, Pourpre ... Blanc?)