1
Machine Translation Course 10 Diana Trandabăț 2015-2016
2
MT Evaluation is difficult
– There is no single correct translation (language variability)
– Human evaluation is subjective
– How good is “good enough”? Depends on application
– Is system A better than system B? Depends on specific criteria…
MT Evaluation is a research topic in itself! How do we assess whether an evaluation method or metric is good?
3
Machine Translation Evaluation
Quality assessment at sentence (segment) level vs. task-based evaluation
Adequacy (is the meaning translated correctly?) vs. Fluency (is the output grammatical and fluent?) vs. Ranking (is translation 1 better than translation 2?)
Manual evaluation:
– Subjective Sentence Error Rates
– Correct vs. Incorrect translations
– Error categorization
Automatic evaluation:
– Exact Match, BLEU, METEOR, NIST, etc.
4
Human MT evaluation
Given machine translation output and the source and/or a reference translation, assess the quality of the machine translation output.
Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
Fluency: Is the output fluent? This involves both grammatical correctness and idiomatic word choices.
Source sentence: Le chat entre dans la chambre.
– Adequate, fluent translation: The cat enters the room.
– Adequate, disfluent translation: The cat enters in the bedroom.
– Fluent, inadequate translation: My Granny plays the piano.
– Disfluent, inadequate translation: piano Granny the plays My.
5
Human MT evaluation - scales
6
Main Types of Human Assessments
– Adequacy and Fluency scores
– Human preference ranking of translations at the sentence level
– Post-editing measures: post-editor editing time/effort
– Human editability measures: can humans edit the MT output into a correct translation?
– Task-based evaluations: was the performance of the MT system sufficient to perform a particular task?
7
MT Evaluation by Humans
Problems:
– Intra-coder agreement: consistency of the same human judge
– Inter-coder agreement: judgment agreement across multiple judges of quality
– Very expensive (with respect to time and money)
– Not always available
– Can’t help day-to-day system development
8
Goals for Evaluation Metrics
– Low cost: reduce the time and money spent on carrying out evaluation
– Tunable: automatically optimize system performance towards the metric
– Meaningful: the score should give an intuitive interpretation of translation quality
– Consistent: repeated use of the metric should give the same results
– Correct: the metric must rank better systems higher
9
Other Evaluation Criteria
When deploying systems, considerations go beyond the quality of translations:
– Speed: we prefer faster machine translation systems
– Size: fits into the memory of available machines (e.g., handheld devices)
– Integration: can be integrated into an existing workflow
– Customization: can be adapted to the user’s needs
10
Automatic Machine Translation Evaluation
Objective, cheap, fast.
Measures the “closeness” between the MT hypothesis and human reference translations:
– Precision: n-gram precision
– Recall: against the best-matched reference, approximated by the brevity penalty
Highly correlated with human evaluations.
MT research has greatly benefited from automatic evaluations.
Typical metrics: BLEU, NIST, F-score, METEOR, TER.
11
Automatic Machine Translation Evaluation
Automatic MT metrics are not sufficient:
– What does a score of 30.0 or 50.0 mean?
– Existing automatic metrics are crude and at times biased
– Automatic metrics don’t provide sufficient insight for error analysis
– Different types of errors have different implications depending on the underlying task in which MT is used
Need for reliable human measures in order to develop and assess automatic metrics for MT evaluation.
12
History of Automatic Evaluation of MT
2002: NIST starts the MT Eval series under the DARPA TIDES program, using BLEU as the official metric
2003: Och proposes MERT (minimum error rate training) for MT, tuned towards BLEU
2004: METEOR first comes out
2006: TER is released; the DARPA GALE program adopts HTER as its official metric
2006: NIST MT Eval starts reporting METEOR, TER and NIST scores in addition to BLEU; the official metric is still BLEU
2007: Research on metrics takes off… several new metrics come out
2007: MT research papers increasingly report METEOR and TER scores in addition to BLEU
…
13
Precision and Recall
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Precision: correct / output-length = 3/6 = 50%
Recall: correct / reference-length = 3/7 = 43%
F-measure: 2 × Precision × Recall / (Precision + Recall) = 46%
14
Precision and Recall
SYSTEM A: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B: airport security Israeli officials are responsible
Precision: correct / output-length = 6/6 = 100%
Recall: correct / reference-length = 6/7 = 85%
F-measure: 2 × Precision × Recall / (Precision + Recall) = 92%
Is system B better than system A?
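A minimal sketch (assuming whitespace tokenization and bag-of-words matching; the function name is ours) that reproduces the numbers above:

from collections import Counter

def precision_recall_f(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # Count word matches, clipping each word by its frequency in the reference
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

reference = "Israeli officials are responsible for airport security"
system_a = "Israeli officials responsibility of airport safety"
system_b = "airport security Israeli officials are responsible"
print(precision_recall_f(system_a, reference))  # ~ (0.50, 0.43, 0.46)
print(precision_recall_f(system_b, reference))  # ~ (1.00, 0.86, 0.92)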
15
Word Error Rate
Minimum number of editing steps to transform the output into the reference:
– match: words match, no cost
– substitution: replace one word with another
– insertion: add a word
– deletion: drop a word
Levenshtein distance
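A minimal sketch, assuming whitespace tokenization, of WER as the word-level Levenshtein distance divided by the reference length:

def word_error_rate(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # match or substitution
            dp[i][j] = min(dp[i - 1][j] + 1,               # deletion
                           dp[i][j - 1] + 1,               # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

reference = "Israeli officials are responsible for airport security"
print(round(word_error_rate("Israeli officials responsibility of airport safety", reference), 2))  # ~0.57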
16
Word Error Rate

Metric            System A   System B
Word error rate   57%        71%
17
BLEU – Its Motivation
Central idea: “The closer a machine translation is to a professional human translation, the better it is.”
Implication: an evaluation metric can itself be evaluated – if it correlates with human evaluation, it is a useful metric.
BLEU was proposed as an aid to, and a quick substitute for, human evaluation when needed.
18
What is BLEU? A Big Picture
Proposed by IBM [Papineni et al., 2002]
Main ideas:
– Exact matches of words
– Match against a set of reference translations for greater variety of expressions
– Account for adequacy by looking at word precision
– Account for fluency by calculating n-gram precisions for n = 1, 2, 3, 4
– Introduce a “brevity penalty” based on the ratio of system output length to reference length
19
BLEU score
– The final score is a weighted geometric average of the n-gram scores
– The aggregate score is calculated over a large test set
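In the standard formulation (Papineni et al., 2002), with modified n-gram precisions p_n, weights w_n (uniform, typically N = 4), candidate length c and effective reference length r:

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}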
20
BLEU Example
21
Multiple Reference Translations
To account for variability, use multiple reference translations:
– n-grams may match in any of the references
– the closest reference length is used
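As an illustrative sketch (the function name is ours, and the tie-breaking toward the shorter reference is one common convention, not a fixed standard), choosing the effective reference length as the one closest to the candidate length:

def effective_reference_length(candidate, references):
    # Pick the reference length closest to the candidate length; break ties toward the shorter one
    cand_len = len(candidate.split())
    ref_lens = [len(r.split()) for r in references]
    return min(ref_lens, key=lambda rl: (abs(rl - cand_len), rl))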
22
N-gram Precision: an Example
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Clearly, Candidate 1 is better.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
23
N-gram Precision
To rank Candidate 1 higher than Candidate 2:
– just count the number of n-gram matches
– the match can be position-independent
– the reference could be matched multiple times
– no need to be linguistically motivated
24
BLEU – Example: Unigram Precision
Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
Unigram precision: 17 of 18 candidate words appear in some reference.
25
Example: Unigram Precision (cont.)
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.
Unigram precision: 8 of 14 candidate words appear in some reference.
26
Issue of N-gram Precision
What if some words are over-generated? (e.g. “the”)
An extreme example:
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Unigram precision: 7/7 – something is wrong!
Intuitively, a reference word should be exhausted after it is matched.
27
Modified N-gram Precision: Procedure
Procedure:
– Count the maximum number of times a word occurs in any single reference
– Clip the total count of each candidate word by that maximum
– Modified n-gram precision = clipped count / total number of candidate words
Example:
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Candidate: the the the the the the the.
– “the” has a maximum reference count of 2, so the clipped unigram count is 2
– Total number of candidate unigrams = 7
– Modified unigram precision = 2/7
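A minimal sketch of clipped (modified) n-gram precision, assuming lowercased whitespace tokenization; shown here for unigrams, it reproduces the 2/7 from this example:

from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand_counts = ngrams(candidate.lower().split(), n)
    # For each n-gram, the candidate count is clipped by its maximum count in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

refs = ["The cat is on the mat", "There is a cat on the mat"]
print(modified_ngram_precision("the the the the the the the", refs))  # 2/7 ≈ 0.29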
28
Different N in Modified N-gram Precision
N > 1 (groups of words) is computed in a similar way:
– when unigram precision is high, the translation tends to satisfy adequacy
– when longer n-gram precisions are high, the translation tends to account for fluency
29
The METEOR Metric
Metric developed by Lavie et al. at CMU/LTI.
METEOR = Metric for Evaluation of Translation with Explicit Ordering
Main new ideas:
– Include both recall and precision as score components
– Look only at unigram precision and recall
– Align MT output with each reference individually and take the score of the best pairing
– Matching takes into account word variability (via stemming) and synonyms
– Scoring component weights are tuned to optimize correlation with human judgments
30
The METEOR Metric
Partial credit for matching stems:
SYSTEM: Jim went home
REFERENCE: Jim goes home
Partial credit for matching synonyms:
SYSTEM: Jim walks home
REFERENCE: Jim goes home
Use of paraphrases
31
METEOR vs. BLEU
Highlights of the main differences:
– METEOR word matches between translation and references include semantic equivalents (inflections and synonyms)
– METEOR combines precision and recall (weighted towards recall) instead of BLEU’s “brevity penalty”
– METEOR uses a direct word-ordering penalty to capture fluency instead of relying on higher-order n-gram matches
– METEOR can tune its parameters to optimize correlation with human judgments
Outcome: METEOR has significantly better correlation with human judgments, especially at the segment level.
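For reference, the original METEOR formulation (Banerjee & Lavie, 2005) combines unigram precision P and recall R into a recall-weighted harmonic mean and applies a fragmentation penalty based on the number of contiguous matched chunks ch out of m matched unigrams; later versions make these weights tunable:

F_{mean} = \frac{10\,P\,R}{R + 9P},
\qquad
Penalty = 0.5 \left( \frac{ch}{m} \right)^{3},
\qquad
\mathrm{METEOR} = F_{mean}\,(1 - Penalty)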
32
Correlation with Human Judgments
Human judgment scores for adequacy and fluency, each on a [1-5] scale (or summed together).
Correlation of metric scores with human scores at the system level:
– can rank systems
– even coarse metrics can have high correlations
Correlation of metric scores with human scores at the sentence level:
– evaluates score correlations at a fine-grained level
– very large number of data points, multiple systems
– Pearson correlation
– look at metric score variability for MT sentences scored as equally good by humans
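As an illustration of this meta-evaluation step, a minimal sketch (using SciPy; the score lists are invented placeholders, not real data) of correlating per-sentence metric scores with human scores:

from scipy.stats import pearsonr

metric_scores = [0.31, 0.45, 0.28, 0.62, 0.54]   # e.g. BLEU or METEOR per sentence (illustrative)
human_scores  = [2.0, 3.5, 2.5, 4.5, 4.0]        # e.g. adequacy + fluency per sentence (illustrative)

r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")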
33
Types of Machine Translation
Unassisted machine translation: takes pieces of text and translates them into output for immediate use, with no human involvement. The result is unpolished text that gives only a gist of the source text.
Assisted machine translation: uses a human translator to clean up after, and sometimes before, translation in order to get better-quality results. Usually the process is improved by limiting the vocabulary through use of a dictionary and restricting the types of sentences/grammar allowed.
34
Assisted Machine Translation
Assisted MT can be divided into Human-Aided Machine Translation (HAMT), where the machine translates with human help, and Machine-Aided Human Translation (MAHT), where the human translates with machine help. Computer-Aided Translation (CAT) is a more recent form of MAHT.
35
Types of Translation
[Diagram: computer-aided MT divides into Computer-Assisted Translation and Human-Assisted MT]
36
Computer-assisted translation
Computer-assisted translation (CAT), also called computer-aided translation or machine-aided human translation (MAHT), is a form of translation wherein a human translator creates a target text with the assistance of a computer program. The machine supports a human translator.
Computer-assisted translation can include standard dictionary and grammar software. It normally refers to a range of specialized programs available to the translator, including translation-memory, terminology-management, concordance, and alignment programs.
37
Translation Technology tools mean…
Advanced word-processing functions:
– complex file management
– understanding and manipulating complex file types
– multilingual word processing
– configuring templates and using tools such as AutoText
Information mining on the WWW
Translation Memory systems
Machine Translation
38
Translation Technology tools involve…
– Terminology extraction and management
– Software and web localisation
– Image editing (Photoshop)
– Corpora and text alignment
– XML and the localisation process
… and much, much more.
39
Translation Memory
Translation memory software stores matching source and target language segments that were translated by a translator in a database for future reuse.
Newly encountered source language segments are compared to the database content, and the resulting output (exact, fuzzy or no match) is reviewed by the translator.
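A minimal sketch of the lookup step, assuming an in-memory segment store and an illustrative 75% fuzzy-match threshold (the memory contents are invented; real TM tools use more sophisticated matching and storage):

from difflib import SequenceMatcher

memory = {
    "The contract enters into force on 1 January.": "Le contrat entre en vigueur le 1er janvier.",
    "The parties agree to the following terms.": "Les parties conviennent des conditions suivantes.",
}

def lookup(segment, threshold=0.75):
    # Compare the new source segment against every stored segment and keep the best score
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_score == 1.0:
        return "exact match", memory[best_source]
    if best_score >= threshold:
        return f"fuzzy match ({best_score:.0%})", memory[best_source]
    return "no match", None

print(lookup("The contract enters into force on 1 March."))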
40
Translation Memory (TM)
[Diagram: a human translator works on Text A, the translated segments are stored in a database, and the database is reused when the translator works on Text B]
41
Translation Memory (TM) tools
– DejaVu
– TRADOS
– SDLX
– Star Transit
42
Professional Translation can be done…
In principle, in any authoring editor (desktop/browser):
– however, with limited productivity (in the range of 800-1500 words per day) and high effort to maintain consistency and accuracy
Using Microsoft Word + plug-ins:
– plug-in to a translation productivity tool
– hard to deal with structured content
Using a dedicated translation editor (CAT):
– depending on various factors, a productivity boost in the range of 2000 to 5000 words per day
– a well-established market for professionals
43
Key Productivity Accelerators
44
Productivity Accelerators
– “Don’t translate if it hasn’t changed” (but show it, to provide context for the text that has actually changed or been added)
– “Don’t re-translate if you can reuse an (approved) existing translation” (but adapt as you need)
– “Adapt an automated translation proposal” (instead of translating from scratch)
– “Auto-propagate translations for identical source segments” (and ripple through any changes when you change your translation)
– “While I type, provide a list of relevant candidates so that I can quickly auto-complete this part of my translation”
– “While I type, make it easy for me to place tags, recognised terms and other placeables so I can focus on the translatable text”
– “Make it easy for me to search through Translation Memories, in both source and target text, and from wherever I am in the document I’m translating”
45
Great! See you next time!