How do Humans Evaluate Machine Translation? Francisco Guzmán, Ahmed Abdelali, Irina Temnikova, Hassan Sajjad, Stephan Vogel.

Presentation transcript:

How do Humans Evaluate Machine Translation? Francisco Guzmán, Ahmed Abdelali, Irina Temnikova, Hassan Sajjad, Stephan Vogel

2 Machine Translation Evaluation
Task: compare a system's output to a source sentence and reference sentence.
The evaluation can be based on:
- Fluency and adequacy
- Ranking

3 Example
Source: Croácia sem capacidade para receber mais refugiados
Reference: Croatia without capacity to accept any more refugees
Translation: Croatia without capacity to receive more refugees

4 Common Issues
The task side:
- Difficult to distinguish between different aspects of translation such as grammar, syntax, etc.
- No insight into how humans use different sources of information
The human side:
- Tedious task: evaluators can lose interest quickly
- Evaluations are inconsistent and highly subjective

5 (Diagram) MT evaluation as a black box: source, references, and translations go in; a score or ranking comes out.

6 Study MT Evaluation as a Glass-Box
Sources of information:
- Source text
- Translation
- Reference
Background of evaluators:
- Monolingual (English speakers)
- Bilingual (Spanish+English speakers)
Comparison:
- Time to complete the task
- Consistency
- Using eye tracking
Note: "monolingual" is a loose definition here, meaning knowledge of Spanish insufficient to understand the source text.

7 Questions to Answer
- What is the most useful information for the evaluators?
- Where do users spend more time: evaluating the translation or understanding the task?
- What is the effect of the evaluator's background?
- Which evaluators are more consistent?

8 Experimental Setup

9 Participants
- 20 evaluators, 27 to 45 years old
- 85% computer scientists
- 50% had experience translating documents
- 20% had experience with MT evaluation
- None of the co-authors of this paper were in the pool of subjects

10 Interface
3 AOIs (Areas of Interest): Source (src), Translation, Reference (tgt)
Plus the Scorer

11 Eye-tracking Setup

12 Data
Spanish-to-English WMT12 evaluation data:
- 3003 different source sentences and references
- 12 different MT systems
- 1141 ranking annotations; in each annotation, evaluators ranked 5 of the 12 systems
Selection:
- Sentences with the most judgments: 150 source sentences were chosen
- 2 translations per sentence (300 total): the best and the worst by relative ranking
- 4 tasks created per translation (1200 total)
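As a rough illustration of the selection pipeline on this slide, the sketch below builds the task set from WMT-style ranking annotations; the record fields (sentence_id, system, rank) and the helper logic are assumptions for the example, not the authors' actual code.

from collections import defaultdict

def build_tasks(rank_annotations, n_sentences=150, n_replicates=4):
    """Pick the most-judged sentences, keep the best- and worst-ranked
    translation of each, and replicate every kept translation into
    several evaluation tasks (field names are illustrative)."""
    judgments = defaultdict(list)
    for ann in rank_annotations:        # e.g. {"sentence_id": 17, "system": "sys3", "rank": 2}
        judgments[ann["sentence_id"]].append(ann)

    # Sentences with the most ranking judgments.
    most_judged = sorted(judgments, key=lambda s: len(judgments[s]), reverse=True)[:n_sentences]

    tasks = []
    for sid in most_judged:
        # Average relative rank per system, then take the best and the worst translation.
        ranks = defaultdict(list)
        for ann in judgments[sid]:
            ranks[ann["system"]].append(ann["rank"])
        avg = {system: sum(r) / len(r) for system, r in ranks.items()}
        best, worst = min(avg, key=avg.get), max(avg, key=avg.get)
        for system in (best, worst):
            tasks += [{"sentence_id": sid, "system": system, "replicate": i}
                      for i in range(n_replicates)]
    return tasks                        # 150 x 2 x 4 = 1200 tasks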

13 Balanced experimental design
1200 evaluation tasks (300 translations x 4 replicates), balanced across users, scenarios, and sentence length (bins averaging ~10.18, ~18.18, and ~30.88 words)

14 Experimental Design (per user)
- 60 evaluation tasks, covering each task scenario and task sentence length
- A user never saw the same translation (or related translations) twice
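One way such an assignment could be generated is a simple round-robin over the replicates of each translation, sketched below; this is a guess at the mechanics (the actual design also balanced scenarios and sentence lengths per user).

from collections import defaultdict

def assign_tasks(tasks, n_users=20):
    """Spread the replicates of each translation over different users so
    that no user evaluates the same translation twice (illustrative only)."""
    by_translation = defaultdict(list)
    for t in tasks:
        by_translation[(t["sentence_id"], t["system"])].append(t)

    assignments = defaultdict(list)      # user id -> list of tasks
    user = 0
    for replicates in by_translation.values():
        for task in replicates:          # consecutive users receive the replicates
            assignments[user % n_users].append(task)
            user += 1
    return assignments                   # 1200 tasks / 20 users = 60 tasks each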

15 How to Compare the Performance?
Time: measure the effective time (i.e. time spent in the different areas)
Score consistency (variance): variability of the scores given by different evaluators (C) to the same translation, averaged across all translations (T)
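Written out, the consistency measure described above corresponds to something like the following (our notation, not the paper's), with s_{c,t} the score evaluator c in C gives to translation t in T:

\text{consistency} = \frac{1}{|T|} \sum_{t \in T} \operatorname{Var}_{c \in C}\!\left(s_{c,t}\right),
\qquad
\operatorname{Var}_{c \in C}\!\left(s_{c,t}\right) = \frac{1}{|C|} \sum_{c \in C} \left(s_{c,t} - \bar{s}_{t}\right)^{2},

where \bar{s}_{t} is the mean score for translation t; lower values indicate more consistent evaluators.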

16 Score Consistency (illustration): scores of two groups on a 0-100 scale together with their group averages; low consistency means the scores are spread widely, high consistency means they cluster around the average.

17 Results

18 Time Spent in Each Scenario
Key: source only (src), source and reference (src+tgt), reference only (tgt)
- Bilinguals are faster than monolinguals
- Showing only the reference (tgt) is the fastest condition
- Bilinguals spend about the same time in src vs. src+tgt

19 Time vs. Source Sentence Length
- Bilinguals are faster (again!)
- Longer sentences take longer (expected)

20 Where do Evaluators Look?
- Most of the time: reading the source or the reference
- Bilinguals spend less time reading the translation
- Monolinguals spend a large proportion of time in the source
- Alternative strategies? Spanish-English relatedness?
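As a sketch of how such proportions can be derived from raw gaze data, the snippet below maps fixations to rectangular AOIs and sums dwell time; the fixation format and AOI coordinates are invented for the example and are not the paper's toolchain.

def time_per_aoi(fixations, aois):
    """Sum fixation durations per area of interest (AOI) and return the
    share of effective time (time inside any AOI) spent in each one.
    fixations: [{"x": 512, "y": 140, "duration_ms": 230}, ...]
    aois: {"src": (x0, y0, x1, y1), "translation": (...), "tgt": (...)}"""
    totals = {name: 0 for name in aois}
    for f in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= f["x"] <= x1 and y0 <= f["y"] <= y1:
                totals[name] += f["duration_ms"]
                break                    # a fixation falls in at most one AOI
    effective = sum(totals.values())
    return {name: t / effective for name, t in totals.items()} if effective else totals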

21 Consistency
- Monolinguals show less variance than bilinguals
- Monolinguals are most consistent in the tgt scenario
- Bilinguals show more variance in the tgt scenario (possible lack of familiarity)

22 Summary
- Bilinguals perform the tasks faster.
- Bilinguals spend less time evaluating the translation.
- Monolinguals are slower, but more consistent.
- The more information is displayed on the screen, the longer it takes to complete the evaluation.

23 Our Recommendation
For human MT evaluation tasks:
- Scenario: only target-language information (the reference)
- Evaluators: monolinguals

24 A grain of salt
- Need to randomize the order of the scenarios
- Replication is necessary in other languages
- Does this hold for ranking-based evaluations?

25 Thank you! Questions?
Data: github.com/Qatar-Computing-Research-Institute/wmt15eyetracking
Eye-tracking-enabled Appraise: github.com/Qatar-Computing-Research-Institute/iAppraise

26 Why are Monolinguals More Consistent? We can only speculate:
- Bilinguals could formulate their own set of plausible translations, making the evaluation more subjective (faster, but more subjective)
- Monolinguals are constrained to a concrete representation (slower, but less subjective)

27 About Bilinguals in a Monolingual Setting
Why are there differences? Hypothesis: the order of the tasks encourages evaluators to learn different strategies.

28 Do Evaluators Get Faster With Time?
Low correlation between task order and time to complete, in each of the three scenarios (src, src+tgt, tgt).
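A minimal sketch of this check, assuming per-scenario lists of (task order, completion time) pairs; Spearman correlation is one reasonable choice, not necessarily the statistic used in the paper.

from scipy.stats import spearmanr

def learning_effect(sessions):
    """sessions: {"src": [(order, seconds), ...], "src+tgt": [...], "tgt": [...]}
    Returns the correlation between task order and completion time per scenario;
    values near zero suggest evaluators do not get noticeably faster."""
    return {scenario: spearmanr([o for o, _ in pairs],
                                [t for _, t in pairs]).correlation
            for scenario, pairs in sessions.items()}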

29 Feedback
Gamification of the evaluation process: feedback provided in the form of “stars”.
Feedback based on:
- WMT human judgments
- DiscoTK-party metric (winner of the WMT14 metrics task)
Stars computed as the relative distance to the gold score.
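A plausible reading of the star computation, with made-up thresholds (the exact mapping is not given on the slide):

def stars(user_score, gold_score, scale=100, n_stars=5):
    """Map the relative distance between the evaluator's score and the
    gold score (WMT judgments / DiscoTK-party) onto a 1..n_stars rating."""
    distance = abs(user_score - gold_score) / scale   # 0.0 = perfect agreement, 1.0 = maximal disagreement
    return max(1, n_stars - int(distance * n_stars))

# e.g. stars(80, 85) -> 5 stars, stars(80, 40) -> 3, stars(100, 0) -> 1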

30 Related Work: Eye-tracking
- Doherty et al. (2010): use eye-tracking to evaluate the comprehensibility of MT output in French
- Stymne et al. (2012): apply eye-tracking to MT error analysis
- Doherty and O’Brien (2014): use eye-tracking to evaluate the usability of MT by an end user
- Michael Carl: eye-tracking in the translation process and post-editing

31 Evaluation
- Only one translation at a time
- Continuous score, “more natural”
- Instructions were kept simple