1 TURKOISE: a Mechanical Turk-based Tailor-made Metric for Spoken Language Translation Systems in the Medical Domain
Workshop on Automatic and Manual Metrics for Operational Translation Evaluation, LREC 2014

2 Goal
o Test crowdsourcing as a way to evaluate our spoken language translation system MedSLT
  - ENG-SPA language combination
o Compare effort (time and cost) using Amazon Mechanical Turk
  - vs. classic in-house human evaluation
  - vs. BLEU (no high correlation in previous work; time needed to produce references; see the sketch below)
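
Not part of the original slides: a purely illustrative sketch of the BLEU workflow the last bullet refers to. The sentences and the use of the sacrebleu package are assumptions, not the authors' setup; the point is that producing the reference translations is the costly step.

```python
# Illustrative only: scoring MT output against human references with sacrebleu.
# The example sentences are invented; building the references is the time-consuming
# part mentioned on the slide.
import sacrebleu

hypotheses = ["where does it hurt", "do you have fever"]          # MT output
references = [["where does it hurt ?", "do you have a fever ?"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)   # corpus-level BLEU, on a 0-100 scale
```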

3 Experiment in 3 stages
o Tailor-made metric designed by in-house evaluators
o Amazon Mechanical Turk – pilot study:
  - Feasibility
  - Time
  - Cost
  - Is inter-rater agreement comparable to that of expert evaluators?
o AMT application, phase 2: how many evaluations are needed?

4 Tailor-made metric – TURKoise
o CCOR (4): The translation is completely correct. All the meaning from the source is present in the target sentence.
o MEAN (3): The translation is not completely correct. The meaning is slightly different, but it poses no danger of miscommunication between doctor and patient.
o NONS (2): The translation does not make any sense; it is gibberish. It is not correct in the target language.
o DANG (1): The translation is incorrect and the meanings of the target and source are very different. It conveys a false meaning that is dangerous for communication between doctor and patient.
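
Not from the slides: a minimal sketch of how the four TURKoise labels above could be encoded for later aggregation. The label names and point values come from the slide; the helper function is purely illustrative.

```python
# TURKoise labels and their point values, as defined on the slide above.
TURKOISE_SCALE = {
    "CCOR": 4,  # completely correct
    "MEAN": 3,  # meaning slightly off, but no danger of miscommunication
    "NONS": 2,  # gibberish, not correct target-language output
    "DANG": 1,  # false meaning, dangerous for doctor-patient communication
}

def mean_score(labels):
    """Average score for one sentence given the labels assigned by several
    workers, e.g. mean_score(["CCOR", "MEAN", "CCOR"]) is approximately 3.67."""
    return sum(TURKOISE_SCALE[label] for label in labels) / len(labels)
```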

5 AMT evaluation – facts
o Set-up
  - Creating the evaluation interface
  - Preparing the data
  - Selection phase
o Response time and costs
  - Cost per HIT (20 sentences) = $0.25, approx. $50 in total (see the rough check below)
  - Time: 3 days (pilot)
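
A rough back-of-envelope check of the cost figures above (not from the slides): the 145 HITs come from the next slide, and the MTurk requester fee rate is an assumption, since Amazon's commission has changed over the years.

```python
# Rough cost estimate: 145 HITs at $0.25 each, plus Amazon's requester fee.
n_hits = 145
reward_per_hit = 0.25                     # USD, as quoted on the slide
for fee_rate in (0.10, 0.20, 0.40):       # assumed fee rates, not the actual one
    total = n_hits * reward_per_hit * (1 + fee_rate)
    print(f"fee {fee_rate:.0%}: ~${total:.2f}")
# Depending on the fee (and extra qualification HITs), this lands in the
# $40-$51 range, consistent with the "approx. $50" reported above.
```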

6 AMT Tasks
o Selection task: a subset of the fluency task
o Fluency
o Adequacy
o TURKoise
  - A total of 145 HITs of 20 sentences each -> each of the 222 sentences in the corpus was evaluated 5 times for each task

7 Interface for the AMT worker

8 Crowd selection
o Selection task:
  - A HIT of 20 sentences for which the in-house evaluators achieved 100% agreement -> gold standard (see the screening sketch below)
  - Qualification assignment
o Time to recruit: within 24 h, 20 workers were selected
o Acceptance rate: 23/30 qualified workers
o Most of the final HITs were completed by 5 workers
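
The crowd selection above screens workers on a gold-standard HIT before granting the qualification. A minimal sketch of that screening logic follows; the sentence IDs, labels and the 80% threshold are invented for illustration, since the authors' actual criteria are not given on the slide.

```python
# gold[s]: label on which the in-house evaluators reached 100% agreement
gold = {"s01": "CCOR", "s02": "DANG", "s03": "MEAN"}          # illustrative data

# answers[worker][s]: label the candidate worker gave on the selection HIT
answers = {
    "workerA": {"s01": "CCOR", "s02": "DANG", "s03": "MEAN"},
    "workerB": {"s01": "CCOR", "s02": "MEAN", "s03": "NONS"},
}

def qualify(answers, gold, threshold=0.8):
    """Return the workers whose agreement with the gold labels reaches the
    (assumed) threshold and who would therefore be granted the qualification."""
    qualified = []
    for worker, labels in answers.items():
        agreement = sum(labels[s] == gold[s] for s in gold) / len(gold)
        if agreement >= threshold:
            qualified.append(worker)
    return qualified

print(qualify(answers, gold))   # ['workerA']
```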

9 Pilot results for TURKoise: in-house vs AMT

  TURKoise        In-house   AMT
  Unanimous       15%        32%
  4 agree         35%        26%
  3 agree         42%        37%
  Majority        92%        95%
  Fleiss' kappa
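
The agreement figures above are reported as Fleiss' kappa. For reference, here is a compact implementation of the standard formula (a sketch, not the authors' code); `ratings[i, j]` counts how many raters put sentence i into category j.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (n_sentences x n_categories) count matrix.
    Assumes every sentence was rated by the same number of raters
    (5 per sentence in the pilot described above)."""
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[0]
    n_raters = ratings[0].sum()
    p_j = ratings.sum(axis=0) / (n_items * n_raters)        # category frequencies
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                                       # observed agreement
    p_e = (p_j ** 2).sum()                                   # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```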

10 Phase 2: does more equal better?
o How many evaluations are needed? Compared in terms of Fleiss' kappa.

  Number of eval.   Fluency   Adequacy   TURKoise
  3-times AMT
  …-times AMT
  …-times AMT
  In-house
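
One way to approach the "how many evaluations are needed" question (a sketch under assumptions, not necessarily the authors' procedure) is to subsample k of the 5 collected labels per sentence and recompute Fleiss' kappa, reusing the `fleiss_kappa` function sketched above.

```python
import random
from collections import Counter

CATEGORIES = ["DANG", "NONS", "MEAN", "CCOR"]   # TURKoise labels

def kappa_with_k_raters(labels_per_sentence, k, seed=0):
    """labels_per_sentence: one list of worker labels per sentence (5 each).
    Draw k labels per sentence without replacement, build the count matrix
    and compute Fleiss' kappa on the reduced data."""
    rng = random.Random(seed)
    matrix = []
    for labels in labels_per_sentence:
        counts = Counter(rng.sample(labels, k))
        matrix.append([counts.get(c, 0) for c in CATEGORIES])
    return fleiss_kappa(matrix)   # defined in the sketch above
```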

11 Conclusion
o Success in setting up an AMT-based evaluation in terms of:
  - time and cost
  - number of AMT workers recruited in a short time
  - recruitment of reliable evaluators for a bilingual task
  - agreement achieved by AMT workers comparable to in-house evaluators
  - without recruiting a huge crowd

12 Further discussion
o Difficult to assess agreement:
  - Percentage of agreement
  - Kappa
    - Not easy to interpret
    - Not best suited for multiple raters and for prevalence effects in the data
  - Intraclass correlation coefficient – ICC (Hallgren, 2012); see the sketch below
o AMT is not globally accessible
  - Any experience with CrowdFlower?
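
For the ICC alternative mentioned above, and purely as an illustration with present-day tooling (the pingouin package and the column names are assumptions, not what the authors used), the computation could look like this:

```python
import pandas as pd
import pingouin as pg

# Long-format ratings: one row per (sentence, rater) pair; scores are the
# numeric TURKoise values (1-4). All data below is invented for illustration.
df = pd.DataFrame({
    "sentence": ["s01"] * 3 + ["s02"] * 3 + ["s03"] * 3,
    "rater":    ["w1", "w2", "w3"] * 3,
    "score":    [4, 4, 3, 1, 2, 1, 3, 3, 4],
})

icc = pg.intraclass_corr(data=df, targets="sentence",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC"]])   # ICC1, ICC2, ICC3 and their k-rater averages
```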

13 References
o Callison-Burch, C. (2009). Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore.
o Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8, pp. 23–34.