Evaluation: State-of-the-art and future actions
Bente Maegaard, CST, University of Copenhagen


Evaluation at LREC
More than 150 papers were submitted to the Evaluation track, both Written and Spoken. This is a significant rise compared to previous years: evaluation as a field is attracting increasing interest. Many papers discuss evaluation methodology; the field is still under development, and the answers to some of the methodological questions are still not known. An example is MT: automatic evaluation versus evaluation in context (task-based, function-based).

Evaluation (Written): submissions by topic

  Parsing evaluation                        6
  Semantics, sense                          6
  Evaluation methodologies                  7
  Time annotation                           9
  MT                                       13
  Annotation, alignment, morphology        15
  Lexica, tools                            21
  QA, IR, IE, summarisation, authoring     25
  Total                                   102

Note: these figures may contain papers that were originally in other tracks.

Discussion: MT evaluation
MT evaluation has been carried out since 1965. Van Slype: adequacy, fluency, fidelity. Human evaluation is expensive and time-consuming, and counting errors is problematic: is it objective? Human evaluation can be formalised, adding e.g. grammaticality. Another measure is the cost of post-editing, which is objective. Automatic evaluation: Papineni et al. 2001 introduced BLEU, followed by various modifications. It is expensive to establish the reference translations, but after that evaluation is cheap and fast. However, research shows that this automatic method does not correlate well with human evaluation, nor with the cost of post-editing etc. Automatic statistical evaluation can probably be used for evaluation of MT for gisting, but it cannot be used for MT for publishing.
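
To make the reference-translation point concrete, here is a minimal sketch (not from the original slides) using NLTK's sentence-level BLEU; the example sentences and the choice of NLTK are assumptions made purely for illustration.

```python
# Minimal BLEU sketch: how the score depends on the available references.
# Assumes NLTK is installed (pip install nltk); the sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid zero 4-gram precision on short segments

hypothesis = "the cat is sitting on the mat".split()
refs_one   = ["the cat is on the mat".split()]
refs_both  = refs_one + ["a cat was sitting on the mat".split()]

# With a single reference, the legitimate variant "sitting" is penalised;
# the second reference covers it, so the same output scores higher.
print(sentence_bleu(refs_one,  hypothesis, smoothing_function=smooth))
print(sentence_bleu(refs_both, hypothesis, smoothing_function=smooth))
```

The point of the sketch is only that the metric is as good as its references: collecting them is the expensive step, and no amount of n-gram matching guarantees agreement with human judgements of adequacy or with post-editing cost.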

Digression: metrics that do not work
Why is it so difficult to evaluate MT? Because there is more than one correct answer, and because answers may be more or less correct. Measures like WER (word error rate) are not relevant for MT: methods relying on a specific number of words in the translation are not acceptable if the translation does not have the same number of words as the reference.
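
For reference (this formula is added here, not part of the original slide), WER scores a hypothesis against a single reference of length N by counting the substitutions S, deletions D and insertions I in the best word alignment:

\[
  \mathrm{WER} = \frac{S + D + I}{N}
\]

Because the denominator is fixed by one reference and every legitimate rewording or length difference counts as an edit, a perfectly acceptable translation can receive a poor score, which is exactly the problem the slide points out.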

Generic Contextual Quality Model (GCQM)
Popescu-Belis et al. (LREC 2006), building on the same thinking as the FEMTI taxonomy: one can only evaluate a system in the context in which it will be used. The Quality workshop on 27/5 addressed task-based, function-based evaluation (Huang, Take). Karen Spärck Jones: 'the set-up'. So the understanding that a system can only be reasonably evaluated with respect to a specific task is accepted. Domain-specific vs. general-purpose MT.

What do we need? When?
What? In the field of MT evaluation we need more experiments in order to establish a methodology; the French CESTA campaign (Hamon et al., LREC 2006) is a good example. So we need international cooperation for the infrastructure, but in the first instance this cooperation should lead to reliable metrics for MT evaluation. Later on it may be used for actually measuring the performance of MT systems. (Of course not only MT!)
When? As soon as possible. Start with methodology, for each application, then move on to doing evaluation. Goal: by 2011 we can reliably evaluate MT, and other applications!