
An Evaluation Competition? Eight Reasons to be Cautious
Donia Scott (Open University) & Johanna Moore (University of Edinburgh)

1. All that Glitters is not Gold
- Evaluation requires a gold standard, i.e., clearly specified input/output pairs. Does this make sense for NLG?
- For most NLG tasks, there is no one right answer (Walker, LREC 2005): any output that allows the user to successfully perform the task is acceptable.
- “Using human outputs assumes they are properly geared to the larger purpose the outputs are meant for.” (KSJ, p.c.)

2. What’s good for the goose…
- The most important criterion is “fitness for purpose”.
- We can’t compare the output of systems designed for different purposes.
- NLG systems (unlike parsing and MT?) serve a wide range of functions.

3. Don’t count on metrics
- The Summarization and MT communities are questioning the usefulness of their shared metrics.
- BLEU does not correlate with human judgements of translation quality (Callison-Burch et al., EACL 2006). (See the sketch below.)
- BLEU should only be used to compare versions of the same system (Knight, EACL 2006 invited talk).
- Will nuggets of pyramids topple over?
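To make the BLEU concern concrete, here is a minimal sketch of modified n-gram precision, the quantity BLEU is built from. This is not the official BLEU script, and the forecast sentences are invented for illustration: two equally acceptable outputs for the same weather report score very differently against a single reference.

```python
# Minimal sketch (not the official BLEU script): modified n-gram
# precision against a single reference, to show why two equally
# acceptable generated sentences can score very differently.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

reference = "heavy rain is expected over the north coast tonight".split()
cand_a = "heavy rain is expected over the north coast tonight".split()
cand_b = "expect heavy rain tonight along the northern coastline".split()

for name, cand in [("A", cand_a), ("B", cand_b)]:
    p1 = modified_precision(cand, reference, 1)
    p2 = modified_precision(cand, reference, 2)
    print(f"candidate {name}: 1-gram precision {p1:.2f}, 2-gram precision {p2:.2f}")
```

Candidate A scores 1.0 at both orders; candidate B, which conveys the same information, scores 0.50 at the unigram level and about 0.14 at the bigram level. A metric this sensitive to surface form is a poor fit for a task with no one right answer.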

4. What’s the input?
- There is no agreed input for any stage of the NLG pipeline, nor even agreement on where the NLG problem starts.
- E.g., weather report generation: is the input raw weather data, or significant events determined by a weather-analysis program?
- Weather forecasting is not part of the NLG problem, but the quality of the text depends on the quality of the data analysis! (See the sketch below.)
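As a hypothetical illustration of where the input line might be drawn for weather-report generation, compare the two candidate input representations below; the field names and values are invented for this sketch.

```python
# Two candidate "inputs" to the same weather-report generation task;
# all names and values here are invented for illustration.
raw_readings = {  # option 1: raw sensor data, before any analysis
    "timestamps_utc": ["00:00", "04:00", "08:00", "12:00", "16:00", "20:00"],
    "wind_speed_kts": [8, 9, 14, 22, 31, 28],
}

significant_events = [  # option 2: output of a weather-analysis stage
    {
        "event": "wind_increase",
        "from_kts": 9,
        "to_kts": 31,
        "period": ("04:00", "16:00"),
        "significance": "gale_warning",
    },
]
```

A system evaluated on the first representation is also being scored on its data analysis; one evaluated on the second is not. The “same” shared task would therefore measure different things depending on where the input line is drawn.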

5. What to standardize/evaluate?
Take realization, for example:
- Should the input contain rhetorical goals (à la Hovy)? Information structure?
- Should the output contain prosodic markup? (See the example below.)
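To show what the prosodic-markup question amounts to in practice, here are two possible realiser outputs for the same sentence. The second uses SSML, a real markup standard, purely as an example of what an evaluation would have to agree on.

```python
# Two possible realiser outputs for the same content. Whether the
# "official" output of the realization stage is the plain string or
# the marked-up one is exactly the kind of decision a shared task
# would have to standardize.
plain_output = "Heavy rain is expected tonight."

ssml_output = """\
<speak>
  <emphasis level="strong">Heavy rain</emphasis> is expected
  <break time="200ms"/> tonight.
</speak>"""
```

A string-matching metric cannot compare these two outputs directly; the evaluation design has to fix which level of output counts.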

6. Plug and play delusion
- Plug-and-play requires agreeing on interfaces at each stage of the pipeline.
- It is not enough to say “it’s gonna be XML”: we must define the representations to be passed and their semantics (à la RAGS). (See the sketch below.)
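A minimal sketch of the difference between agreeing on a serialisation and agreeing on a representation with semantics. The types below are invented for illustration, loosely in the spirit of RAGS-style typed interfaces rather than taken from the RAGS specification.

```python
# Agreeing on a serialisation (XML, JSON) is not the same as agreeing
# on what the representation *means*. These types are invented for
# illustration only.
from dataclasses import dataclass

@dataclass
class RhetoricalRelation:
    relation: str          # e.g. "elaboration" -- but which theory's inventory?
    nucleus: "DocumentPlan"
    satellite: "DocumentPlan"

@dataclass
class DocumentPlan:
    content: object        # semantic content -- in which formalism?
    relations: list        # list of RhetoricalRelation
```

Two systems can both emit well-formed XML encoding these structures and still disagree on every question raised in the comments; fixing the syntax settles none of them.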

7. Who will pay the piper?
- It’s pretty clear why DARPA pays for ASR, MT, Summarization, TDT, TREC, etc. What’s the killer app for NLG?
- The fact that NSF is holding this workshop and consulting the research community is a very good sign.

8. Stifling Science
- To push this forward, we have to agree on the input (and interfaces).
- Whatever we agree on will limit the phenomena we can study and the theories we can test (e.g., SPUD).
- It is hard to find a task that allows study of all the phenomena the community is interested in (e.g., MapTask).

What are we evaluating? Is the generated text (or speech):
- Grammatical?
- Natural?
- Easy to comprehend?
- Memorable?
- Suitable to enable the user to achieve their intended purpose?

Recommendations
- We must be clear about who is going to learn what from the (very large) effort.
- The task chosen must:
  - be realistic, i.e., reflect how effectively the generated text (or speech) enables the user to achieve their purpose;
  - inform NLG research, i.e., help us learn things that enable the development of better systems.

Evaluation competition for NLG? Donia Scott and Johanna Moore
- Evaluation is (obviously!) important, but doing it properly in NLG is very hard.
- The progress of the field is (obviously!) important, but NLG has always lagged behind NLU, and all signs point to this gap widening.
- There is no evidence to suggest that an evaluation competition, as described by A&E, would remedy either problem; it could even do further damage.
- Read our paper-ette…

How can we progress the field?
- Give very careful consideration to why NLG is in decline:
  - Science: few people are tackling the hard theoretical problems.
  - Applications: no killer app has yet emerged, and for most current applications the NLG component is only an engineering problem.
- Give very careful consideration to how best to:
  - meet the real evaluation needs of the field;
  - enable sharing/re-use of data and components;
  - progress the science.
- Do better science! Where possible/suitable, use RAGS and take it further.
- An evaluation competition?
  - The notion of ‘task’ as conceived is simplistic/shallow.
  - Who is supposed to learn from this? It is not enough to say that we’ll learn about NLG.
  - The field will need to be in much better shape to benefit from such a beast.

Thank You!