
1 An Evaluation Competition? Eight Reasons to be Cautious
Donia Scott, Open University & Johanna Moore, University of Edinburgh

2 1. All that Glitters is not Gold
- Evaluation requires a gold standard, i.e., clearly specified input/output pairs
- Does this make sense for NLG?
- For most NLG tasks, there is no one right answer (Walker, LREC 2005)
- Any output that allows the user to successfully perform the task is acceptable
- “Using human outputs assumes they are properly geared to the larger purpose the outputs are meant for.” (KSJ, p.c.)
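To make the gold-standard assumption concrete, here is a minimal sketch (the data and function names are invented for illustration, not from the talk): a gold standard pairs each input with one "correct" output, so any acceptable variation is penalized.

```python
# Hypothetical gold-standard harness: each input is paired with ONE reference.
GOLD = {
    "temp=12C,wind=NW,rain=0.8": "Rain is likely, with a cool north-westerly breeze.",
}

def exact_match_score(system_output: str, input_key: str) -> float:
    """Score 1.0 only if the system reproduces the reference verbatim."""
    return 1.0 if system_output == GOLD[input_key] else 0.0

# Both candidates convey the same content and would serve the user equally
# well, but only the first "wins" under a gold-standard comparison.
print(exact_match_score("Rain is likely, with a cool north-westerly breeze.",
                        "temp=12C,wind=NW,rain=0.8"))  # 1.0
print(exact_match_score("Expect rain and a chilly NW wind.",
                        "temp=12C,wind=NW,rain=0.8"))  # 0.0
```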

3 2. What's good for the goose…
- Most important criterion is "fitness for purpose"
- Can't compare output of systems designed for different purposes
- NLG systems (unlike parsing and MT?) serve a wide range of functions

4 3. Don't count on metrics
- Summarization and MT communities are questioning the usefulness of their shared metrics
- BLEU does not correlate with human judgements of translation quality (Callison-Burch et al., EACL 2006)
- BLEU should only be used to compare versions of the same system (Knight, EACL 2006 invited talk)
- Will nuggets of pyramids topple over?
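To see the kind of mismatch Callison-Burch et al. describe, here is a small illustration (invented sentences, not data from the talk) using NLTK's sentence_bleu: a fluent paraphrase that a human judge would accept can receive a far lower n-gram overlap score than a near-verbatim copy.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
# A fluent paraphrase a human judge would likely rate highly:
paraphrase = ["a", "cat", "was", "sitting", "on", "a", "mat"]
# A near-verbatim copy of the reference:
verbatim = ["the", "cat", "sat", "on", "a", "mat"]

smooth = SmoothingFunction().method1  # avoid zero scores for missing n-grams
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # very low
print(sentence_bleu(reference, verbatim, smoothing_function=smooth))    # high
```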

5 4. What's the input?
- There is no agreed input for any stage of the NLG pipeline
- Nor agreement on where the NLG problem starts, e.g., weather report generation:
  - Is the input raw weather data, or significant events determined by a weather-analysis program?
  - Weather forecasting is not part of the NLG problem!
  - But the quality of the text depends on the quality of the data analysis!
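As a concrete picture of the two candidate starting points (hypothetical types, not from the talk): the "raw data" view hands the generator instrument readings; the "significant events" view hands it the output of a separate analysis step, whose quality then caps the quality of the text.

```python
from dataclasses import dataclass

# Option A: NLG starts from raw instrument readings.
@dataclass
class RawReading:
    station: str
    hour: int
    temp_c: float
    wind_dir: str
    rainfall_mm: float

# Option B: NLG starts from significant events already extracted
# by an upstream weather-analysis program.
@dataclass
class SignificantEvent:
    kind: str        # e.g. "temperature_drop", "heavy_rain"
    start_hour: int
    end_hour: int
    severity: str    # e.g. "marked", "slight"

# Under Option B, a poor analysis step (a missed temperature drop, say)
# limits the text no matter how good the generator is.
```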

6 5. What to standardize/evaluate?
Realization (for example):
- Should the input contain rhetorical goals (a la Hovy)? Information structure?
- Should the output contain prosodic markup?
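One way to picture the stakes of this choice (an invented notation, not a proposed standard): whatever annotation layers the shared realizer input carries, every participating system must consume, or at least tolerate, them.

```python
# Hypothetical realizer input with optional annotation layers.
realizer_input = {
    "proposition": {"pred": "rise", "arg0": "temperature", "degree": "sharp"},
    # Should these layers be part of the standard input?
    "rhetorical_goal": "WARN",          # a la Hovy's rhetorical goals
    "info_structure": {"theme": "temperature", "rheme": "sharp rise"},
    # And should the standard output carry prosodic markup,
    # e.g. pitch-accent tags on "TEMperature" and "SHARPly"?
}
```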

7 6. Plug and play delusion
- Requires agreeing on interfaces at each stage of the pipeline
- Not "it's gonna be XML"
- Must define the representations to be passed and their semantics (a la RAGS)
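A sketch of what defining representations and their semantics might mean in practice (hypothetical interfaces, loosely in the spirit of RAGS, not an actual RAGS specification): agreeing that stages exchange XML says nothing about what the exchanged objects mean.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DocumentPlan:
    """Agreed semantics: messages ordered by a rhetorical relation."""
    rhetorical_relation: str
    messages: list

@dataclass
class SentencePlan:
    """Agreed semantics: a lexicalized, aggregated spec for one sentence."""
    lexicalized_spec: dict

class Microplanner(Protocol):
    def plan(self, doc: DocumentPlan) -> list[SentencePlan]: ...

class Realizer(Protocol):
    def realize(self, plan: SentencePlan) -> str: ...

# Serializing these as XML is the easy part; the hard part is agreeing on
# what a DocumentPlan may contain and what each field means.
```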

8 7. Who will pay the piper?
- It's pretty clear why DARPA pays for ASR, MT, Summarization, TDT, TREC, etc.
- What's the killer app for NLG?
- The fact that NSF is holding this workshop and consulting the research community is a very good sign

9 8. Stifling Science
- To push this forward, we have to agree on the input (and interfaces)
- Whatever we agree on will limit the phenomena we study and the theories we can test (e.g., SPUD)
- Hard to find a task that allows study of all the phenomena the community is interested in (e.g., MapTask)

10 What are we evaluating?
Is the text (speech) generated:
- Grammatical?
- Natural?
- Easy to comprehend?
- Memorable?
- Suitable to enable the user to achieve their intended purpose?
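These dimensions pull apart in practice. A sketch of a multi-dimensional judgment record (hypothetical fields and scales, not from the talk) makes the point that a single aggregate score hides the trade-offs:

```python
from dataclasses import dataclass

@dataclass
class HumanJudgment:
    """One judge's ratings of one generated text (hypothetical 1-5 scales)."""
    grammaticality: int
    naturalness: int
    comprehensibility: int
    memorability: int
    task_success: bool   # did the user achieve their intended purpose?

# A text can score 5 on grammaticality yet fail on task_success;
# collapsing these into one number hides exactly what we need to learn.
```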

11 Recommendations
- Must be clear about who is going to learn what from the (very large) effort
- The task chosen must:
  - be realistic, i.e., reflect how effectively the generated text (or speech) enables the user to achieve their purpose
  - inform NLG research, i.e., help us learn things that enable development of better systems

12 Evaluation competition for NLG?
Donia Scott and Johanna Moore
- Evaluation is (obviously!) important, but doing this properly in NLG is very hard
- The progress of the field is (obviously!) important, but NLG has always lagged behind NLU, and all signs point to this gap widening
- There is no evidence to suggest that an evaluation competition, as described by A&E, would be a remedy to either problem; it could even be further damaging
- Read our paper-ette…

13 How can we progress the field?
- Very careful consideration given to why NLG is in decline:
  - Science: few people tackling the hard theoretical problems
  - Applications: no killer app has yet emerged; for most current applications, the NLG component is only an engineering problem
- Very careful consideration given to how best to:
  - meet the real evaluation needs of the field
  - enable sharing/re-use of data and components
  - progress the science
- Do better science! Where possible/suitable, use RAGS and take it further
- Evaluation competition?
  - The notion of 'task' as conceived is simplistic/shallow
  - Who is supposed to learn from this? Not enough to say that we'll learn about NLG
  - The field will need to be in much better shape to benefit from such a beast

14 Thank You!

