Slide 1: NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (University of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Slide 2: Contents
• General comments
• Geneval proposal
Slide 3: Good Points of Shared Tasks
• Compare different approaches
• Encourage people to interact more
• Reduce NLG “barriers to entry”
• Better understanding of evaluation
Slide 4: Bad Points
• May narrow the focus of the community
  » Did IR ignore web search because of TREC?
• May encourage incremental research instead of new ideas
Slide 5: My Opinion
• Let's give it a try
• But I suspect one-off exercises are better than a series
  » Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time
Slide 6: Practical Issues
• Domain/task?
  » Need something that several (6?) groups are interested in
• Evaluation technique
  » Avoid techniques that are biased
    – E.g., some automatic metrics may favour statistical systems
Slide 7: Geneval
• Proposal to evaluate NLG evaluation
  » Core idea: evaluate a set of systems with similar input/output functionality in many different ways, and see how well the various evaluation techniques correlate
  » Anja Belz and Ehud Reiter
  » Hope to submit to EPSRC (roughly the UK equivalent of the US NSF) soon
Slide 8: NLG Evaluation
• Many types
  » Task-based, human ratings, BLEU-like metrics, etc.
• Little consensus on the best technique
  » I.e., on which is most appropriate for a given context
• Poorly understood
Slide 9: Some Open Questions
• How well do different evaluation types correlate?
  » E.g., does BLEU predict human ratings? (see the correlation sketch below)
• Are there biases?
  » E.g., are statistical NLG systems over- or under-rated by some techniques?
• What is the best design?
  » Number of subjects, subject expertise, number (and quality) of reference texts, etc.
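On the first question, here is a minimal sketch (my own illustration, not part of the proposal) of how one might test whether BLEU predicts human ratings: collect one score per system under each technique, then correlate the two lists. The per-system scores below are invented numbers, and the use of scipy is an assumption.

```python
# Hypothetical sketch: correlating an automatic metric with human ratings
# across systems. All scores below are invented for illustration.
from scipy.stats import pearsonr, spearmanr

bleu_scores   = [0.41, 0.35, 0.52, 0.28, 0.47]  # mean BLEU per system
human_ratings = [3.8, 3.1, 4.2, 2.6, 3.9]       # mean human rating per system

r, p_r = pearsonr(bleu_scores, human_ratings)       # linear correlation
rho, p_rho = spearmanr(bleu_scores, human_ratings)  # rank correlation

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

A high rank correlation would suggest the metric orders systems much as human judges do; Belz and Reiter (2006), summarised on the next slide, ran essentially this comparison for wind-forecast generators.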
Slide 10: Belz and Reiter (2006)
• Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics
• Found OK (not wonderful) correlation, but also some biases
• Geneval: do this on a much larger scale
  » More domains, more systems, more evaluation techniques (including new ones), etc.
Slide 11: Geneval: Possible Domains
• Weather forecasts (not wind statements)
  » Use the SumTime corpus
• Referring expressions
  » Use the Prodigy-GREC or TUNA corpus
• Medical summaries
  » Use the Babytalk corpus
• Statistical summaries
  » Use the Atlas corpus
Slide 12: Geneval: Evaluation Techniques
• Human task-based
  » E.g., referential success
• Human ratings
  » Likert vs. preference judgements; expert vs. non-expert judges
• Automatic metrics based on reference texts
  » BLEU, ROUGE, METEOR, etc. (both automatic families are sketched after this list)
• Automatic metrics without reference texts
  » MT T and X scores, length
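To make the two automatic families concrete, a hedged sketch (my illustration; the forecast texts are invented, not SumTime or Geneval data) of one reference-based metric, BLEU as implemented in NLTK, next to a trivial reference-free one, output length:

```python
# Sketch of the two automatic-metric families on an invented forecast pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "southwesterly 10 to 15 increasing 20 later".split()
system    = "southwest 10 to 15 rising 20 by evening".split()

# Reference-based: BLEU against one (or more) reference forecasts.
smooth = SmoothingFunction().method1   # avoids zero scores on short texts
bleu = sentence_bleu([reference], system, smoothing_function=smooth)

# Reference-free: raw output length, computed from the output alone.
length = len(system)

print(f"BLEU = {bleu:.3f}, length = {length} tokens")
```

Reference-free scores are computed from the output alone, which is precisely why their possible biases need checking against human judgements.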
Slide 13: Geneval: New Techniques
• Would also like to explore and develop new evaluation techniques
  » Post-edit based human evaluations? (one possible measure is sketched below)
  » Automatic metrics which look at semantic features?
  » Open to suggestions for other ideas!
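For the post-edit idea, one plausible measure (my assumption; the slides leave this open) is the token-level edit distance between the system output and an expert's corrected version, normalised by length, so that fewer required edits means a better output. The texts below are invented:

```python
# Hypothetical post-edit measure: normalised token edit distance between
# a system output and its human-corrected version (texts are invented).
from nltk.metrics.distance import edit_distance

system_output = "wind south 10 15 increasing later".split()
post_edited   = "southerly 10 to 15 increasing later".split()

# Fewer edits needed to reach the corrected text -> better output.
effort = edit_distance(system_output, post_edited) / len(post_edited)
print(f"normalised post-edit distance = {effort:.2f}")
```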
Slide 14: Would Like Systems Contributed
• The study would be better if other people contributed systems
  » We supply the data sets and corpora, and carry out the evaluations
  » So you can focus 100% on your great new algorithmic ideas!
Slide 15: Geneval from a STEC Perspective
• Sort of like a STEC?
  » If people contribute systems based on our data sets and corpora
  » But results will be anonymised
    – Only the developer of system X knows how well X did
  » One-off exercise, not repeated
  » Multiple evaluation techniques
• Hope the data sets will reduce barriers to entry
Slide 16: Geneval
• Please let Anja or me know if
  » You have general comments, and/or
  » You have a suggestion for an additional evaluation technique, and/or
  » You might be interested in contributing a system