Slide 1: NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (University of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Slide 2: Contents
• General comments
• Geneval proposal
Slide 3: Good Points of Shared Tasks
• Compare different approaches
• Encourage people to interact more
• Reduce NLG “barriers to entry”
• Better understanding of evaluation
Slide 4: Bad Points
• May narrow the focus of the community
  » Did IR ignore web search because of TREC?
• May encourage incremental research instead of new ideas
Slide 5: My Opinion
• Let's give it a try
• But I suspect one-off exercises are better than a series
  » Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time
Slide 6: Practical Issues
• Domain/task?
  » Need something that several (6?) groups are interested in
• Evaluation technique
  » Avoid techniques that are biased
    – E.g., some automatic metrics may favour statistical systems
Slide 7: Geneval
• Proposal to evaluate NLG evaluation
  » Core idea: evaluate a set of systems with similar input/output functionality in many different ways, and see how well the various evaluation techniques correlate
  » Anja Belz and Ehud Reiter
  » Hope to submit to EPSRC (roughly the UK equivalent of the US NSF) soon
Slide 8: NLG Evaluation
• Many types
  » Task-based, human ratings, BLEU-like metrics, etc.
• Little consensus on the best technique
  » I.e., on which is most appropriate for a given context
• Poorly understood
Slide 9: Some Open Questions
• How well do different evaluation types correlate?
  » E.g., does BLEU predict human ratings? (see the correlation sketch below)
• Are there biases?
  » E.g., are statistical NLG systems over- or under-rated by some techniques?
• What is the best design?
  » Number of subjects, subject expertise, number (and quality) of reference texts, etc.
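On the first question, here is a minimal sketch (my own illustration, not part of the proposal) of how one might test whether BLEU predicts human ratings: collect one score per system under each technique, then correlate the two lists. The per-system scores below are invented numbers, and the use of scipy is an assumption.

```python
# Hypothetical sketch: correlating an automatic metric with human ratings
# across systems. All scores below are invented for illustration.
from scipy.stats import pearsonr, spearmanr

bleu_scores   = [0.41, 0.35, 0.52, 0.28, 0.47]  # mean BLEU per system
human_ratings = [3.8, 3.1, 4.2, 2.6, 3.9]       # mean human rating per system

r, p_r = pearsonr(bleu_scores, human_ratings)       # linear correlation
rho, p_rho = spearmanr(bleu_scores, human_ratings)  # rank correlation

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

A high rank correlation would suggest the metric orders systems much as human judges do; Belz and Reiter (2006), summarised on the next slide, ran essentially this comparison for wind-forecast generators.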
Slide 10: Belz and Reiter (2006)
• Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics
• Found OK (not wonderful) correlation, but also some biases
• Geneval: do this on a much larger scale
  » More domains, more systems, more evaluation techniques (including new ones), etc.
Slide 11: Geneval: Possible Domains
• Weather forecasts (not wind statements)
  » Use the SumTime corpus
• Referring expressions
  » Use the Prodigy-GREC or TUNA corpus
• Medical summaries
  » Use the Babytalk corpus
• Statistical summaries
  » Use the Atlas corpus
Slide 12: Geneval: Evaluation Techniques
• Human task-based
  » E.g., referential success
• Human ratings
  » Likert vs. preference judgements; expert vs. non-expert judges
• Automatic metrics based on reference texts
  » BLEU, ROUGE, METEOR, etc. (both automatic families are sketched after this list)
• Automatic metrics without reference texts
  » MT T and X scores, length
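To make the two automatic families concrete, a hedged sketch (my illustration; the forecast texts are invented, not SumTime or Geneval data) of one reference-based metric, BLEU as implemented in NLTK, next to a trivial reference-free one, output length:

```python
# Sketch of the two automatic-metric families on an invented forecast pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "southwesterly 10 to 15 increasing 20 later".split()
system    = "southwest 10 to 15 rising 20 by evening".split()

# Reference-based: BLEU against one (or more) reference forecasts.
smooth = SmoothingFunction().method1   # avoids zero scores on short texts
bleu = sentence_bleu([reference], system, smoothing_function=smooth)

# Reference-free: raw output length, computed from the output alone.
length = len(system)

print(f"BLEU = {bleu:.3f}, length = {length} tokens")
```

Reference-free scores are computed from the output alone, which is precisely why their possible biases need checking against human judgements.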
Slide 13: Geneval: New Techniques
• Would also like to explore and develop new evaluation techniques
  » Post-edit based human evaluations? (one possible measure is sketched below)
  » Automatic metrics which look at semantic features?
  » Open to suggestions for other ideas!
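For the post-edit idea, one plausible measure (my assumption; the slides leave this open) is the token-level edit distance between the system output and an expert's corrected version, normalised by length, so that fewer required edits means a better output. The texts below are invented:

```python
# Hypothetical post-edit measure: normalised token edit distance between
# a system output and its human-corrected version (texts are invented).
from nltk.metrics.distance import edit_distance

system_output = "wind south 10 15 increasing later".split()
post_edited   = "southerly 10 to 15 increasing later".split()

# Fewer edits needed to reach the corrected text -> better output.
effort = edit_distance(system_output, post_edited) / len(post_edited)
print(f"normalised post-edit distance = {effort:.2f}")
```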
Slide 14: Would Like Systems Contributed
• The study would be better if other people contributed systems
  » We supply the data sets and corpora, and carry out the evaluations
  » So you can focus 100% on your great new algorithmic ideas!
Slide 15: Geneval from a STEC Perspective
• Sort of like a STEC?
  » If people contribute systems based on our data sets and corpora
  » But results will be anonymised
    – Only the developer of system X knows how well X did
  » One-off exercise, not repeated
  » Multiple evaluation techniques
• Hope the data sets will reduce barriers to entry
Slide 16: Geneval
• Please let Anja or me know if
  » You have general comments, and/or
  » You have a suggestion for an additional evaluation technique, and/or
  » You might be interested in contributing a system